v1otusc
daily update
Robotics 42
☆ TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System
Large-scale data has driven breakthroughs in robotics, from language models to vision-language-action models in bimanual manipulation. However, humanoid robotics lacks equally effective data collection frameworks. Existing humanoid teleoperation systems either use decoupled control or depend on expensive motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid teleoperation and data collection system that preserves full whole-body control while advancing scalability. Our system leverages PICO4U VR for obtaining real-time whole-body human motions, with a custom 2-DoF robot neck (cost around $250) for egocentric vision, enabling holistic human-to-humanoid control. We demonstrate long-horizon dexterous and mobile humanoid skills and we can collect 100 demonstrations in 15 minutes with an almost 100% success rate. Building on this pipeline, we propose a hierarchical visuomotor policy framework that autonomously controls the full humanoid body based on egocentric vision. Our visuomotor policy successfully demonstrates whole-body dexterous manipulation and dynamic kicking tasks. The entire system is fully reproducible and open-sourced at https://yanjieze.com/TWIST2 . Our collected dataset is also open-sourced at https://twist-data.github.io .
comment: Website: https://yanjieze.com/TWIST2
☆ XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations
Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multi-modal knowledge present in large-scale, heterogeneous datasets. In this work, we present X Robotic Model 1 (XR-1), a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. XR-1 introduces the \emph{Unified Vision-Motion Codes (UVMC)}, a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. UVMC addresses these challenges by (i) serving as an intermediate representation between the observations and actions, and (ii) aligning multimodal dynamic information from heterogeneous data sources to capture complementary knowledge. To effectively exploit UVMC, we propose a three-stage training paradigm: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and (iii) task-specific post-training. We validate XR-1 through extensive real-world experiments with more than 14,000 rollouts on six different robot embodiments, spanning over 120 diverse manipulation tasks. XR-1 consistently outperforms state-of-the-art baselines such as $\pi_{0.5}$, $\pi_0$, RDT, UniVLA, and GR00T-N1.5 while demonstrating strong generalization to novel objects, background variations, distractors, and illumination changes. Our project is at https://xr-1-vla.github.io/.
☆ Non-Contact Manipulation of Induced Magnetic Dipoles
Extending the field of magnetic manipulation to conductive, non-magnetic objects opens the door for a wide array of applications previously limited to hard or soft magnetic materials. Of particular interest is the recycling of space debris through the use of oscillating magnetic fields, which represent a cache of raw materials in an environment particularly suited to the low forces generated from inductive magnetic manipulation. Building upon previous work that demonstrated 3D open-loop position control by leveraging the opposing dipole moment created from induced eddy currents, this work demonstrates closed-loop position control of a semi-buoyant aluminum sphere in lab tests, and the efficacy of varying methods for force inversion is explored. The closed-loop methods represent a critical first step towards wider applications for 3-DOF position control of induced magnetic dipoles.
☆ Many-vs-Many Missile Guidance via Virtual Targets
This paper presents a novel approach to many-vs-many missile guidance using virtual targets (VTs) generated by a Normalizing Flows-based trajectory predictor. Rather than assigning n interceptors directly to m physical targets through conventional weapon target assignment algorithms, we propose a centralized strategy that constructs n VT trajectories representing probabilistic predictions of maneuvering target behavior. Each interceptor is guided toward its assigned VT using Zero-Effort-Miss guidance during midcourse flight, transitioning to Proportional Navigation guidance for terminal interception. This approach treats many-vs-many engagements as many-vs-distribution scenarios, exploiting numerical superiority (n > m) by distributing interceptors across diverse trajectory hypotheses rather than pursuing identical deterministic predictions. Monte Carlo simulations across various target-interceptor configurations (1-6 targets, 1-8 interceptors) demonstrate that the VT method matches or exceeds baseline straight-line prediction performance by 0-4.1% when n = m, with improvements increasing to 5.8-14.4% when n > m. The results confirm that probabilistic VTs enable effective exploitation of numerical superiority, significantly increasing interception probability in many-vs-many scenarios.
comment: will be submitted to Journal of Guidance, Control, and Dynamics as Technical Note
☆ Keeping it Local, Tiny and Real: Automated Report Generation on Edge Computing Devices for Mechatronic-Based Cognitive Systems
Recent advancements in Deep Learning enable hardware-based cognitive systems, that is, mechatronic systems in general and robotics in particular with integrated Artificial Intelligence, to interact with dynamic and unstructured environments. While the results are impressive, the application of such systems to critical tasks like autonomous driving as well as service and care robotics necessitate the evaluation of large amount of heterogeneous data. Automated report generation for Mobile Robotics can play a crucial role in facilitating the evaluation and acceptance of such systems in various domains. In this paper, we propose a pipeline for generating automated reports in natural language utilizing various multi-modal sensors that solely relies on local models capable of being deployed on edge computing devices, thus preserving the privacy of all actors involved and eliminating the need for external services. In particular, we evaluate our implementation on a diverse dataset spanning multiple domains including indoor, outdoor and urban environments, providing quantitative as well as qualitative evaluation results. Various generated example reports and other supplementary materials are available via a public repository.
comment: 6 pages, 4 figures, 1 table; accepted for MECATRONICS-REM 2025 International Conference, PARIS, FRANCE December 3-5 2025
☆ Dexterous Robotic Piano Playing at Scale
Endowing robot hands with human-level dexterity has been a long-standing goal in robotics. Bimanual robotic piano playing represents a particularly challenging task: it is high-dimensional, contact-rich, and requires fast, precise control. We present OmniPianist, the first agent capable of performing nearly one thousand music pieces via scalable, human-demonstration-free learning. Our approach is built on three core components. First, we introduce an automatic fingering strategy based on Optimal Transport (OT), allowing the agent to autonomously discover efficient piano-playing strategies from scratch without demonstrations. Second, we conduct large-scale Reinforcement Learning (RL) by training more than 2,000 agents, each specialized in distinct music pieces, and aggregate their experience into a dataset named RP1M++, consisting of over one million trajectories for robotic piano playing. Finally, we employ a Flow Matching Transformer to leverage RP1M++ through large-scale imitation learning, resulting in the OmniPianist agent capable of performing a wide range of musical pieces. Extensive experiments and ablation studies highlight the effectiveness and scalability of our approach, advancing dexterous robotic piano playing at scale.
☆ From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics
Video Understanding, Scene Interpretation and Commonsense Reasoning are highly challenging tasks enabling the interpretation of visual information, allowing agents to perceive, interact with and make rational decisions in its environment. Large Language Models (LLMs) and Visual Language Models (VLMs) have shown remarkable advancements in these areas in recent years, enabling domain-specific applications as well as zero-shot open vocabulary tasks, combining multiple domains. However, the required computational complexity poses challenges for their application on edge devices and in the context of Mobile Robotics, especially considering the trade-off between accuracy and inference time. In this paper, we investigate the capabilities of state-of-the-art VLMs for the task of Scene Interpretation and Action Recognition, with special regard to small VLMs capable of being deployed to edge devices in the context of Mobile Robotics. The proposed pipeline is evaluated on a diverse dataset consisting of various real-world cityscape, on-campus and indoor scenarios. The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on challenges, weaknesses, inherent model biases and the application of the gained information. Supplementary material is provided via the following repository: https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
comment: 15 pages, 6 figures, 1 table; accepted for AI-2025 Forty-fifth SGAI International Conference on Artificial Intelligence CAMBRIDGE, ENGLAND 16-18 DECEMBER 2025
☆ Synthetic Crop-Weed Image Generation and its Impact on Model Generalization
Precise semantic segmentation of crops and weeds is necessary for agricultural weeding robots. However, training deep learning models requires large annotated datasets, which are costly to obtain in real fields. Synthetic data can reduce this burden, but the gap between simulated and real images remains a challenge. In this paper, we present a pipeline for procedural generation of synthetic crop-weed images using Blender, producing annotated datasets under diverse conditions of plant growth, weed density, lighting, and camera angle. We benchmark several state-of-the-art segmentation models on synthetic and real datasets and analyze their cross-domain generalization. Our results show that training on synthetic images leads to a sim-to-real gap of 10%, surpassing previous state-of-the-art methods. Moreover, synthetic data demonstrates good generalization properties, outperforming real datasets in cross-domain scenarios. These findings highlight the potential of synthetic agricultural datasets and support hybrid strategies for more efficient model training.
☆ Whole-body motion planning and safety-critical control for aerial manipulation
Aerial manipulation combines the maneuverability of multirotors with the dexterity of robotic arms to perform complex tasks in cluttered spaces. Yet planning safe, dynamically feasible trajectories remains difficult due to whole-body collision avoidance and the conservativeness of common geometric abstractions such as bounding boxes or ellipsoids. We present a whole-body motion planning and safety-critical control framework for aerial manipulators built on superquadrics (SQs). Using an SQ-plus-proxy representation, we model both the vehicle and obstacles with differentiable, geometry-accurate surfaces. Leveraging this representation, we introduce a maximum-clearance planner that fuses Voronoi diagrams with an equilibrium-manifold formulation to generate smooth, collision-aware trajectories. We further design a safety-critical controller that jointly enforces thrust limits and collision avoidance via high-order control barrier functions. In simulation, our approach outperforms sampling-based planners in cluttered environments, producing faster, safer, and smoother trajectories and exceeding ellipsoid-based baselines in geometric fidelity. Actual experiments on a physical aerial-manipulation platform confirm feasibility and robustness, demonstrating consistent performance across simulation and hardware settings. The video can be found at https://youtu.be/hQYKwrWf1Ak.
comment: Submitted to 2026 IFAC World Congress with the Journal option (MECHATRONICS)
☆ Cycle-Sync: Robust Global Camera Pose Estimation through Enhanced Cycle-Consistent Synchronization NeurIPS 2025
We introduce Cycle-Sync, a robust and global framework for estimating camera poses (both rotations and locations). Our core innovation is a location solver that adapts message-passing least squares (MPLS) -- originally developed for group synchronization -- to camera location estimation. We modify MPLS to emphasize cycle-consistent information, redefine cycle consistencies using estimated distances from previous iterations, and incorporate a Welsch-type robust loss. We establish the strongest known deterministic exact-recovery guarantee for camera location estimation, showing that cycle consistency alone -- without access to inter-camera distances -- suffices to achieve the lowest sample complexity currently known. To further enhance robustness, we introduce a plug-and-play outlier rejection module inspired by robust subspace recovery, and we fully integrate cycle consistency into MPLS for rotation synchronization. Our global approach avoids the need for bundle adjustment. Experiments on synthetic and real datasets show that Cycle-Sync consistently outperforms leading pose estimators, including full structure-from-motion pipelines with bundle adjustment.
comment: NeurIPS 2025 spotlight paper
☆ ZJUNlict Extended Team Description Paper 2025
This paper presents the ZJUNlict team's work over the past year, covering both hardware and software advancements. In the hardware domain, the integration of an IMU into the v2023 robot was completed to enhance posture accuracy and angular velocity planning. On the software side, key modules were optimized, including the strategy and CUDA modules, with significant improvements in decision making efficiency, ball pursuit prediction, and ball possession prediction to adapt to high-tempo game dynamics.
☆ SuckTac: Camera-based Tactile Sucker for Unstructured Surface Perception and Interaction
Suckers are significant for robots in picking, transferring, manipulation and locomotion on diverse surfaces. However, most of the existing suckers lack high-fidelity perceptual and tactile sensing, which impedes them from resolving the fine-grained geometric features and interaction status of the target surface. This limits their robust performance with irregular objects and in complex, unstructured environments. Inspired by the adaptive structure and high-performance sensory capabilities of cephalopod suckers, in this paper, we propose a novel, intelligent sucker, named SuckTac, that integrates a camera-based tactile sensor directly within its optimized structure to provide high-density perception and robust suction. Specifically, through joint structure design and optimization and based on a multi-material integrated casting technique, a camera and light source are embedded into the sucker, which enables in-situ, high-density perception of fine details like surface shape, texture and roughness. To further enhance robustness and adaptability, the sucker's mechanical design is also optimized by refining its profile, adding a compliant lip, and incorporating surface microstructure. Extensive experiments, including challenging tasks such as robotic cloth manipulation and soft mobile robot inspection, demonstrate the superior performance and broad applicability of the proposed system.
☆ LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation
Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that map language instructions to actions (L2A). However, this one-way paradigm often produces policies that execute tasks without deeper contextual understanding, limiting their ability to generalize or explain their behavior. We argue that the complementary skill of mapping actions back to language (A2L) is essential for developing more holistic grounding. An agent capable of both acting and explaining its actions can form richer internal representations and unlock new paradigms for self-supervised learning. We introduce LACY (Language-Action Cycle), a unified framework that learns such bidirectional mappings within a single vision-language model. LACY is jointly trained on three synergistic tasks: generating parameterized actions from language (L2A), explaining observed actions in language (A2L), and verifying semantic consistency between two language descriptions (L2C). This enables a self-improving cycle that autonomously generates and filters new training data through an active augmentation strategy targeting low-confidence cases, thereby improving the model without additional human labels. Experiments on pick-and-place tasks in both simulation and the real world show that LACY improves task success rates by 56.46% on average and yields more robust language-action grounding for robotic manipulation. Project page: https://vla2026.github.io/LACY/
comment: Preprint. Project page: https://vla2026.github.io/LACY/
☆ A Quantitative Comparison of Centralised and Distributed Reinforcement Learning-Based Control for Soft Robotic Arms
This paper presents a quantitative comparison between centralised and distributed multi-agent reinforcement learning (MARL) architectures for controlling a soft robotic arm modelled as a Cosserat rod in simulation. Using PyElastica and the OpenAI Gym interface, we train both a global Proximal Policy Optimisation (PPO) controller and a Multi-Agent PPO (MAPPO) under identical budgets. Both approaches are based on the arm having $n$ number of controlled sections. The study systematically varies $n$ and evaluates the performance of the arm to reach a fixed target in three scenarios: default baseline condition, recovery from external disturbance, and adaptation to actuator failure. Quantitative metrics used for the evaluation are mean action magnitude, mean final distance, mean episode length, and success rate. The results show that there are no significant benefits of the distributed policy when the number of controlled sections $n\le4$. In very simple systems, when $n\le2$, the centralised policy outperforms the distributed one. When $n$ increases to $4< n\le 12$, the distributed policy shows a high sample efficiency. In these systems, distributed policy promotes a stronger success rate, resilience, and robustness under local observability and yields faster convergence given the same sample size. However, centralised policies achieve much higher time efficiency during training as it takes much less time to train the same size of samples. These findings highlight the trade-offs between centralised and distributed policy in reinforcement learning-based control for soft robotic systems and provide actionable design guidance for future sim-to-real transfer in soft rod-like manipulators.
comment: 7 pages, 4 figures, 2 tables, submitted to RoboSoft 2026
☆ Kinematic and Ergonomic Design of a Robotic Arm for Precision Laparoscopic Surgery
Robotic assistance in minimally invasive surgery can greatly enhance surgical precision and reduce surgeon fatigue. This paper presents a focused investigation on the kinematic and ergonomic design principles for a laparoscopic surgical robotic arm aimed at high-precision tasks. We propose a 7-degree-of-freedom (7-DOF) robotic arm system that incorporates a remote center of motion (RCM) at the instrument insertion point and ergonomic considerations to improve surgeon interaction. The design is implemented on a general-purpose robotic platform, and a series of simulated surgical tasks were performed to evaluate targeting accuracy, task efficiency, and surgeon comfort compared to conventional manual laparoscopy. Experimental results demonstrate that the optimized robotic design achieves significantly improved targeting accuracy (error reduced by over 50%) and shorter task completion times, while substantially lowering operator muscle strain and discomfort. These findings validate the importance of kinematic optimization (such as added articulations and tremor filtering) and human-centered ergonomic design in enhancing the performance of robot-assisted surgery. The insights from this work can guide the development of next-generation surgical robots that improve surgical outcomes and ergonomics for the operating team.
☆ Text to Robotic Assembly of Multi Component Objects using 3D Generative AI and Vision Language Models NeurIPS 2025
Advances in 3D generative AI have enabled the creation of physical objects from text prompts, but challenges remain in creating objects involving multiple component types. We present a pipeline that integrates 3D generative AI with vision-language models (VLMs) to enable the robotic assembly of multi-component objects from natural language. Our method leverages VLMs for zero-shot, multi-modal reasoning about geometry and functionality to decompose AI-generated meshes into multi-component 3D models using predefined structural and panel components. We demonstrate that a VLM is capable of determining which mesh regions need panel components in addition to structural components, based on object functionality. Evaluation across test objects shows that users preferred the VLM-generated assignments 90.6% of the time, compared to 59.4% for rule-based and 2.5% for random assignment. Lastly, the system allows users to refine component assignments through conversational feedback, enabling greater human control and agency in making physical objects with generative AI and robotics.
comment: Accepted to NeurIPS 2025, Conference on Neural Information Processing Systems, Creative AI Track
☆ Census-Based Population Autonomy For Distributed Robotic Teaming
Collaborating teams of robots show promise due in their ability to complete missions more efficiently and with improved robustness, attributes that are particularly useful for systems operating in marine environments. A key issue is how to model, analyze, and design these multi-robot systems to realize the full benefits of collaboration, a challenging task since the domain of multi-robot autonomy encompasses both collective and individual behaviors. This paper introduces a layered model of multi-robot autonomy that uses the principle of census, or a weighted count of the inputs from neighbors, for collective decision-making about teaming, coupled with multi-objective behavior optimization for individual decision-making about actions. The census component is expressed as a nonlinear opinion dynamics model and the multi-objective behavior optimization is accomplished using interval programming. This model can be reduced to recover foundational algorithms in distributed optimization and control, while the full model enables new types of collective behaviors that are useful in real-world scenarios. To illustrate these points, a new method for distributed optimization of subgroup allocation is introduced where robots use a gradient descent algorithm to minimize portions of the cost functions that are locally known, while being influenced by the opinion states from neighbors to account for the unobserved costs. With this method the group can collectively use the information contained in the Hessian matrix of the total global cost. The utility of this model is experimentally validated in three categorically different experiments with fleets of autonomous surface vehicles: an adaptive sampling scenario, a high value unit protection scenario, and a competitive game of capture the flag.
comment: 16 pages, 17 figures
♻ ☆ Using Fiber Optic Bundles to Miniaturize Vision-Based Tactile Sensors
Vision-based tactile sensors have recently become popular due to their combination of low cost, very high spatial resolution, and ease of integration using widely available miniature cameras. The associated field of view and focal length, however, are difficult to package in a human-sized finger. In this paper we employ optical fiber bundles to achieve a form factor that, at 15 mm diameter, is smaller than an average human fingertip. The electronics and camera are also located remotely, further reducing package size. The sensor achieves a spatial resolution of 0.22 mm and a minimum force resolution 5 mN for normal and shear contact forces. With these attributes, the DIGIT Pinki sensor is suitable for applications such as robotic and teleoperated digital palpation. We demonstrate its utility for palpation of the prostate gland and show that it can achieve clinically relevant discrimination of prostate stiffness for phantom and ex vivo tissue.
comment: This work has been submitted to the IEEE for possible publication. The CAD design files of DIGIT Pinki are available at https://github.com/facebookresearch/digit-design
♻ ☆ Tactile Displays Driven by Projected Light
Tactile displays that lend tangible form to digital content could transform computing interactions. However, achieving the resolution, speed, and dynamic range needed for perceptual fidelity remains challenging. We present a tactile display that directly converts projected light into visible tactile patterns via a photomechanical surface populated with millimeter-scale optotactile pixels. The pixels transduce incident light into mechanical displacements through photostimulated thermal gas expansion, yielding millimeter scale displacements with response times of 2 to 100 milliseconds. Employing projected light for power transmission and addressing renders these displays highly scalable. We demonstrate optically driven displays with up to 1,511 addressable pixels -- several times more pixels than any prior tactile display attaining comparable performance. Perceptual studies confirm that these displays can reproduce diverse spatiotemporal tactile patterns with high fidelity. This research establishes a foundation for practical, versatile high-resolution tactile displays driven by light.
♻ ☆ Replicating Human Anatomy with Vision Controlled Jetting -- A Pneumatic Musculoskeletal Hand and Forearm
The functional replication and actuation of complex structures inspired by nature is a longstanding goal for humanity. Creating such complex structures combining soft and rigid features and actuating them with artificial muscles would further our understanding of natural kinematic structures. We printed a biomimetic hand in a single print process comprised of a rigid skeleton, soft joint capsules, tendons, and printed touch sensors. We showed it's actuation using electric motors. In this work, we expand on this work by adding a forearm that is also closely modeled after the human anatomy and replacing the hand's motors with 22 independently controlled pneumatic artificial muscles (PAMs). Our thin, high-strain (up to 30.1%) PAMs match the performance of state-of-the-art artificial muscles at a lower cost. The system showcases human-like dexterity with independent finger movements, demonstrating successful grasping of various objects, ranging from a small, lightweight coin to a large can of 272g in weight. The performance evaluation, based on fingertip and grasping forces along with finger joint range of motion, highlights the system's potential.
♻ ☆ Mobile Robotic Multi-View Photometric Stereo SP
Multi-View Photometric Stereo (MVPS) is a popular method for fine-detailed 3D acquisition of an object from images. Despite its outstanding results on diverse material objects, a typical MVPS experimental setup requires a well-calibrated light source and a monocular camera installed on an immovable base. This restricts the use of MVPS on a movable platform, limiting us from taking MVPS benefits in 3D acquisition for mobile robotics applications. To this end, we introduce a new mobile robotic system for MVPS. While the proposed system brings advantages, it introduces additional algorithmic challenges. Addressing them, in this paper, we further propose an incremental approach for mobile robotic MVPS. Our approach leverages a supervised learning setup to predict per-view surface normal, object depth, and per-pixel uncertainty in model-predicted results. A refined depth map per view is obtained by solving an MVPS-driven optimization problem proposed in this paper. Later, we fuse the refined depth map while tracking the camera pose w.r.t the reference frame to recover globally consistent object 3D geometry. Experimental results show the advantages of our robotic system and algorithm, featuring the local high-frequency surface detail recovery with globally consistent object shape. Our work is beyond any MVPS system yet presented, providing encouraging results on objects with unknown reflectance properties using fewer frames without a tiring calibration and installation process, enabling computationally efficient robotic automation approach to photogrammetry. The proposed approach is nearly 100 times computationally faster than the state-of-the-art MVPS methods such as [1, 2] while maintaining the similar results when tested on subjects taken from the benchmark DiLiGenT MV dataset [3].
comment: Acknowledgment Added. Published at International Society Journal of Photogrammetry and Remote Sensing (ISPRS). 32 pages, 14 Figures, 5 Tables
♻ ☆ FRASA: An End-to-End Reinforcement Learning Agent for Fall Recovery and Stand Up of Humanoid Robots
Humanoid robotics faces significant challenges in achieving stable locomotion and recovering from falls in dynamic environments. Traditional methods, such as Model Predictive Control (MPC) and Key Frame Based (KFB) routines, either require extensive fine-tuning or lack real-time adaptability. This paper introduces FRASA, a Deep Reinforcement Learning (DRL) agent that integrates fall recovery and stand up strategies into a unified framework. Leveraging the Cross-Q algorithm, FRASA significantly reduces training time and offers a versatile recovery strategy that adapts to unpredictable disturbances. Comparative tests on Sigmaban humanoid robots demonstrate FRASA superior performance against the KFB method deployed in the RoboCup 2023 by the Rhoban Team, world champion of the KidSize League.
♻ ☆ Extended Friction Models for the Physics Simulation of Servo Actuators
Accurate physical simulation is crucial for the development and validation of control algorithms in robotic systems. Recent works in Reinforcement Learning (RL) take notably advantage of extensive simulations to produce efficient robot control. State-of-the-art servo actuator models generally fail at capturing the complex friction dynamics of these systems. This limits the transferability of simulated behaviors to real-world applications. In this work, we present extended friction models that allow to more accurately simulate servo actuator dynamics. We propose a comprehensive analysis of various friction models, present a method for identifying model parameters using recorded trajectories from a pendulum test bench, and demonstrate how these models can be integrated into physics engines. The proposed friction models are validated on four distinct servo actuators and tested on 2R manipulators, showing significant improvements in accuracy over the standard Coulomb-Viscous model. Our results highlight the importance of considering advanced friction effects in the simulation of servo actuators to enhance the realism and reliability of robotic simulations.
♻ ☆ Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Recently, augmenting vision-language-action models (VLAs) with world-models has shown promise in robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, we propose training techniques such as independent noise perturbations for each modality and a decoupled flow matching loss, which enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Furthermore, based on the decoupled training framework, we introduce a sampling method where we sample action and vision tokens asynchronously at different rates, which shows improvement through inference-time scaling. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods, with our inference-time scaling approach providing an additional 2-5% gain on success rate. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%, confirming its effectiveness beyond simulation. Lastly, we demonstrate the effectiveness of DUST in large-scale pretraining with action-free videos from BridgeV2, where DUST leads to significant gain when transferred to the RoboCasa benchmark.
comment: 20 pages, 10 figures
♻ ☆ Virtual Target Trajectory Prediction for Stochastic Targets
Trajectory prediction of aerial vehicles is a key requirement in applications ranging from missile guidance to UAV collision avoidance. While most prediction methods assume deterministic target motion, real-world targets often exhibit stochastic behaviors such as evasive maneuvers or random gliding patterns. This paper introduces a probabilistic framework based on Conditional Normalizing Flows (CNFs) to model and predict such stochastic dynamics directly from trajectory data. The learned model generates probability distributions of future target positions conditioned on initial states and dynamic parameters, enabling efficient sampling and exact density evaluation. To provide deterministic surrogates compatible with existing guidance and planning algorithms, sampled trajectories are clustered using a time series k-means approach, yielding a set of representative "virtual target" trajectories. The method is target-agnostic, computationally efficient, and requires only trajectory data for training, making it suitable as a drop-in replacement for deterministic predictors. Simulated scenarios with maneuvering and ballistic targets demonstrate that the proposed approach bridges the gap between deterministic assumptions and stochastic reality, advancing guidance and control algorithms for autonomous vehicles.
comment: Manuscript accepted by Journal of Guidance, Control, and Dynamics
♻ ☆ Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.
comment: https://genie-envisioner.github.io/
♻ ☆ Radar-Based Odometry for Low-Speed Driving
We address automotive odometry for low-speed driving and parking, where centimeter-level accuracy is required due to tight spaces and nearby obstacles. Traditional methods using inertial-measurement units and wheel encoders require vehicle-specific calibration, making them costly for consumer-grade vehicles. To overcome this, we propose a radar-based simultaneous localization and mapping (SLAM) approach that fuses inertial and 4D radar measurements. Our approach tightly couples feature positions and Doppler velocities for accurate localization and robust data association. Key contributions include a tightly coupled radar-Doppler extended Kalman filter, multi-radar support and an information-based feature-pruning strategy. Experiments using both proprietary and public datasets demonstrate high-accuracy localization during low-speed driving.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ No Plan but Everything Under Control: Robustly Solving Sequential Tasks with Dynamically Composed Gradient Descent ICRA25
We introduce a novel gradient-based approach for solving sequential tasks by dynamically adjusting the underlying myopic potential field in response to feedback and the world's regularities. This adjustment implicitly considers subgoals encoded in these regularities, enabling the solution of long sequential tasks, as demonstrated by solving the traditional planning domain of Blocks World - without any planning. Unlike conventional planning methods, our feedback-driven approach adapts to uncertain and dynamic environments, as demonstrated by one hundred real-world trials involving drawer manipulation. These experiments highlight the robustness of our method compared to planning and show how interactive perception and error recovery naturally emerge from gradient descent without explicitly implementing them. This offers a computationally efficient alternative to planning for a variety of sequential tasks, while aligning with observations on biological problem-solving strategies.
comment: Accepted at ICRA25; Supplementary Material under https://www.tu.berlin/robotics/papers/noplan ; 7 pages + 6 figures;
♻ ☆ UniCoD: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning
Building generalist robot policies that can handle diverse tasks in open-ended environments is a central challenge in robotics. To leverage knowledge from large-scale pretraining, prior work (VLA) has typically built generalist policies either on top of vision-language understanding models (VLMs) or generative models. However, both semantic understanding from vision-language pretraining and visual dynamics modeling from visual-generation pretraining are crucial for embodied robots. Recent unified models of generation and understanding have demonstrated strong capabilities in both comprehension and generation through large-scale pretraining. We posit that robotic policy learning can likewise benefit from the combined strengths of understanding, planning, and continuous future representation learning. Building on this insight, we introduce UniCoD, which acquires the ability to dynamically model high-dimensional visual features through pretraining on over 1M internet-scale instructional manipulation videos. Subsequently, UniCoD is fine-tuned on data collected from the robot embodiment, enabling the learning of mappings from predictive representations to action tokens. Extensive experiments show our approach consistently outperforms baseline methods in terms of 9\% and 12\% across simulation environments and real-world out-of-distribution tasks.
♻ ☆ Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which extremely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.
comment: Accepted by IEEE Transactions on Neural Networks and Learning Systems
♻ ☆ Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
A deep understanding of kinematic structures and movable components is essential for enabling robots to manipulate objects and model their own articulated forms. Such understanding is captured through articulated objects, which are essential for tasks such as physical simulation, motion planning, and policy learning. However, creating these models, particularly for objects with high degrees of freedom (DoF), remains a significant challenge. Existing methods typically rely on motion sequences or strong assumptions from hand-curated datasets, which hinders scalability. In this paper, we introduce Kinematify, an automated framework that synthesizes articulated objects directly from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static geometry. To achieve this, we combine MCTS search for structural inference with geometry-driven optimization for joint reasoning, producing physically consistent and functionally valid descriptions. We evaluate Kinematify on diverse inputs from both synthetic and real-world environments, demonstrating improvements in registration and kinematic topology accuracy over prior work.
comment: project page: https://sites.google.com/deemos.com/kinematify
♻ ☆ Generative World Models of Tasks: LLM-Driven Hierarchical Scaffolding for Embodied Agents NeurIPS 2025
Recent advances in agent development have focused on scaling model size and raw interaction data, mirroring successes in large language models. However, for complex, long-horizon multi-agent tasks such as robotic soccer, this end-to-end approach often fails due to intractable exploration spaces and sparse rewards. We propose that an effective world model for decision-making must model the world's physics and also its task semantics. A systematic review of 2024 research in low-resource multi-agent soccer reveals a clear trend towards integrating symbolic and hierarchical methods, such as Hierarchical Task Networks (HTNs) and Bayesian Strategy Networks (BSNs), with multi-agent reinforcement learning (MARL). These methods decompose complex goals into manageable subgoals, creating an intrinsic curriculum that shapes agent learning. We formalize this trend into a framework for Hierarchical Task Environments (HTEs), which are essential for bridging the gap between simple, reactive behaviors and sophisticated, strategic team play. Our framework incorporates the use of Large Language Models (LLMs) as generative world models of tasks, capable of dynamically generating this scaffolding. We argue that HTEs provide a mechanism to guide exploration, generate meaningful learning signals, and train agents to internalize hierarchical structure, enabling the development of more capable and general-purpose agents with greater sample efficiency than purely end-to-end approaches.
comment: In the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Embodied World Models for Decision Making (EWM)
♻ ☆ Neural Network Aided Kalman Filtering with Model Predictive Control Enables Robot-Assisted Drone Recovery on a Wavy Surface
Recovering a drone on a disturbed water surface remains a significant challenge in maritime robotics. In this paper, we propose a unified framework for robot-assisted drone recovery on a wavy surface that addresses two major tasks: Firstly, accurate prediction of a moving drone's position under wave-induced disturbances using KalmanNet Plus Plus (KalmanNet++), a Neural Network Aided Kalman Filtering we proposed. Secondly, effective motion planning using the desired position we got for a manipulator via Receding Horizon Model Predictive Control (RHMPC). Specifically, we compared multiple prediction methods and proposed KalmanNet Plus Plus to predict the position of the UAV, thereby obtaining the desired position. The KalmanNet++ predicts the drone's future position 0.1\,s ahead, while the manipulator plans a capture trajectory in real time, thus overcoming not only wave-induced base motions but also limited constraints such as torque constraints and joint constraints. For the system design, we provide a collaborative system, comprising a manipulator subsystem and a UAV subsystem, enables drone lifting and drone recovery. Simulation and real-world experiments using wave-disturbed motion data demonstrate that our approach achieves a high success rate - above 95\% and outperforms conventional baseline methods by up to 10\% in efficiency and 20\% in precision. The results underscore the feasibility and robustness of our system, which achieves state-of-the-art performance and offers a practical solution for maritime drone operations.
comment: 17 pages, 51 figures
♻ ☆ Light Future: Multimodal Action Frame Prediction via InstructPix2Pix WACV 2026
Predicting future motion trajectories is a critical capability across domains such as robotics, autonomous systems, and human activity forecasting, enabling safer and more intelligent decision-making. This paper proposes a novel, efficient, and lightweight approach for robot action prediction, offering significantly reduced computational cost and inference latency compared to conventional video prediction models. Importantly, it pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks, extending its utility beyond static image editing. We implement a deep learning-based visual prediction framework that forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. We repurpose and fine-tune the InstructPix2Pix model to accept both visual and textual inputs, enabling multimodal future frame prediction. Experiments on the RoboTWin dataset (generated based on real-world scenarios) demonstrate that our method achieves superior SSIM and PSNR compared to state-of-the-art baselines in robot action prediction tasks. Unlike conventional video prediction models that require multiple input frames, heavy computation, and slow inference latency, our approach only needs a single image and a text prompt as input. This lightweight design enables faster inference, reduced GPU demands, and flexible multimodal control, particularly valuable for applications like robotics and sports motion trajectory analytics, where motion trajectory precision is prioritized over visual fidelity.
comment: 9 pages including appendix, 4 tables, 8 figures, to be submitted to WACV 2026
♻ ☆ Grounded Vision-Language Interpreter for Integrated Task and Motion Planning
While recent advances in vision-language models have accelerated the development of language-guided robot planners, their black-box nature often lacks safety guarantees and interpretability crucial for real-world deployment. Conversely, classical symbolic planners offer rigorous safety verification but require significant expert knowledge for setup. To bridge the current gap, this paper proposes ViLaIn-TAMP, a hybrid planning framework for enabling verifiable, interpretable, and autonomous robot behaviors. ViLaIn-TAMP comprises three main components: (1) a Vision-Language Interpreter (ViLaIn) adapted from previous work that converts multimodal inputs into structured problem specifications, (2) a modular Task and Motion Planning (TAMP) system that grounds these specifications in actionable trajectory sequences through symbolic and geometric constraint reasoning, and (3) a corrective planning (CP) module which receives concrete feedback on failed solution attempts and feed them with constraints back to ViLaIn to refine the specification. We design challenging manipulation tasks in a cooking domain and evaluate our framework. Experimental results demonstrate that ViLaIn-TAMP outperforms a VLM-as-a-planner baseline by 18% in mean success rate, and that adding the CP module boosts mean success rate by 32%.
comment: Project website: https://omron-sinicx.github.io/ViLaIn-TAMP/
♻ ☆ Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras NeurIPS 2025
Event cameras offer microsecond-level latency and robustness to motion blur, making them ideal for understanding dynamic environments. Yet, connecting these asynchronous streams to human language remains an open challenge. We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, we provide over 30,000 validated referring expressions, each enriched with four grounding attributes -- appearance, status, relation to viewer, and relation to other objects -- bridging spatial, temporal, and relational reasoning. To fully exploit these cues, we propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). Our method adapts to different modalities and scene dynamics, achieving consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings. We hope our dataset and approach will establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy.
comment: NeurIPS 2025 Spotlight; 43 pages, 17 figures, 16 tables; Project Page at https://talk2event.github.io
♻ ☆ Towards Predicting Any Human Trajectory In Context NeurIPS 2025
Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, the need to fine-tune for each new scenario is often impractical for deployment on edge devices. To address this challenge, we introduce TrajICL, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables adaptation without fine-tuning on the scenario-specific data at inference time without requiring weight updates. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. Project Page: https://fujiry0.github.io/TrajICL-project-page/.
comment: NeurIPS 2025
♻ ☆ Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework
Bimanual robotic manipulation is an emerging and critical topic in the robotics community. Previous works primarily rely on integrated control models that take the perceptions and states of both arms as inputs to directly predict their actions. However, we think bimanual manipulation involves not only coordinated tasks but also various uncoordinated tasks that do not require explicit cooperation during execution, such as grasping objects with the closest hand, which integrated control frameworks ignore to consider due to their enforced cooperation in the early inputs. In this paper, we propose a novel decoupled interaction framework that considers the characteristics of different tasks in bimanual manipulation. The key insight of our framework is to assign an independent model to each arm to enhance the learning of uncoordinated tasks, while introducing a selective interaction module that adaptively learns weights from its own arm to improve the learning of coordinated tasks. Extensive experiments on seven tasks in the RoboTwin dataset demonstrate that: (1) Our framework achieves outstanding performance, with a 23.5% boost over the SOTA method. (2) Our framework is flexible and can be seamlessly integrated into existing methods. (3) Our framework can be effectively extended to multi-agent manipulation tasks, achieving a 28% boost over the integrated control SOTA. (4) The performance boost stems from the decoupled design itself, surpassing the SOTA by 16.5% in success rate with only 1/6 of the model size.
comment: 15 pages, 8 figures
♻ ☆ DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but often fail to generalize to long-tail scenarios due to the lack of essential world knowledge to understand and reason about surrounding environments. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module directly generating semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. The experiment on the ICCV 2025 Autonomous Grand Challenge leaderboard shows that DiffVLA++ achieves EPDMS of 49.12.
♻ ☆ RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation
Recently, robotics has advanced significantly through the integration of larger models and large-scale datasets. However, challenges remain in applying these models to 3D spatial interactions and managing data collection costs. To address these issues, we propose the multimodal robotic manipulation model RoboTron-Mani and the comprehensive dataset RoboData. RoboTron-Mani, on one hand, enhances 3D perception through camera parameters and occupancy supervision. On the other hand, it further incorporates Modality-Isolation-Mask and multimodal decoder blocks based on OpenFlamingo, improving modality fusion and fine-grained perception. RoboData integrats several publicly-available datasets, achieving the first fusion of multi-view images, camera parameters, depth maps, actions, and space alignment, which facilitates comprehensive learning from diverse robotic datasets and offers one complete evaluation system. Trained on RoboData, RoboTron-Mani is the first generalist policy that surpasses expert models, enabling simultaneous evaluation of all tasks across multiple datasets, rather than being limited to specific data or task selections. Specifically, RoboTron-Mani boosts manipulation performance by increasing the average sequence length on CALVIN from 1.7 to 3.5, enabling cross-embodiment generalization, and achieving state-of-the-art results on both simulated and real-world datasets.
♻ ☆ Adv-BMT: Bidirectional Motion Transformer for Safety-Critical Traffic Scenario Generation
Scenario-based testing is essential for validating the performance of autonomous driving (AD) systems. However, such testing is limited by the scarcity of long-tailed, safety-critical scenarios in existing datasets collected in the real world. To tackle the data issue, we propose the Adv-BMT framework, which augments real-world scenarios with diverse and realistic adversarial traffic interactions. The core component of Adv-BMT is a bidirectional motion transformer (BMT) model to perform inverse traffic motion predictions, which takes agent information in the last time step of the scenario as input, and reconstructs the traffic in the inverse of chronological order until the initial time step. The Adv-BMT framework is a two-staged pipeline: it first conducts adversarial initializations and then inverse motion predictions. Different from previous work, we do not need any collision data for pretraining, and are able to generate realistic and diverse collision interactions. Our experimental results validate the quality of generated collision scenarios by Adv-BMT: training in our augmented dataset would reduce episode collision rates by 20%. Demo and code are available at: https://metadriverse.github.io/adv-bmt/.
♻ ☆ MetAdv: A Unified and Interactive Adversarial Testing Platform for Autonomous Driving ACM MM 2025
Evaluating and ensuring the adversarial robustness of autonomous driving (AD) systems is a critical and unresolved challenge. This paper introduces MetAdv, a novel adversarial testing platform that enables realistic, dynamic, and interactive evaluation by tightly integrating virtual simulation with physical vehicle feedback. At its core, MetAdv establishes a hybrid virtual-physical sandbox, within which we design a three-layer closed-loop testing environment with dynamic adversarial test evolution. This architecture facilitates end-to-end adversarial evaluation, ranging from high-level unified adversarial generation, through mid-level simulation-based interaction, to low-level execution on physical vehicles. Additionally, MetAdv supports a broad spectrum of AD tasks, algorithmic paradigms (e.g., modular deep learning pipelines, end-to-end learning, vision-language models). It supports flexible 3D vehicle modeling and seamless transitions between simulated and physical environments, with built-in compatibility for commercial platforms such as Apollo and Tesla. A key feature of MetAdv is its human-in-the-loop capability: besides flexible environmental configuration for more customized evaluation, it enables real-time capture of physiological signals and behavioral feedback from drivers, offering new insights into human-machine trust under adversarial conditions. We believe MetAdv can offer a scalable and unified framework for adversarial assessment, paving the way for safer AD.
comment: ACM MM 2025 Most Popular Demo Award
Computer Vision and Pattern Recognition 127
☆ TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System
Large-scale data has driven breakthroughs in robotics, from language models to vision-language-action models in bimanual manipulation. However, humanoid robotics lacks equally effective data collection frameworks. Existing humanoid teleoperation systems either use decoupled control or depend on expensive motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid teleoperation and data collection system that preserves full whole-body control while advancing scalability. Our system leverages PICO4U VR for obtaining real-time whole-body human motions, with a custom 2-DoF robot neck (cost around $250) for egocentric vision, enabling holistic human-to-humanoid control. We demonstrate long-horizon dexterous and mobile humanoid skills and we can collect 100 demonstrations in 15 minutes with an almost 100% success rate. Building on this pipeline, we propose a hierarchical visuomotor policy framework that autonomously controls the full humanoid body based on egocentric vision. Our visuomotor policy successfully demonstrates whole-body dexterous manipulation and dynamic kicking tasks. The entire system is fully reproducible and open-sourced at https://yanjieze.com/TWIST2 . Our collected dataset is also open-sourced at https://twist-data.github.io .
comment: Website: https://yanjieze.com/TWIST2
☆ Densemarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks
We propose DenseMarks - a new learned representation for human heads, enabling high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. In order to train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking heads videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmarks and segmentation constraints, as well as imposing spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head tracking, and stereo reconstruction. Due to the strong supervision, our method is robust to pose variations and covers the entire head, including hair. Additionally, the canonical space bottleneck makes sure the obtained representations are consistent across diverse poses and individuals. We demonstrate state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models. The code and the model checkpoint will be made available to the public.
comment: Project page: https://diddone.github.io/densemarks/ .Video: https://youtu.be/o8DOOYFW0gI .21 pages, 13 figures, 2 tables
☆ PLUTO-4: Frontier Pathology Foundation Models
Foundation models trained on large-scale pathology image corpora have demonstrated strong transfer capabilities across diverse histopathology tasks. Building on this progress, we introduce PLUTO-4, our next generation of pathology foundation models that extend the Pathology-Universal Transformer (PLUTO) to frontier scale. We share two complementary Vision Transformer architectures in the PLUTO-4 family: a compact and efficient PLUTO-4S model optimized for multi-scale deployment using a FlexiViT setup with 2D-RoPE embeddings, and a frontier-scale PLUTO-4G model trained with a single patch size to maximize representation capacity and stability. Both models are pretrained using a self-supervised objective derived from DINOv2 on a large multi-institutional corpus containing 551,164 WSIs from 137,144 patients across over 50 institutions, spanning over 60 disease types and over 100 stains. Comprehensive evaluation across public and internal benchmarks demonstrates that PLUTO-4 achieves state-of-the-art performance on tasks requiring varying spatial and biological context, including patch-level classification, segmentation, and slide-level diagnosis. The compact PLUTO-4S provides high-throughput and robust performance for practical deployment, while PLUTO-4G establishes new performance frontiers across multiple pathology benchmarks, including an 11% improvement in dermatopathology diagnosis. These diverse improvements underscore PLUTO-4's potential to transform real-world applications as a backbone for translational research and diagnostic use cases.
☆ AI-Generated Image Detection: An Empirical Study and Future Research Directions
The threats posed by AI-generated media, particularly deepfakes, are now raising significant challenges for multimedia forensics, misinformation detection, and biometric system resulting in erosion of public trust in the legal system, significant increase in frauds, and social engineering attacks. Although several forensic methods have been proposed, they suffer from three critical gaps: (i) use of non-standardized benchmarks with GAN- or diffusion-generated images, (ii) inconsistent training protocols (e.g., scratch, frozen, fine-tuning), and (iii) limited evaluation metrics that fail to capture generalization and explainability. These limitations hinder fair comparison, obscure true robustness, and restrict deployment in security-critical applications. This paper introduces a unified benchmarking framework for systematic evaluation of forensic methods under controlled and reproducible conditions. We benchmark ten SoTA forensic methods (scratch, frozen, and fine-tuned) and seven publicly available datasets (GAN and diffusion) to perform extensive and systematic evaluations. We evaluate performance using multiple metrics, including accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity. We also further analyze model interpretability using confidence curves and Grad-CAM heatmaps. Our evaluations demonstrate substantial variability in generalization, with certain methods exhibiting strong in-distribution performance but degraded cross-model transferability. This study aims to guide the research community toward a deeper understanding of the strengths and limitations of current forensic approaches, and to inspire the development of more robust, generalizable, and explainable solutions.
☆ When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To solve this, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high-quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.
comment: 28 pages, 15 figures
☆ VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.
comment: Project page: https://csu-jpg.github.io/VCode Github: https://github.com/CSU-JPG/VCode
☆ PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
We present PercHead, a method for single-image 3D head reconstruction and semantic 3D editing - two tasks that are inherently challenging due to severe view occlusions, weak perceptual supervision, and the ambiguity of editing in 3D space. We develop a unified base model for reconstructing view-consistent 3D heads from a single input image. The model employs a dual-branch encoder followed by a ViT-based decoder that lifts 2D features into 3D space through iterative cross-attention. Rendering is performed using Gaussian Splatting. At the heart of our approach is a novel perceptual supervision strategy based on DINOv2 and SAM2.1, which provides rich, generalized signals for both geometric and appearance fidelity. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles compared to established baselines. Furthermore, this base model can be seamlessly extended for semantic 3D editing by swapping the encoder and finetuning the network. In this variant, we disentangle geometry and style through two distinct input modalities: a segmentation map to control geometry and either a text prompt or a reference image to specify appearance. We highlight the intuitive and powerful 3D editing capabilities of our model through a lightweight, interactive GUI, where users can effortlessly sculpt geometry by drawing segmentation maps and stylize appearance via natural language or image prompts. Project Page: https://antoniooroz.github.io/PercHead Video: https://www.youtube.com/watch?v=4hFybgTk4kE
comment: Project Page: https://antoniooroz.github.io/PercHead/ Video: https://www.youtube.com/watch?v=4hFybgTk4kE
☆ Dynamic Reflections: Probing Video Representations with Text Alignment
The alignment of representations from different modalities has recently been shown to provide insights on the structural similarities and downstream capabilities of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment highly depends on the richness of both visual (static images vs. multi-frame videos) and text (single caption vs. a collection) data provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations. Secondly, we investigate the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, providing initial evidence that strong alignment against text encoders may be linked to general-purpose video representation and understanding. Finally, we correlate temporal reasoning with cross-modal alignment providing a challenging test-bed for vision and language models. Overall, our work introduces video-text alignment as an informative zero-shot way to probe the representation power of different encoders for spatio-temporal data. Project page can be found at https://video-prh.github.io/
comment: 21 pages, 12 figures
☆ LLEXICORP: End-user Explainability of Convolutional Neural Networks
Convolutional neural networks (CNNs) underpin many modern computer vision systems. With applications ranging from common to critical areas, a need to explain and understand the model and its decisions (XAI) emerged. Prior works suggest that in the top layers of CNNs, the individual channels can be attributed to classifying human-understandable concepts. Concept relevance propagation (CRP) methods can backtrack predictions to these channels and find images that most activate these channels. However, current CRP workflows are largely manual: experts must inspect activation images to name the discovered concepts and must synthesize verbose explanations from relevance maps, limiting the accessibility of the explanations and their scalability. To address these issues, we introduce Large Language model EXplaIns COncept Relevance Propagation (LLEXICORP), a modular pipeline that couples CRP with a multimodal large language model. Our approach automatically assigns descriptive names to concept prototypes and generates natural-language explanations that translate quantitative relevance distributions into intuitive narratives. To ensure faithfulness, we craft prompts that teach the language model the semantics of CRP through examples and enforce a separation between naming and explanation tasks. The resulting text can be tailored to different audiences, offering low-level technical descriptions for experts and high-level summaries for non-technical stakeholders. We qualitatively evaluate our method on various images from ImageNet on a VGG16 model. Our findings suggest that integrating concept-based attribution methods with large language models can significantly lower the barrier to interpreting deep neural networks, paving the way for more transparent AI systems.
☆ An unscented Kalman filter method for real time input-parameter-state estimation
The input-parameter-state estimation capabilities of a novel unscented Kalman filter is examined herein on both linear and nonlinear systems. The unknown input is estimated in two stages within each time step. Firstly, the predicted dynamic states and the system parameters provide an estimation of the input. Secondly, the corrected with measurements states and parameters provide a final estimation. Importantly, it is demonstrated using the perturbation analysis that, a system with at least a zero or a non-zero known input can potentially be uniquely identified. This output-only methodology allows for a better understanding of the system compared to classical output-only parameter identification strategies, given that all the dynamic states, the parameters, and the input are estimated jointly and in real-time.
comment: author-accepted manuscript (AAM) published in Mechanical Systems and Signal Processing
☆ VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models
Understanding and predicting emotion from videos has gathered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges. Emotions are characterized by dynamic and cues-dependent properties, making it difficult to understand complex and evolving emotional states with reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce a emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.
comment: 41 pages, 26 figures
☆ Modality-Transition Representation Learning for Visible-Infrared Person Re-Identification
Visible-infrared person re-identification (VI-ReID) technique could associate the pedestrian images across visible and infrared modalities in the practical scenarios of background illumination changes. However, a substantial gap inherently exists between these two modalities. Besides, existing methods primarily rely on intermediate representations to align cross-modal features of the same person. The intermediate feature representations are usually create by generating intermediate images (kind of data enhancement), or fusing intermediate features (more parameters, lack of interpretability), and they do not make good use of the intermediate features. Thus, we propose a novel VI-ReID framework via Modality-Transition Representation Learning (MTRL) with a middle generated image as a transmitter from visible to infrared modals, which are fully aligned with the original visible images and similar to the infrared modality. After that, using a modality-transition contrastive loss and a modality-query regularization loss for training, which could align the cross-modal features more effectively. Notably, our proposed framework does not need any additional parameters, which achieves the same inference speed to the backbone while improving its performance on VI-ReID task. Extensive experimental results illustrate that our model significantly and consistently outperforms existing SOTAs on three typical VI-ReID datasets.
☆ Differentiable Hierarchical Visual Tokenization NeurIPS 2025
Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.
comment: NeurIPS 2025 Spotlight
☆ Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models
Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method consistently outperforms others across scenarios, (3) pruning sensitivity varies significantly across tasks, with OCR being most vulnerable, and (4) pruning ratio is the dominant factor governing performance degradation. We believe UniPruneBench will serve as a reliable foundation for future research on efficient multimodal modeling.
☆ Robust Face Liveness Detection for Biometric Authentication using Single Image
Biometric technologies are widely adopted in security, legal, and financial systems. Face recognition can authenticate a person based on the unique facial features such as shape and texture. However, recent works have demonstrated the vulnerability of Face Recognition Systems (FRS) towards presentation attacks. Using spoofing (aka.,presentation attacks), a malicious actor can get illegitimate access to secure systems. This paper proposes a novel light-weight CNN framework to identify print/display, video and wrap attacks. The proposed robust architecture provides seamless liveness detection ensuring faster biometric authentication (1-2 seconds on CPU). Further, this also presents a newly created 2D spoof attack dataset consisting of more than 500 videos collected from 60 subjects. To validate the effectiveness of this architecture, we provide a demonstration video depicting print/display, video and wrap attack detection approaches. The demo can be viewed in the following link: https://rak.box.com/s/m1uf31fn5amtjp4mkgf1huh4ykfeibaa
☆ UniChange: Unifying Change Detection with Multimodal Large Language Model
Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at https://github.com/Erxucomeon/UniChange.
☆ Zero-Shot Multi-Animal Tracking in the Wild
Multi-animal tracking is crucial for understanding animal ecology and behavior. However, it remains a challenging task due to variations in habitat, motion patterns, and species appearance. Traditional approaches typically require extensive model fine-tuning and heuristic design for each application scenario. In this work, we explore the potential of recent vision foundation models for zero-shot multi-animal tracking. By combining a Grounding Dino object detector with the Segment Anything Model 2 (SAM 2) tracker and carefully designed heuristics, we develop a tracking framework that can be applied to new datasets without any retraining or hyperparameter adaptation. Evaluations on ChimpAct, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40 demonstrate strong and consistent performance across diverse species and environments. The code is available at https://github.com/ecker-lab/SAM2-Animal-Tracking.
☆ TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.
comment: 13 pages, 8 figures, 3 tables. The first two authors contributed equally. Project Page: https://iyatomilab.github.io/TAUE
☆ Resource-efficient Automatic Refinement of Segmentations via Weak Supervision from Light Feedback
Delineating anatomical regions is a key task in medical image analysis. Manual segmentation achieves high accuracy but is labor-intensive and prone to variability, thus prompting the development of automated approaches. Recently, a breadth of foundation models has enabled automated segmentations across diverse anatomies and imaging modalities, but these may not always meet the clinical accuracy standards. While segmentation refinement strategies can improve performance, current methods depend on heavy user interactions or require fully supervised segmentations for training. Here, we present SCORE (Segmentation COrrection from Regional Evaluations), a weakly supervised framework that learns to refine mask predictions only using light feedback during training. Specifically, instead of relying on dense training image annotations, SCORE introduces a novel loss that leverages region-wise quality scores and over/under-segmentation error labels. We demonstrate SCORE on humerus CT scans, where it considerably improves initial predictions from TotalSegmentator, and achieves performance on par with existing refinement methods, while greatly reducing their supervision requirements and annotation time. Our code is available at: https://gitlab.inria.fr/adelangl/SCORE.
☆ A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding
Subject-agnostic brain decoding, which aims to reconstruct continuous visual experiences from fMRI without subject-specific training, holds great potential for clinical applications. However, this direction remains underexplored due to challenges in cross-subject generalization and the complex nature of brain signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. By disentangling and leveraging features from early visual cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary cognitive information essential for visual reconstruction. Furthermore, we introduce a feature-level contrastive learning strategy to enhance the extraction of subject-invariant semantic representations, thereby enhancing subject-agnostic applicability to previously unseen subjects. Unlike conventional pipelines that need more than 12 hours of per-subject data and heavy computation, VCFlow sacrifices only 7\% accuracy on average yet generates each reconstructed video in 10 seconds without any retraining, offering a fast and clinically scalable solution. The source code will be released upon acceptance of the paper.
comment: 9 pages main text with 6 figures (excluding references), supplementary material included
☆ Seeing Across Time and Views: Multi-Temporal Cross-View Learning for Robust Video Person Re-Identification
Video-based person re-identification (ReID) in cross-view domains (for example, aerial-ground surveillance) remains an open problem because of extreme viewpoint shifts, scale disparities, and temporal inconsistencies. To address these challenges, we propose MTF-CVReID, a parameter-efficient framework that introduces seven complementary modules over a ViT-B/16 backbone. Specifically, we include: (1) Cross-Stream Feature Normalization (CSFN) to correct camera and view biases; (2) Multi-Resolution Feature Harmonization (MRFH) for scale stabilization across altitudes; (3) Identity-Aware Memory Module (IAMM) to reinforce persistent identity traits; (4) Temporal Dynamics Modeling (TDM) for motion-aware short-term temporal encoding; (5) Inter-View Feature Alignment (IVFA) for perspective-invariant representation alignment; (6) Hierarchical Temporal Pattern Learning (HTPL) to capture multi-scale temporal regularities; and (7) Multi-View Identity Consistency Learning (MVICL) that enforces cross-view identity coherence using a contrastive learning paradigm. Despite adding only about 2 million parameters and 0.7 GFLOPs over the baseline, MTF-CVReID maintains real-time efficiency (189 FPS) and achieves state-of-the-art performance on the AG-VPReID benchmark across all altitude levels, with strong cross-dataset generalization to G2A-VReID and MARS datasets. These results show that carefully designed adapter-based modules can substantially enhance cross-view robustness and temporal consistency without compromising computational efficiency. The source code is available at https://github.com/MdRashidunnabi/MTF-CVReID
☆ The Urban Vision Hackathon Dataset and Models: Towards Image Annotations and Accurate Vision Models for Indian Traffic
This report describes the UVH-26 dataset, the first public release by AIM@IISc of a large-scale dataset of annotated traffic-camera images from India. The dataset comprises 26,646 high-resolution (1080p) images sampled from 2800 Bengaluru's Safe-City CCTV cameras over a 4-week period, and subsequently annotated through a crowdsourced hackathon involving 565 college students from across India. In total, 1.8 million bounding boxes were labeled across 14 vehicle classes specific to India: Cycle, 2-Wheeler (Motorcycle), 3-Wheeler (Auto-rickshaw), LCV (Light Commercial Vehicles), Van, Tempo-traveller, Hatchback, Sedan, SUV, MUV, Mini-bus, Bus, Truck and Other. Of these, 283k-316k consensus ground truth bounding boxes and labels were derived for distinct objects in the 26k images using Majority Voting and STAPLE algorithms. Further, we train multiple contemporary detectors, including YOLO11-S/X, RT-DETR-S/X, and DAMO-YOLO-T/L using these datasets, and report accuracy based on mAP50, mAP75 and mAP50:95. Models trained on UVH-26 achieve 8.4-31.5% improvements in mAP50:95 over equivalent baseline models trained on COCO dataset, with RT-DETR-X showing the best performance at 0.67 (mAP50:95) as compared to 0.40 for COCO-trained weights for common classes (Car, Bus, and Truck). This demonstrates the benefits of domain-specific training data for Indian traffic scenarios. The release package provides the 26k images with consensus annotations based on Majority Voting (UVH-26-MV) and STAPLE (UVH-26-ST) and the 6 fine-tuned YOLO and DETR models on each of these datasets. By capturing the heterogeneity of Indian urban mobility directly from operational traffic-camera streams, UVH-26 addresses a critical gap in existing global benchmarks, and offers a foundation for advancing detection, classification, and deployment of intelligent transportation systems in emerging nations with complex traffic conditions.
☆ SigmaCollab: An Application-Driven Dataset for Physically Situated Collaboration
We introduce SigmaCollab, a dataset enabling research on physically situated human-AI collaboration. The dataset consists of a set of 85 sessions in which untrained participants were guided by a mixed-reality assistive AI agent in performing procedural tasks in the physical world. SigmaCollab includes a set of rich, multimodal data streams, such as the participant and system audio, egocentric camera views from the head-mounted device, depth maps, head, hand and gaze tracking information, as well as additional annotations performed post-hoc. While the dataset is relatively small in size (~ 14 hours), its application-driven and interactive nature brings to the fore novel research challenges for human-AI collaboration, and provides more realistic testing grounds for various AI models operating in this space. In future work, we plan to use the dataset to construct a set of benchmarks for physically situated collaboration in mixed-reality task assistive scenarios. SigmaCollab is available at https://github.com/microsoft/SigmaCollab.
☆ Forecasting Future Anatomies: Longitudianl Brain Mri-to-Mri Prediction
Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer's disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant's entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.
☆ Unsupervised Learning for Industrial Defect Detection: A Case Study on Shearographic Data
Shearography is a non-destructive testing method for detecting subsurface defects, offering high sensitivity and full-field inspection capabilities. However, its industrial adoption remains limited due to the need for expert interpretation. To reduce reliance on labeled data and manual evaluation, this study explores unsupervised learning methods for automated anomaly detection in shearographic images. Three architectures are evaluated: a fully connected autoencoder, a convolutional autoencoder, and a student-teacher feature matching model. All models are trained solely on defect-free data. A controlled dataset was developed using a custom specimen with reproducible defect patterns, enabling systematic acquisition of shearographic measurements under both ideal and realistic deformation conditions. Two training subsets were defined: one containing only undistorted, defect-free samples, and one additionally including globally deformed, yet defect-free, data. The latter simulates practical inspection conditions by incorporating deformation-induced fringe patterns that may obscure localized anomalies. The models are evaluated in terms of binary classification and, for the student-teacher model, spatial defect localization. Results show that the student-teacher approach achieves superior classification robustness and enables precise localization. Compared to the autoencoder-based models, it demonstrates improved separability of feature representations, as visualized through t-SNE embeddings. Additionally, a YOLOv8 model trained on labeled defect data serves as a reference to benchmark localization quality. This study underscores the potential of unsupervised deep learning for scalable, label-efficient shearographic inspection in industrial environments.
comment: 15 pages, 6 figures, 1 table; accepted for AI-2025 Forty-fifth SGAI International Conference on Artificial Intelligence CAMBRIDGE, ENGLAND 16-18 DECEMBER 2025
☆ LiteVoxel: Low-memory Intelligent Thresholding for Efficient Voxel Rasterization
Sparse-voxel rasterization is a fast, differentiable alternative for optimization-based scene reconstruction, but it tends to underfit low-frequency content, depends on brittle pruning heuristics, and can overgrow in ways that inflate VRAM. We introduce LiteVoxel, a self-tuning training pipeline that makes SV rasterization both steadier and lighter. Our loss is made low-frequency aware via an inverse-Sobel reweighting with a mid-training gamma-ramp, shifting gradient budget to flat regions only after geometry stabilize. Adaptation replaces fixed thresholds with a depth-quantile pruning logic on maximum blending weight, stabilized by EMA-hysteresis guards and refines structure through ray-footprint-based, priority-driven subdivision under an explicit growth budget. Ablations and full-system results across Mip-NeRF 360 (6scenes) and Tanks & Temples (3scenes) datasets show mitigation of errors in low-frequency regions and boundary instability while keeping PSNR/SSIM, training time, and FPS comparable to a strong SVRaster pipeline. Crucially, LiteVoxel reduces peak VRAM by ~40%-60% and preserves low-frequency detail that prior setups miss, enabling more predictable, memory-efficient training without sacrificing perceptual quality.
☆ Keeping it Local, Tiny and Real: Automated Report Generation on Edge Computing Devices for Mechatronic-Based Cognitive Systems
Recent advancements in Deep Learning enable hardware-based cognitive systems, that is, mechatronic systems in general and robotics in particular with integrated Artificial Intelligence, to interact with dynamic and unstructured environments. While the results are impressive, the application of such systems to critical tasks like autonomous driving as well as service and care robotics necessitate the evaluation of large amount of heterogeneous data. Automated report generation for Mobile Robotics can play a crucial role in facilitating the evaluation and acceptance of such systems in various domains. In this paper, we propose a pipeline for generating automated reports in natural language utilizing various multi-modal sensors that solely relies on local models capable of being deployed on edge computing devices, thus preserving the privacy of all actors involved and eliminating the need for external services. In particular, we evaluate our implementation on a diverse dataset spanning multiple domains including indoor, outdoor and urban environments, providing quantitative as well as qualitative evaluation results. Various generated example reports and other supplementary materials are available via a public repository.
comment: 6 pages, 4 figures, 1 table; accepted for MECATRONICS-REM 2025 International Conference, PARIS, FRANCE December 3-5 2025
☆ ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing
Shot assembly is a crucial step in film production and video editing, involving the sequencing and arrangement of shots to construct a narrative, convey information, or evoke emotions. Traditionally, this process has been manually executed by experienced editors. While current intelligent video editing technologies can handle some automated video editing tasks, they often fail to capture the creator's unique artistic expression in shot assembly.To address this challenge, we propose an energy-based optimization method for video shot assembly. Specifically, we first perform visual-semantic matching between the script generated by a large language model and a video library to obtain subsets of candidate shots aligned with the script semantics. Next, we segment and label the shots from reference videos, extracting attributes such as shot size, camera motion, and semantics. We then employ energy-based models to learn from these attributes, scoring candidate shot sequences based on their alignment with reference styles. Finally, we achieve shot assembly optimization by combining multiple syntax rules, producing videos that align with the assembly style of the reference videos. Our method not only automates the arrangement and combination of independent shots according to specific logic, narrative requirements, or artistic styles but also learns the assembly style of reference videos, creating a coherent visual sequence or holistic visual expression. With our system, even users with no prior video editing experience can create visually compelling videos. Project page: https://sobeymil.github.io/esa.com
☆ Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes
The automation of workflows in advanced microscopy is a key goal where foundation models like Language Models (LLMs) and Vision-Language Models (VLMs) show great potential. However, adapting these general-purpose models for specialized scientific tasks is critical, and the optimal domain adaptation strategy is often unclear. To address this, we introduce PtychoBench, a new multi-modal, multi-task benchmark for ptychographic analysis. Using this benchmark, we systematically compare two specialization strategies: Supervised Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies on a visual artifact detection task with VLMs and a textual parameter recommendation task with LLMs in a data-scarce regime. Our findings reveal that the optimal specialization pathway is task-dependent. For the visual task, SFT and ICL are highly complementary, with a fine-tuned model guided by context-aware examples achieving the highest mean performance (Micro-F1 of 0.728). Conversely, for the textual task, ICL on a large base model is the superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a powerful "super-expert" SFT model (0-shot Micro-F1 of 0.839). We also confirm the superiority of context-aware prompting and identify a consistent contextual interference phenomenon in fine-tuned models. These results, benchmarked against strong baselines including GPT-4o and a DINOv3-based classifier, offer key observations for AI in science: the optimal specialization path in our benchmark is dependent on the task modality, offering a clear framework for developing more effective science-based agentic systems.
☆ DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding NeurIPS 2025
Recent advances in multi-modal models have demonstrated strong performance in tasks such as image generation and reasoning. However, applying these models to the fire domain remains challenging due to the lack of publicly available datasets with high-quality fire domain annotations. To address this gap, we introduce DetectiumFire, a large-scale, multi-modal dataset comprising of 22.5k high-resolution fire-related images and 2.5k real-world fire-related videos covering a wide range of fire types, environments, and risk levels. The data are annotated with both traditional computer vision labels (e.g., bounding boxes) and detailed textual prompts describing the scene, enabling applications such as synthetic data generation and fire risk reasoning. DetectiumFire offers clear advantages over existing benchmarks in scale, diversity, and data quality, significantly reducing redundancy and enhancing coverage of real-world scenarios. We validate the utility of DetectiumFire across multiple tasks, including object detection, diffusion-based image generation, and vision-language reasoning. Our results highlight the potential of this dataset to advance fire-related research and support the development of intelligent safety systems. We release DetectiumFire to promote broader exploration of fire understanding in the AI community. The dataset is available at https://kaggle.com/datasets/38b79c344bdfc55d1eed3d22fbaa9c31fad45e27edbbe9e3c529d6e5c4f93890
comment: Advances in Neural Information Processing Systems 2025 (NeurIPS 2025), Poster, https://neurips.cc/virtual/2025/loc/san-diego/poster/121400
☆ Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization
With the rapid growth of the low-altitude economy, UAVs have become crucial for measurement and tracking in patrol systems. However, in GNSS-denied areas, satellite-based localization methods are prone to failure. This paper presents a cross-view UAV localization framework that performs map matching via object detection, aimed at effectively addressing cross-temporal, cross-view, heterogeneous aerial image matching. In typical pipelines, UAV visual localization is formulated as an image-retrieval problem: features are extracted to build a localization map, and the pose of a query image is estimated by matching it to a reference database with known poses. Because publicly available UAV localization datasets are limited, many approaches recast localization as a classification task and rely on scene labels in these datasets to ensure accuracy. Other methods seek to reduce cross-domain differences using polar-coordinate reprojection, perspective transformations, or generative adversarial networks; however, they can suffer from misalignment, content loss, and limited realism. In contrast, we leverage modern object detection to accurately extract salient instances from UAV and satellite images, and integrate a graph neural network to reason about inter-image and intra-image node relationships. Using a fine-grained, graph-based node-similarity metric, our method achieves strong retrieval and localization performance. Extensive experiments on public and real-world datasets show that our approach handles heterogeneous appearance differences effectively and generalizes well, making it applicable to scenarios with larger modality gaps, such as infrared-visible image matching. Our dataset will be publicly available at the following URL: https://github.com/liutao23/ODGNNLoc.git.
comment: 20 pages, Submitted to IEEE TIM
☆ OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis and relighting have shown promising results, most techniques still heavily rely on the synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. We believe that OLATverse represents a pivotal step toward integrating the next generation of inverse rendering and relighting methods with real-world data. The full dataset, along with all post-processing workflows, will be publicly released at https://vcai.mpi-inf.mpg.de/projects/OLATverse/.
☆ MVAFormer: RGB-based Multi-View Spatio-Temporal Action Recognition with Transformer ICIP2024
Multi-view action recognition aims to recognize human actions using multiple camera views and deals with occlusion caused by obstacles or crowds. In this task, cooperation among views, which generates a joint representation by combining multiple views, is vital. Previous studies have explored promising cooperation methods for improving performance. However, since their methods focus only on the task setting of recognizing a single action from an entire video, they are not applicable to the recently popular spatio-temporal action recognition~(STAR) setting, in which each person's action is recognized sequentially. To address this problem, this paper proposes a multi-view action recognition method for the STAR setting, called MVAFormer. In MVAFormer, we introduce a novel transformer-based cooperation module among views. In contrast to previous studies, which utilize embedding vectors with lost spatial information, our module utilizes the feature map for effective cooperation in the STAR setting, which preserves the spatial information. Furthermore, in our module, we divide the self-attention for the same and different views to model the relationship between multiple views effectively. The results of experiments using a newly collected dataset demonstrate that MVAFormer outperforms the comparison baselines by approximately $4.4$ points on the F-measure.
comment: Selected as Best Industry Paper Award at ICIP2024
☆ HAGI++: Head-Assisted Gaze Imputation and Generation
Mobile eye tracking plays a vital role in capturing human visual attention across both real-world and extended reality (XR) environments, making it an essential tool for applications ranging from behavioural research to human-computer interaction. However, missing values due to blinks, pupil detection errors, or illumination changes pose significant challenges for further gaze data analysis. To address this challenge, we introduce HAGI++ - a multi-modal diffusion-based approach for gaze data imputation that, for the first time, uses the integrated head orientation sensors to exploit the inherent correlation between head and eye movements. HAGI++ employs a transformer-based diffusion model to learn cross-modal dependencies between eye and head representations and can be readily extended to incorporate additional body movements. Extensive evaluations on the large-scale Nymeria, Ego-Exo4D, and HOT3D datasets demonstrate that HAGI++ consistently outperforms conventional interpolation methods and deep learning-based time-series imputation baselines in gaze imputation. Furthermore, statistical analyses confirm that HAGI++ produces gaze velocity distributions that closely match actual human gaze behaviour, ensuring more realistic gaze imputations. Moreover, by incorporating wrist motion captured from commercial wearable devices, HAGI++ surpasses prior methods that rely on full-body motion capture in the extreme case of 100% missing gaze data (pure gaze generation). Our method paves the way for more complete and accurate eye gaze recordings in real-world settings and has significant potential for enhancing gaze-based analysis and interaction across various application domains.
comment: Extended version of our UIST'25 paper "HAGI: Head-Assisted Gaze Imputation for Mobile Eye Trackers"
☆ KAO: Kernel-Adaptive Optimization in Diffusion for Satellite Image
Satellite image inpainting is a crucial task in remote sensing, where accurately restoring missing or occluded regions is essential for robust image analysis. In this paper, we propose KAO, a novel framework that utilizes Kernel-Adaptive Optimization within diffusion models for satellite image inpainting. KAO is specifically designed to address the challenges posed by very high-resolution (VHR) satellite datasets, such as DeepGlobe and the Massachusetts Roads Dataset. Unlike existing methods that rely on preconditioned models requiring extensive retraining or postconditioned models with significant computational overhead, KAO introduces a Latent Space Conditioning approach, optimizing a compact latent space to achieve efficient and accurate inpainting. Furthermore, we incorporate Explicit Propagation into the diffusion process, facilitating forward-backward fusion, which improves the stability and precision of the method. Experimental results demonstrate that KAO sets a new benchmark for VHR satellite image restoration, providing a scalable, high-performance solution that balances the efficiency of preconditioned models with the flexibility of postconditioned models.
comment: 18 pages
☆ From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics
Video Understanding, Scene Interpretation and Commonsense Reasoning are highly challenging tasks enabling the interpretation of visual information, allowing agents to perceive, interact with and make rational decisions in its environment. Large Language Models (LLMs) and Visual Language Models (VLMs) have shown remarkable advancements in these areas in recent years, enabling domain-specific applications as well as zero-shot open vocabulary tasks, combining multiple domains. However, the required computational complexity poses challenges for their application on edge devices and in the context of Mobile Robotics, especially considering the trade-off between accuracy and inference time. In this paper, we investigate the capabilities of state-of-the-art VLMs for the task of Scene Interpretation and Action Recognition, with special regard to small VLMs capable of being deployed to edge devices in the context of Mobile Robotics. The proposed pipeline is evaluated on a diverse dataset consisting of various real-world cityscape, on-campus and indoor scenarios. The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on challenges, weaknesses, inherent model biases and the application of the gained information. Supplementary material is provided via the following repository: https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
comment: 15 pages, 6 figures, 1 table; accepted for AI-2025 Forty-fifth SGAI International Conference on Artificial Intelligence CAMBRIDGE, ENGLAND 16-18 DECEMBER 2025
☆ A Kullback-Leibler divergence method for input-system-state identification
The capability of a novel Kullback-Leibler divergence method is examined herein within the Kalman filter framework to select the input-parameter-state estimation execution with the most plausible results. This identification suffers from the uncertainty related to obtaining different results from different initial parameter set guesses, and the examined approach uses the information gained from the data in going from the prior to the posterior distribution to address the issue. Firstly, the Kalman filter is performed for a number of different initial parameter sets providing the system input-parameter-state estimation. Secondly, the resulting posterior distributions are compared simultaneously to the initial prior distributions using the Kullback-Leibler divergence. Finally, the identification with the least Kullback-Leibler divergence is selected as the one with the most plausible results. Importantly, the method is shown to select the better performed identification in linear, nonlinear, and limited information applications, providing a powerful tool for system monitoring.
comment: 32 pages, 17 figures, published in Journal of Sound and Vibration
☆ Synthetic Crop-Weed Image Generation and its Impact on Model Generalization
Precise semantic segmentation of crops and weeds is necessary for agricultural weeding robots. However, training deep learning models requires large annotated datasets, which are costly to obtain in real fields. Synthetic data can reduce this burden, but the gap between simulated and real images remains a challenge. In this paper, we present a pipeline for procedural generation of synthetic crop-weed images using Blender, producing annotated datasets under diverse conditions of plant growth, weed density, lighting, and camera angle. We benchmark several state-of-the-art segmentation models on synthetic and real datasets and analyze their cross-domain generalization. Our results show that training on synthetic images leads to a sim-to-real gap of 10%, surpassing previous state-of-the-art methods. Moreover, synthetic data demonstrates good generalization properties, outperforming real datasets in cross-domain scenarios. These findings highlight the potential of synthetic agricultural datasets and support hybrid strategies for more efficient model training.
☆ ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension EMNLP25
Complex chart understanding tasks demand advanced visual recognition and reasoning capabilities from multimodal large language models (MLLMs). However, current research provides limited coverage of complex chart scenarios and computation-intensive reasoning tasks prevalent in real-world applications. This study proposes an automated multi-stage code-driven pipeline for systematically generating visual reasoning datasets to address these limitations. The pipeline integrates retrieval-augmented generation (RAG) to retrieve professional chart templates and employs chain-of-thought (CoT) strategies to generate reasoning codes that simulate real data distributions, thereby driving chart rendering and question-related statistical computations. Through model-based evaluation, the pipeline enhances chart diversity and data quality. Using this framework, we construct ChartM$^3$, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples for enabling practical performance assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL) experiments demonstrate that our dataset significantly improves reasoning capabilities and cross-domain generalization performance, enabling smaller models to achieve performance comparable to larger-scale models in complex chart comprehension.
comment: 23 pages, EMNLP25 Accepted
☆ IllumFlow: Illumination-Adaptive Low-Light Enhancement via Conditional Rectified Flow and Retinex Decomposition
We present IllumFlow, a novel framework that synergizes conditional Rectified Flow (CRF) with Retinex theory for low-light image enhancement (LLIE). Our model addresses low-light enhancement through separate optimization of illumination and reflectance components, effectively handling both lighting variations and noise. Specifically, we first decompose an input image into reflectance and illumination components following Retinex theory. To model the wide dynamic range of illumination variations in low-light images, we propose a conditional rectified flow framework that represents illumination changes as a continuous flow field. While complex noise primarily resides in the reflectance component, we introduce a denoising network, enhanced by flow-derived data augmentation, to remove reflectance noise and chromatic aberration while preserving color fidelity. IllumFlow enables precise illumination adaptation across lighting conditions while naturally supporting customizable brightness enhancement. Extensive experiments on low-light enhancement and exposure correction demonstrate superior quantitative and qualitative performance over existing methods.
☆ Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs, ViTs and Self-Supervised ViTs
Cats and humans differ in ocular anatomy. Most notably, Felis Catus (domestic cats) have vertically elongated pupils linked to ambush predation; yet, how such specializations manifest in downstream visual representations remains incompletely understood. We present a unified, frozen-encoder benchmark that quantifies feline-human cross-species representational alignment in the wild, across convolutional networks, supervised Vision Transformers, windowed transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel Alignment (linear and RBF) and Representational Similarity Analysis, with additional distributional and stability tests reported in the paper. Across models, DINO ViT-B/16 attains the most substantial alignment (mean CKA-RBF $\approx0.814$, mean CKA-linear $\approx0.745$, mean RSA $\approx0.698$), peaking at early blocks, indicating that token-level self-supervision induces early-stage features that bridge species-specific statistics. Supervised ViTs are competitive on CKA yet show weaker geometric correspondence than DINO (e.g., ViT-B/16 RSA $\approx0.53$ at block8; ViT-L/16 $\approx0.47$ at block14), revealing depth-dependent divergences between similarity and representational geometry. CNNs remain strong baselines but below plain ViTs on alignment, and windowed transformers underperform plain ViTs, implicating architectural inductive biases in cross-species alignment. Results indicate that self-supervision coupled with ViT inductive biases yields representational geometries that more closely align feline and human visual systems than widely used CNNs and windowed Transformers, providing testable neuroscientific hypotheses about where and how cross-species visual computations converge. We release our code and dataset for reference and reproducibility.
☆ MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization
The development of clinically reliable artificial intelligence (AI) systems for mammography is hindered by profound heterogeneity in data quality, metadata standards, and population distributions across public datasets. This heterogeneity introduces dataset-specific biases that severely compromise the generalizability of the model, a fundamental barrier to clinical deployment. We present MammoClean, a public framework for standardization and bias quantification in mammography datasets. MammoClean standardizes case selection, image processing (including laterality and intensity correction), and unifies metadata into a consistent multi-view structure. We provide a comprehensive review of breast anatomy, imaging characteristics, and public mammography datasets to systematically identify key sources of bias. Applying MammoClean to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify substantial distributional shifts in breast density and abnormality prevalence. Critically, we demonstrate the direct impact of data corruption: AI models trained on corrupted datasets exhibit significant performance degradation compared to their curated counterparts. By using MammoClean to identify and mitigate bias sources, researchers can construct unified multi-dataset training corpora that enable development of robust models with superior cross-domain generalization. MammoClean provides an essential, reproducible pipeline for bias-aware AI development in mammography, facilitating fairer comparisons and advancing the creation of safe, effective systems that perform equitably across diverse patient populations and clinical settings. The open-source code is publicly available from: https://github.com/Minds-R-Lab/MammoClean.
☆ A Novel Grouping-Based Hybrid Color Correction Algorithm for Color Point Clouds
Color consistency correction for color point clouds is a fundamental yet important task in 3D rendering and compression applications. In the past, most previous color correction methods aimed at correcting color for color images. The purpose of this paper is to propose a grouping-based hybrid color correction algorithm for color point clouds. Our algorithm begins by estimating the overlapping rate between the aligned source and target point clouds, and then adaptively partitions the target points into two groups, namely the close proximity group Gcl and the moderate proximity group Gmod, or three groups, namely Gcl, Gmod, and the distant proximity group Gdist, when the estimated overlapping rate is low or high, respectively. To correct color for target points in Gcl, a K-nearest neighbors based bilateral interpolation (KBI) method is proposed. To correct color for target points in Gmod, a joint KBI and the histogram equalization (JKHE) method is proposed. For target points in Gdist, a histogram equalization (HE) method is proposed for color correction. Finally, we discuss the grouping-effect free property and the ablation study in our algorithm. The desired color consistency correction benefit of our algorithm has been justified through 1086 testing color point cloud pairs against the state-of-the-art methods. The C++ source code of our algorithm can be accessed from the website: https://github.com/ivpml84079/Point-cloud-color-correction.
Self-Supervised Moving Object Segmentation of Sparse and Noisy Radar Point Clouds SC 2025
Moving object segmentation is a crucial task for safe and reliable autonomous mobile systems like self-driving cars, improving the reliability and robustness of subsequent tasks like SLAM or path planning. While the segmentation of camera or LiDAR data is widely researched and achieves great results, it often introduces an increased latency by requiring the accumulation of temporal sequences to gain the necessary temporal context. Radar sensors overcome this problem with their ability to provide a direct measurement of a point's Doppler velocity, which can be exploited for single-scan moving object segmentation. However, radar point clouds are often sparse and noisy, making data annotation for use in supervised learning very tedious, time-consuming, and cost-intensive. To overcome this problem, we address the task of self-supervised moving object segmentation of sparse and noisy radar point clouds. We follow a two-step approach of contrastive self-supervised representation learning with subsequent supervised fine-tuning using limited amounts of annotated data. We propose a novel clustering-based contrastive loss function with cluster refinement based on dynamic points removal to pretrain the network to produce motion-aware representations of the radar data. Our method improves label efficiency after fine-tuning, effectively boosting state-of-the-art performance by self-supervised pretraining.
comment: Accepted for publication at IEEE International Conference on Intelligent Transportation Systems (ITSC 2025), 8 pages, 3 figures
☆ RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.
☆ CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language model that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.
☆ M3PD Dataset: Dual-view Photoplethysmography (PPG) Using Front-and-rear Cameras of Smartphones in Lab and Clinical Settings
Portable physiological monitoring is essential for early detection and management of cardiovascular disease, but current methods often require specialized equipment that limits accessibility or impose impractical postures that patients cannot maintain. Video-based photoplethysmography on smartphones offers a convenient noninvasive alternative, yet it still faces reliability challenges caused by motion artifacts, lighting variations, and single-view constraints. Few studies have demonstrated reliable application to cardiovascular patients, and no widely used open datasets exist for cross-device accuracy. To address these limitations, we introduce the M3PD dataset, the first publicly available dual-view mobile photoplethysmography dataset, comprising synchronized facial and fingertip videos captured simultaneously via front and rear smartphone cameras from 60 participants (including 47 cardiovascular patients). Building on this dual-view setting, we further propose F3Mamba, which fuses the facial and fingertip views through Mamba-based temporal modeling. The model reduces heart-rate error by 21.9 to 30.2 percent over existing single-view baselines while improving robustness in challenging real-world scenarios. Data and code: https://github.com/Health-HCI-Group/F3Mamba.
☆ GAFD-CC: Global-Aware Feature Decoupling with Confidence Calibration for OOD Detection
Out-of-distribution (OOD) detection is paramount to ensuring the reliability and robustness of learning models in real-world applications. Existing post-hoc OOD detection methods detect OOD samples by leveraging their features and logits information without retraining. However, they often overlook the inherent correlation between features and logits, which is crucial for effective OOD detection. To address this limitation, we propose Global-Aware Feature Decoupling with Confidence Calibration (GAFD-CC). GAFD-CC aims to refine decision boundaries and increase discriminative performance. Firstly, it performs global-aware feature decoupling guided by classification weights. This involves aligning features with the direction of global classification weights to decouple them. From this, GAFD-CC extracts two types of critical information: positively correlated features that promote in-distribution (ID)/OOD boundary refinement and negatively correlated features that suppress false positives and tighten these boundaries. Secondly, it adaptively fuses these decoupled features with multi-scale logit-based confidence for comprehensive and robust OOD detection. Extensive experiments on large-scale benchmarks demonstrate GAFD-CC's competitive performance and strong generalization ability compared to those of state-of-the-art methods.
☆ Cycle-Sync: Robust Global Camera Pose Estimation through Enhanced Cycle-Consistent Synchronization NeurIPS 2025
We introduce Cycle-Sync, a robust and global framework for estimating camera poses (both rotations and locations). Our core innovation is a location solver that adapts message-passing least squares (MPLS) -- originally developed for group synchronization -- to camera location estimation. We modify MPLS to emphasize cycle-consistent information, redefine cycle consistencies using estimated distances from previous iterations, and incorporate a Welsch-type robust loss. We establish the strongest known deterministic exact-recovery guarantee for camera location estimation, showing that cycle consistency alone -- without access to inter-camera distances -- suffices to achieve the lowest sample complexity currently known. To further enhance robustness, we introduce a plug-and-play outlier rejection module inspired by robust subspace recovery, and we fully integrate cycle consistency into MPLS for rotation synchronization. Our global approach avoids the need for bundle adjustment. Experiments on synthetic and real datasets show that Cycle-Sync consistently outperforms leading pose estimators, including full structure-from-motion pipelines with bundle adjustment.
comment: NeurIPS 2025 spotlight paper
☆ 3D Point Cloud Object Detection on Edge Devices for Split Computing
The field of autonomous driving technology is rapidly advancing, with deep learning being a key component. Particularly in the field of sensing, 3D point cloud data collected by LiDAR is utilized to run deep neural network models for 3D object detection. However, these state-of-the-art models are complex, leading to longer processing times and increased power consumption on edge devices. The objective of this study is to address these issues by leveraging Split Computing, a distributed machine learning inference method. Split Computing aims to lessen the computational burden on edge devices, thereby reducing processing time and power consumption. Furthermore, it minimizes the risk of data breaches by only transmitting intermediate data from the deep neural network model. Experimental results show that splitting after voxelization reduces the inference time by 70.8% and the edge device execution time by 90.0%. When splitting within the network, the inference time is reduced by up to 57.1%, and the edge device execution time is reduced by up to 69.5%.
comment: 6 pages. This version includes minor lstlisting configuration adjustments for successful compilation. No changes to content or layout. Originally published at ACM/IEEE RAGE 2024
☆ Link prediction Graph Neural Networks for structure recognition of Handwritten Mathematical Expressions ICDAR2025
We propose a Graph Neural Network (GNN)-based approach for Handwritten Mathematical Expression (HME) recognition by modeling HMEs as graphs, where nodes represent symbols and edges capture spatial dependencies. A deep BLSTM network is used for symbol segmentation, recognition, and spatial relation classification, forming an initial primitive graph. A 2D-CFG parser then generates all possible spatial relations, while the GNN-based link prediction model refines the structure by removing unnecessary connections, ultimately forming the Symbol Label Graph. Experimental results demonstrate the effectiveness of our approach, showing promising performance in HME structure recognition.
comment: accepted for ICDAR2025-WML
☆ SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning
We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at https://github.com/BytedanceDouyinContent/SAIL-RL.
☆ Are Euler angles a useful rotation parameterisation for pose estimation with Normalizing Flows? BMVC 2025
Object pose estimation is a task that is of central importance in 3D Computer Vision. Given a target image and a canonical pose, a single point estimate may very often be sufficient; however, a probabilistic pose output is related to a number of benefits when pose is not unambiguous due to sensor and projection constraints or inherent object symmetries. With this paper, we explore the usefulness of using the well-known Euler angles parameterisation as a basis for a Normalizing Flows model for pose estimation. Isomorphic to spatial rotation, 3D pose has been parameterized in a number of ways, either in or out of the context of parameter estimation. We explore the idea that Euler angles, despite their shortcomings, may lead to useful models in a number of aspects, compared to a model built on a more complex parameterisation.
comment: BMVC 2025 workshop proceedings (Smart Cameras for Smarter Autonomous Vehicles & Robots)
☆ Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework
Medical Report Generation (MRG) is a key part of modern medical diagnostics, as it automatically generates reports from radiological images to reduce radiologists' burden. However, reliable MRG models for lesion description face three main challenges: insufficient domain knowledge understanding, poor text-visual entity embedding alignment, and spurious correlations from cross-modal biases. Previous work only addresses single challenges, while this paper tackles all three via a novel hierarchical task decomposition approach, proposing the HTSC-CIF framework. HTSC-CIF classifies the three challenges into low-, mid-, and high-level tasks: 1) Low-level: align medical entity features with spatial locations to enhance domain knowledge for visual encoders; 2) Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling (images) to boost cross-modal alignment via mutual guidance; 3) High-level: a cross-modal causal intervention module (via front-door intervention) to reduce confounders and improve interpretability. Extensive experiments confirm HTSC-CIF's effectiveness, significantly outperforming state-of-the-art (SOTA) MRG methods. Code will be made public upon paper acceptance.
☆ Monocular absolute depth estimation from endoscopy via domain-invariant feature learning and latent consistency
Monocular depth estimation (MDE) is a critical task to guide autonomous medical robots. However, obtaining absolute (metric) depth from an endoscopy camera in surgical scenes is difficult, which limits supervised learning of depth on real endoscopic images. Current image-level unsupervised domain adaptation methods translate synthetic images with known depth maps into the style of real endoscopic frames and train depth networks using these translated images with their corresponding depth maps. However a domain gap often remains between real and translated synthetic images. In this paper, we present a latent feature alignment method to improve absolute depth estimation by reducing this domain gap in the context of endoscopic videos of the central airway. Our methods are agnostic to the image translation process and focus on the depth estimation itself. Specifically, the depth network takes translated synthetic and real endoscopic frames as input and learns latent domain-invariant features via adversarial learning and directional feature consistency. The evaluation is conducted on endoscopic videos of central airway phantoms with manually aligned absolute depth maps. Compared to state-of-the-art MDE methods, our approach achieves superior performance on both absolute and relative depth metrics, and consistently improves results across various backbones and pretrained weights. Our code is available at https://github.com/MedICL-VU/MDE.
☆ Collaborative Attention and Consistent-Guided Fusion of MRI and PET for Alzheimer's Disease Diagnosis
Alzheimer's disease (AD) is the most prevalent form of dementia, and its early diagnosis is essential for slowing disease progression. Recent studies on multimodal neuroimaging fusion using MRI and PET have achieved promising results by integrating multi-scale complementary features. However, most existing approaches primarily emphasize cross-modal complementarity while overlooking the diagnostic importance of modality-specific features. In addition, the inherent distributional differences between modalities often lead to biased and noisy representations, degrading classification performance. To address these challenges, we propose a Collaborative Attention and Consistent-Guided Fusion framework for MRI and PET based AD diagnosis. The proposed model introduces a learnable parameter representation (LPR) block to compensate for missing modality information, followed by a shared encoder and modality-independent encoders to preserve both shared and specific representations. Furthermore, a consistency-guided mechanism is employed to explicitly align the latent distributions across modalities. Experimental results on the ADNI dataset demonstrate that our method achieves superior diagnostic performance compared with existing fusion strategies.
☆ Can Foundation Models Revolutionize Mobile AR Sparse Sensing?
Mobile sensing systems have long faced a fundamental trade-off between sensing quality and efficiency due to constraints in computation, power, and other limitations. Sparse sensing, which aims to acquire and process only a subset of sensor data, has been a key strategy for maintaining performance under such constraints. However, existing sparse sensing methods often suffer from reduced accuracy, as missing information across space and time introduces uncertainty into many sensing systems. In this work, we investigate whether foundation models can change the landscape of mobile sparse sensing. Using real-world mobile AR data, our evaluations demonstrate that foundation models offer significant improvements in geometry-aware image warping, a central technique for enabling accurate reuse of cross-frame information. Furthermore, our study demonstrates the scalability of foundation model-based sparse sensing and shows its leading performance in 3D scene reconstruction. Collectively, our study reveals critical aspects of the promises and the open challenges of integrating foundation models into mobile sparse sensing systems.
☆ High-Resolution Magnetic Particle Imaging System Matrix Recovery Using a Vision Transformer with Residual Feature Network
This study presents a hybrid deep learning framework, the Vision Transformer with Residual Feature Network (VRF-Net), for recovering high-resolution system matrices in Magnetic Particle Imaging (MPI). MPI resolution often suffers from downsampling and coil sensitivity variations. VRF-Net addresses these challenges by combining transformer-based global attention with residual convolutional refinement, enabling recovery of both large-scale structures and fine details. To reflect realistic MPI conditions, the system matrix is degraded using a dual-stage downsampling strategy. Training employed paired-image super-resolution on the public Open MPI dataset and a simulated dataset incorporating variable coil sensitivity profiles. For system matrix recovery on the Open MPI dataset, VRF-Net achieved nRMSE = 0.403, pSNR = 39.08 dB, and SSIM = 0.835 at 2x scaling, and maintained strong performance even at challenging scale 8x (pSNR = 31.06 dB, SSIM = 0.717). For the simulated dataset, VRF-Net achieved nRMSE = 4.44, pSNR = 28.52 dB, and SSIM = 0.771 at 2x scaling, with stable performance at higher scales. On average, it reduced nRMSE by 88.2%, increased pSNR by 44.7%, and improved SSIM by 34.3% over interpolation and CNN-based methods. In image reconstruction of Open MPI phantoms, VRF-Net further reduced reconstruction error to nRMSE = 1.79 at 2x scaling, while preserving structural fidelity (pSNR = 41.58 dB, SSIM = 0.960), outperforming existing methods. These findings demonstrate that VRF-Net enables sharper, artifact-free system matrix recovery and robust image reconstruction across multiple scales, offering a promising direction for future in vivo applications.
☆ Estimation of Segmental Longitudinal Strain in Transesophageal Echocardiography by Deep Learning
Segmental longitudinal strain (SLS) of the left ventricle (LV) is an important prognostic indicator for evaluating regional LV dysfunction, in particular for diagnosing and managing myocardial ischemia. Current techniques for strain estimation require significant manual intervention and expertise, limiting their efficiency and making them too resource-intensive for monitoring purposes. This study introduces the first automated pipeline, autoStrain, for SLS estimation in transesophageal echocardiography (TEE) using deep learning (DL) methods for motion estimation. We present a comparative analysis of two DL approaches: TeeFlow, based on the RAFT optical flow model for dense frame-to-frame predictions, and TeeTracker, based on the CoTracker point trajectory model for sparse long-sequence predictions. As ground truth motion data from real echocardiographic sequences are hardly accessible, we took advantage of a unique simulation pipeline (SIMUS) to generate a highly realistic synthetic TEE (synTEE) dataset of 80 patients with ground truth myocardial motion to train and evaluate both models. Our evaluation shows that TeeTracker outperforms TeeFlow in accuracy, achieving a mean distance error in motion estimation of 0.65 mm on a synTEE test dataset. Clinical validation on 16 patients further demonstrated that SLS estimation with our autoStrain pipeline aligned with clinical references, achieving a mean difference (95\% limits of agreement) of 1.09% (-8.90% to 11.09%). Incorporation of simulated ischemia in the synTEE data improved the accuracy of the models in quantifying abnormal deformation. Our findings indicate that integrating AI-driven motion estimation with TEE can significantly enhance the precision and efficiency of cardiac function assessment in clinical settings.
comment: 13 pages, IEEE Journal of Biomedical and Health Informatics
☆ Object-Centric 3D Gaussian Splatting for Strawberry Plant Reconstruction and Phenotyping
Strawberries are among the most economically significant fruits in the United States, generating over $2 billion in annual farm-gate sales and accounting for approximately 13% of the total fruit production value. Plant phenotyping plays a vital role in selecting superior cultivars by characterizing plant traits such as morphology, canopy structure, and growth dynamics. However, traditional plant phenotyping methods are time-consuming, labor-intensive, and often destructive. Recently, neural rendering techniques, notably Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have emerged as powerful frameworks for high-fidelity 3D reconstruction. By capturing a sequence of multi-view images or videos around a target plant, these methods enable non-destructive reconstruction of complex plant architectures. Despite their promise, most current applications of 3DGS in agricultural domains reconstruct the entire scene, including background elements, which introduces noise, increases computational costs, and complicates downstream trait analysis. To address this limitation, we propose a novel object-centric 3D reconstruction framework incorporating a preprocessing pipeline that leverages the Segment Anything Model v2 (SAM-2) and alpha channel background masking to achieve clean strawberry plant reconstructions. This approach produces more accurate geometric representations while substantially reducing computational time. With a background-free reconstruction, our algorithm can automatically estimate important plant traits, such as plant height and canopy width, using DBSCAN clustering and Principal Component Analysis (PCA). Experimental results show that our method outperforms conventional pipelines in both accuracy and efficiency, offering a scalable and non-destructive solution for strawberry plant phenotyping.
comment: 11 pages, 4 figures, 3 tables
☆ Language-Enhanced Generative Modeling for PET Synthesis from MRI and Blood Biomarkers
Background: Alzheimer's disease (AD) diagnosis heavily relies on amyloid-beta positron emission tomography (Abeta-PET), which is limited by high cost and limited accessibility. This study explores whether Abeta-PET spatial patterns can be predicted from blood-based biomarkers (BBMs) and MRI scans. Methods: We collected Abeta-PET images, T1-weighted MRI scans, and BBMs from 566 participants. A language-enhanced generative model, driven by a large language model (LLM) and multimodal information fusion, was developed to synthesize PET images. Synthesized images were evaluated for image quality, diagnostic consistency, and clinical applicability within a fully automated diagnostic pipeline. Findings: The synthetic PET images closely resemble real PET scans in both structural details (SSIM = 0.920 +/- 0.003) and regional patterns (Pearson's r = 0.955 +/- 0.007). Diagnostic outcomes using synthetic PET show high agreement with real PET-based diagnoses (accuracy = 0.80). Using synthetic PET, we developed a fully automatic AD diagnostic pipeline integrating PET synthesis and classification. The synthetic PET-based model (AUC = 0.78) outperforms T1-based (AUC = 0.68) and BBM-based (AUC = 0.73) models, while combining synthetic PET and BBMs further improved performance (AUC = 0.79). Ablation analysis supports the advantages of LLM integration and prompt engineering. Interpretation: Our language-enhanced generative model synthesizes realistic PET images, enhancing the utility of MRI and BBMs for Abeta spatial pattern assessment and improving the diagnostic workflow for Alzheimer's disease.
comment: 31 pages, 8 figures
☆ OmniField: Conditioned Neural Fields for Robust Multimodal Spatiotemporal Learning
Multimodal spatiotemporal learning on real-world experimental data is constrained by two challenges: within-modality measurements are sparse, irregular, and noisy (QA/QC artifacts) but cross-modally correlated; the set of available modalities varies across space and time, shrinking the usable record unless models can adapt to arbitrary subsets at train and test time. We propose OmniField, a continuity-aware framework that learns a continuous neural field conditioned on available modalities and iteratively fuses cross-modal context. A multimodal crosstalk block architecture paired with iterative cross-modal refinement aligns signals prior to the decoder, enabling unified reconstruction, interpolation, forecasting, and cross-modal prediction without gridding or surrogate preprocessing. Extensive evaluations show that OmniField consistently outperforms eight strong multimodal spatiotemporal baselines. Under heavy simulated sensor noise, performance remains close to clean-input levels, highlighting robustness to corrupted measurements.
comment: 25 pages, 12 figures, 8 tables
☆ MM-UNet: Morph Mamba U-shaped Convolutional Networks for Retinal Vessel Segmentation
Accurate detection of retinal vessels plays a critical role in reflecting a wide range of health status indicators in the clinical diagnosis of ocular diseases. Recently, advances in deep learning have led to a surge in retinal vessel segmentation methods, which have significantly contributed to the quantitative analysis of vascular morphology. However, retinal vasculature differs significantly from conventional segmentation targets in that it consists of extremely thin and branching structures, whose global morphology varies greatly across images. These characteristics continue to pose challenges to segmentation precision and robustness. To address these issues, we propose MM-UNet, a novel architecture tailored for efficient retinal vessel segmentation. The model incorporates Morph Mamba Convolution layers, which replace pointwise convolutions to enhance branching topological perception through morph, state-aware feature sampling. Additionally, Reverse Selective State Guidance modules integrate reverse guidance theory with state-space modeling to improve geometric boundary awareness and decoding efficiency. Extensive experiments conducted on two public retinal vessel segmentation datasets demonstrate the superior performance of the proposed method in segmentation accuracy. Compared to the existing approaches, MM-UNet achieves F1-score gains of 1.64 $\%$ on DRIVE and 1.25 $\%$ on STARE, demonstrating its effectiveness and advancement. The project code is public via https://github.com/liujiawen-jpg/MM-UNet.
comment: This paper was accepted by IEEE BIBM 2025 conference
☆ Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models ICCV2025
In this technical report, we introduce a framework to address Grounded Video Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our proposed approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning \& QA, (2) Spatio-temporal Grounding and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. To this end, we achieve the HOTA score of 0.4968, which marks a significant improvement over the previous year's winning score of 0.2704 on GVQA task.
comment: 1st place winner of Grounded Videoqa track at the ICCV2025 Perception Test
☆ Autobiasing Event Cameras for Flickering Mitigation
Understanding and mitigating flicker effects caused by rapid variations in light intensity is critical for enhancing the performance of event cameras in diverse environments. This paper introduces an innovative autonomous mechanism for tuning the biases of event cameras, effectively addressing flicker across a wide frequency range -25 Hz to 500 Hz. Unlike traditional methods that rely on additional hardware or software for flicker filtering, our approach leverages the event cameras inherent bias settings. Utilizing a simple Convolutional Neural Networks -CNNs, the system identifies instances of flicker in a spatial space and dynamically adjusts specific biases to minimize its impact. The efficacy of this autobiasing system was robustly tested using a face detector framework under both well-lit and low-light conditions, as well as across various frequencies. The results demonstrated significant improvements: enhanced YOLO confidence metrics for face detection, and an increased percentage of frames capturing detected faces. Moreover, the average gradient, which serves as an indicator of flicker presence through edge detection, decreased by 38.2 percent in well-lit conditions and by 53.6 percent in low-light conditions. These findings underscore the potential of our approach to significantly improve the functionality of event cameras in a range of adverse lighting scenarios.
☆ Fast Measuring Pavement Crack Width by Cascading Principal Component Analysis
Accurate quantification of pavement crack width plays a pivotal role in assessing structural integrity and guiding maintenance interventions. However, achieving precise crack width measurements presents significant challenges due to: (1) the complex, non-uniform morphology of crack boundaries, which limits the efficacy of conventional approaches, and (2) the demand for rapid measurement capabilities from arbitrary pixel locations to facilitate comprehensive pavement condition evaluation. To overcome these limitations, this study introduces a cascaded framework integrating Principal Component Analysis (PCA) and Robust PCA (RPCA) for efficient crack width extraction from digital images. The proposed methodology comprises three sequential stages: (1) initial crack segmentation using established detection algorithms to generate a binary representation, (2) determination of the primary orientation axis for quasi-parallel cracks through PCA, and (3) extraction of the Main Propagation Axis (MPA) for irregular crack geometries using RPCA. Comprehensive evaluations were conducted across three publicly available datasets, demonstrating that the proposed approach achieves superior performance in both computational efficiency and measurement accuracy compared to existing state-of-the-art techniques.
☆ From Instance Segmentation to 3D Growth Trajectory Reconstruction in Planktonic Foraminifera
Planktonic foraminifera, marine protists characterized by their intricate chambered shells, serve as valuable indicators of past and present environmental conditions. Understanding their chamber growth trajectory provides crucial insights into organismal development and ecological adaptation under changing environments. However, automated tracing of chamber growth from imaging data remains largely unexplored, with existing approaches relying heavily on manual segmentation of each chamber, which is time-consuming and subjective. In this study, we propose an end-to-end pipeline that integrates instance segmentation, a computer vision technique not extensively explored in foraminifera, with a dedicated chamber ordering algorithm to automatically reconstruct three-dimensional growth trajectories from high-resolution computed tomography scans. We quantitatively and qualitatively evaluate multiple instance segmentation methods, each optimized for distinct spatial features of the chambers, and examine their downstream influence on growth-order reconstruction accuracy. Experimental results on expert-annotated datasets demonstrate that the proposed pipeline substantially reduces manual effort while maintaining biologically meaningful accuracy. Although segmentation models exhibit under-segmentation in smaller chambers due to reduced voxel fidelity and subtle inter-chamber connectivity, the chamber-ordering algorithm remains robust, achieving consistent reconstruction of developmental trajectories even under partial segmentation. This work provides the first fully automated and reproducible pipeline for digital foraminiferal growth analysis, establishing a foundation for large-scale, data-driven ecological studies.
♻ ☆ GS-Verse: Mesh-based Gaussian Splatting for Physics-aware Interaction in Virtual Reality
As the demand for immersive 3D content grows, the need for intuitive and efficient interaction methods becomes paramount. Current techniques for physically manipulating 3D content within Virtual Reality (VR) often face significant limitations, including reliance on engineering-intensive processes and simplified geometric representations, such as tetrahedral cages, which can compromise visual fidelity and physical accuracy. In this paper, we introduce GS-Verse (Gaussian Splatting for Virtual Environment Rendering and Scene Editing), a novel method designed to overcome these challenges by directly integrating an object's mesh with a Gaussian Splatting (GS) representation. Our approach enables more precise surface approximation, leading to highly realistic deformations and interactions. By leveraging existing 3D mesh assets, GS-Verse facilitates seamless content reuse and simplifies the development workflow. Moreover, our system is designed to be physics-engine-agnostic, granting developers robust deployment flexibility. This versatile architecture delivers a highly realistic, adaptable, and intuitive approach to interactive 3D manipulation. We rigorously validate our method against the current state-of-the-art technique that couples VR with GS in a comparative user study involving 18 participants. Specifically, we demonstrate that our approach is statistically significantly better for physics-aware stretching manipulation and is also more consistent in other physics-based manipulations like twisting and shaking. Further evaluation across various interactions and scenes confirms that our method consistently delivers high and reliable performance, showing its potential as a plausible alternative to existing methods.
♻ ☆ DIsoN: Decentralized Isolation Networks for Out-of-Distribution Detection in Medical Imaging NeurIPS 2025
Safe deployment of machine learning (ML) models in safety-critical domains such as medical imaging requires detecting inputs with characteristics not seen during training, known as out-of-distribution (OOD) detection, to prevent unreliable predictions. Effective OOD detection after deployment could benefit from access to the training data, enabling direct comparison between test samples and the training data distribution to identify differences. State-of-the-art OOD detection methods, however, either discard the training data after deployment or assume that test samples and training data are centrally stored together, an assumption that rarely holds in real-world settings. This is because shipping the training data with the deployed model is usually impossible due to the size of training databases, as well as proprietary or privacy constraints. We introduce the Isolation Network, an OOD detection framework that quantifies the difficulty of separating a target test sample from the training data by solving a binary classification task. We then propose Decentralized Isolation Networks (DIsoN), which enables the comparison of training and test data when data-sharing is impossible, by exchanging only model parameters between the remote computational nodes of training and deployment. We further extend DIsoN with class-conditioning, comparing a target sample solely with training data of its predicted class. We evaluate DIsoN on four medical imaging datasets (dermatology, chest X-ray, breast ultrasound, histopathology) across 12 OOD detection tasks. DIsoN performs favorably against existing methods while respecting data-privacy. This decentralized OOD detection framework opens the way for a new type of service that ML developers could provide along with their models: providing remote, secure utilization of their training data for OOD detection services. Code: https://github.com/FelixWag/DIsoN
comment: Accepted at NeurIPS 2025
♻ ☆ A Practical Investigation of Spatially-Controlled Image Generation with Transformers
Enabling image generation models to be spatially controlled is an important area of research, empowering users to better generate images according to their own fine-grained specifications via e.g. edge maps, poses. Although this task has seen impressive improvements in recent times, a focus on rapidly producing stronger models has come at the cost of detailed and fair scientific comparison. Differing training data, model architectures and generation paradigms make it difficult to disentangle the factors contributing to performance. Meanwhile, the motivations and nuances of certain approaches become lost in the literature. In this work, we aim to provide clear takeaways across generation paradigms for practitioners wishing to develop transformer-based systems for spatially-controlled generation, clarifying the literature and addressing knowledge gaps. We perform controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive (AR) models. First, we establish control token prefilling as a simple, general and performant baseline approach for transformers. We then investigate previously underexplored sampling time enhancements, showing that extending classifier-free guidance to control, as well as softmax truncation, have a strong impact on control-generation consistency. Finally, we re-clarify the motivation of adapter-based approaches, demonstrating that they mitigate "forgetting" and maintain generation quality when trained on limited downstream data, but underperform full training in terms of generation-control consistency.
comment: TMLR https://openreview.net/forum?id=loT6xhgLYK
♻ ☆ The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs
Coral reefs are declining worldwide due to climate change and local stressors. To inform effective conservation or restoration, monitoring at the highest possible spatial and temporal resolution is necessary. Conventional coral reef surveying methods are limited in scalability due to their reliance on expert labor time, motivating the use of computer vision tools to automate the identification and abundance estimation of live corals from images. However, the design and evaluation of such tools has been impeded by the lack of large high quality datasets. We release the Coralscapes dataset, the first general-purpose dense semantic segmentation dataset for coral reefs, covering 2075 images, 39 benthic classes, and 174k segmentation masks annotated by experts. Coralscapes has a similar scope and the same structure as the widely used Cityscapes dataset for urban scene segmentation, allowing benchmarking of semantic segmentation models in a new challenging domain which requires expert knowledge to annotate. We benchmark a wide range of semantic segmentation models, and find that transfer learning from Coralscapes to existing smaller datasets consistently leads to state-of-the-art performance. Coralscapes will catalyze research on efficient, scalable, and standardized coral reef surveying methods based on computer vision, and holds the potential to streamline the development of underwater ecological robotics.
♻ ☆ Image Super-Resolution with Guarantees via Conformalized Generative Models NeurIPS 2025
The increasing use of generative ML foundation models for image restoration tasks such as super-resolution calls for robust and interpretable uncertainty quantification methods. We address this need by presenting a novel approach based on conformal prediction techniques to create a 'confidence mask' capable of reliably and intuitively communicating where the generated image can be trusted. Our method is adaptable to any black-box generative model, including those locked behind an opaque API, requires only easily attainable data for calibration, and is highly customizable via the choice of a local image similarity metric. We prove strong theoretical guarantees for our method that span fidelity error control (according to our local image similarity metric), reconstruction quality, and robustness in the face of data leakage. Finally, we empirically evaluate these results and establish our method's solid performance.
comment: To appear at NeurIPS 2025. 17 pages, 7 figures
♻ ☆ Positive Semi-definite Latent Factor Grouping-Boosted Cluster-reasoning Instance Disentangled Learning for WSI Representation
Multiple instance learning (MIL) has been widely used for representing whole-slide pathology images. However, spatial, semantic, and decision entanglements among instances limit its representation and interpretability. To address these challenges, we propose a latent factor grouping-boosted cluster-reasoning instance disentangled learning framework for whole-slide image (WSI) interpretable representation in three phases. First, we introduce a novel positive semi-definite latent factor grouping that maps instances into a latent subspace, effectively mitigating spatial entanglement in MIL. To alleviate semantic entanglement, we employs instance probability counterfactual inference and optimization via cluster-reasoning instance disentangling. Finally, we employ a generalized linear weighted decision via instance effect re-weighting to address decision entanglement. Extensive experiments on multicentre datasets demonstrate that our model outperforms all state-of-the-art models. Moreover, it attains pathologist-aligned interpretability through disentangled representations and a transparent decision-making process.
comment: Our code is available at https://github.com/Prince-Lee-PathAI/PG-CIDL
♻ ☆ Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey
3D reconstruction and view synthesis are foundational problems in computer vision, graphics, and immersive technologies such as augmented reality (AR), virtual reality (VR), and digital twins. Traditional methods rely on computationally intensive iterative optimization in a complex chain, limiting their applicability in real-world scenarios. Recent advances in feed-forward approaches, driven by deep learning, have revolutionized this field by enabling fast and generalizable 3D reconstruction and view synthesis. This survey offers a comprehensive review of feed-forward techniques for 3D reconstruction and view synthesis, with a taxonomy according to the underlying representation architectures including point cloud, 3D Gaussian Splatting (3DGS), Neural Radiance Fields (NeRF), etc. We examine key tasks such as pose-free reconstruction, dynamic 3D reconstruction, and 3D-aware image and video synthesis, highlighting their applications in digital humans, SLAM, robotics, and beyond. In addition, we review commonly used datasets with detailed statistics, along with evaluation protocols for various downstream tasks. We conclude by discussing open research challenges and promising directions for future work, emphasizing the potential of feed-forward approaches to advance the state of the art in 3D vision.
comment: A project page associated with this survey is available at https://fnzhan.com/projects/Feed-Forward-3D
♻ ☆ Prompt to Restore, Restore to Prompt: Cyclic Prompting for Universal Adverse Weather Removal
Universal adverse weather removal (UAWR) seeks to address various weather degradations within a unified framework. Recent methods are inspired by prompt learning using pre-trained vision-language models (e.g., CLIP), leveraging degradation-aware prompts to facilitate weather-free image restoration, yielding significant improvements. In this work, we propose CyclicPrompt, an innovative cyclic prompt approach designed to enhance the effectiveness, adaptability, and generalizability of UAWR. CyclicPrompt Comprises two key components: 1) a composite context prompt that integrates weather-related information and context-aware representations into the network to guide restoration. This prompt differs from previous methods by marrying learnable input-conditional vectors with weather-specific knowledge, thereby improving adaptability across various degradations. 2) The erase-and-paste mechanism, after the initial guided restoration, substitutes weather-specific knowledge with constrained restoration priors, inducing high-quality weather-free concepts into the composite prompt to further fine-tune the restoration process. Therefore, we can form a cyclic "Prompt-Restore-Prompt" pipeline that adeptly harnesses weather-specific knowledge, textual contexts, and reliable textures. Extensive experiments on synthetic and real-world datasets validate the superior performance of CyclicPrompt. The code is available at: https://github.com/RongxinL/CyclicPrompt.
♻ ☆ Mobile Robotic Multi-View Photometric Stereo SP
Multi-View Photometric Stereo (MVPS) is a popular method for fine-detailed 3D acquisition of an object from images. Despite its outstanding results on diverse material objects, a typical MVPS experimental setup requires a well-calibrated light source and a monocular camera installed on an immovable base. This restricts the use of MVPS on a movable platform, limiting us from taking MVPS benefits in 3D acquisition for mobile robotics applications. To this end, we introduce a new mobile robotic system for MVPS. While the proposed system brings advantages, it introduces additional algorithmic challenges. Addressing them, in this paper, we further propose an incremental approach for mobile robotic MVPS. Our approach leverages a supervised learning setup to predict per-view surface normal, object depth, and per-pixel uncertainty in model-predicted results. A refined depth map per view is obtained by solving an MVPS-driven optimization problem proposed in this paper. Later, we fuse the refined depth map while tracking the camera pose w.r.t the reference frame to recover globally consistent object 3D geometry. Experimental results show the advantages of our robotic system and algorithm, featuring the local high-frequency surface detail recovery with globally consistent object shape. Our work is beyond any MVPS system yet presented, providing encouraging results on objects with unknown reflectance properties using fewer frames without a tiring calibration and installation process, enabling computationally efficient robotic automation approach to photogrammetry. The proposed approach is nearly 100 times computationally faster than the state-of-the-art MVPS methods such as [1, 2] while maintaining the similar results when tested on subjects taken from the benchmark DiLiGenT MV dataset [3].
comment: Acknowledgment Added. Published at International Society Journal of Photogrammetry and Remote Sensing (ISPRS). 32 pages, 14 Figures, 5 Tables
♻ ☆ GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution
Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but pose challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA (avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies: Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key semantics.Integrating these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench.
comment: NeurlPS 2025 Spotlight
♻ ☆ Label tree semantic losses for rich multi-class medical image segmentation
Rich and accurate medical image segmentation is poised to underpin the next generation of AI-defined clinical practice by delineating critical anatomy for pre-operative planning, guiding real-time intra-operative navigation, and supporting precise post-operative assessment. However, commonly used learning methods for medical and surgical imaging segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the labels space. This becomes particularly problematic as the cardinality and richness of labels increases to include subtly different classes. In this work, we propose two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations to extend the applicability of our proposed losses. Extensive experiments are reported on two medical and surgical image segmentation tasks, namely head MRI for whole brain parcellation (WBP) with full supervision and neurosurgical hyperspectral imaging (HSI) for scene understanding with sparse annotations. Results demonstrate that our proposed method reaches state-of-the-art performance in both cases.
♻ ☆ Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Recently, augmenting vision-language-action models (VLAs) with world-models has shown promise in robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, we propose training techniques such as independent noise perturbations for each modality and a decoupled flow matching loss, which enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Furthermore, based on the decoupled training framework, we introduce a sampling method where we sample action and vision tokens asynchronously at different rates, which shows improvement through inference-time scaling. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods, with our inference-time scaling approach providing an additional 2-5% gain on success rate. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%, confirming its effectiveness beyond simulation. Lastly, we demonstrate the effectiveness of DUST in large-scale pretraining with action-free videos from BridgeV2, where DUST leads to significant gain when transferred to the RoboCasa benchmark.
comment: 20 pages, 10 figures
♻ ☆ Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions ICCV2025
As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Consequently, their capabilities in this crucial domain are mostly unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video, text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.
comment: ICCV2025 Workshop
♻ ☆ Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment SP
In this work, we rethink the approach to video super-resolution by introducing a method based on the Diffusion Posterior Sampling framework, combined with an unconditional video diffusion transformer operating in latent space. The video generation model, a diffusion transformer, functions as a space-time model. We argue that a powerful model, which learns the physics of the real world, can easily handle various kinds of motion patterns as prior knowledge, thus eliminating the need for explicit estimation of optical flows or motion parameters for pixel alignment. Furthermore, a single instance of the proposed video diffusion transformer model can adapt to different sampling conditions without re-training. Empirical results on synthetic and real-world datasets illustrate the feasibility of diffusion-based, alignment-free video super-resolution.
comment: ICSPS 2025
♻ ☆ Robust Identity Perceptual Watermark Against Deepfake Face Swapping
Notwithstanding offering convenience and entertainment to society, Deepfake face swapping has caused critical privacy issues with the rapid development of deep generative models. Due to imperceptible artifacts in high-quality synthetic images, passive detection models against face swapping in recent years usually suffer performance damping regarding the generalizability issue in cross-domain scenarios. Therefore, several studies have been attempted to proactively protect the original images against malicious manipulations by inserting invisible signals in advance. However, existing proactive defense approaches demonstrate unsatisfactory results with respect to visual quality, detection accuracy, and source tracing ability. In this study, to fulfill the research gap, we propose a robust identity perceptual watermarking framework that concurrently performs detection and source tracing against Deepfake face swapping proactively. We innovatively assign identity semantics regarding the image contents to the watermarks and devise an unpredictable and nonreversible chaotic encryption system to ensure watermark confidentiality. The watermarks are robustly encoded and recovered by jointly training an encoder-decoder framework along with adversarial image manipulations. For a suspect image, falsification is accomplished by justifying the consistency between the content-matched identity perceptual watermark and the recovered robust watermark, without requiring the ground-truth. Moreover, source tracing can be accomplished based on the identity semantics that the recovered watermark carries. Extensive experiments demonstrate state-of-the-art detection and source tracing performance against Deepfake face swapping with promising watermark robustness for both cross-dataset and cross-manipulation settings.
comment: In peer review
♻ ☆ RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing NeurIPS 2025
Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy, incorporating two key innovations: 1) a rotation-aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi-scale token prediction objectives that address the extreme variations in object scales inherent to RS imagery. Systematic empirical studies validate that Mamba adheres to RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency. The source code and pretrained models will be released at https://github.com/MiliLab/RoMA.
comment: NeurIPS 2025
♻ ☆ ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary rather than isomorphic modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on approximately 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7 percent over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
comment: project page: https://thinkmorph.github.io/
♻ ☆ Training Convolutional Neural Networks with the Forward-Forward algorithm SC
Recent successes in image analysis with deep neural networks are achieved almost exclusively with Convolutional Neural Networks (CNNs), typically trained using the backpropagation (BP) algorithm. In a 2022 preprint, Geoffrey Hinton proposed the Forward-Forward (FF) algorithm as a biologically inspired alternative, where positive and negative examples are jointly presented to the network and training is guided by a locally defined goodness function. Here, we extend the FF paradigm to CNNs. We introduce two spatially extended labeling strategies, based on Fourier patterns and morphological transformations, that enable convolutional layers to access label information across all spatial positions. On CIFAR10, we show that deeper FF-trained CNNs can be optimized successfully and that morphology-based labels prevent shortcut solutions on dataset with more complex and fine features. On CIFAR100, carefully designed label sets scale effectively to 100 classes. Class Activation Maps reveal that FF-trained CNNs learn meaningful and complementary features across layers. Together, these results demonstrate that FF training is feasible beyond fully connected networks, provide new insights into its learning dynamics and stability, and highlight its potential for neuromorphic computing and biologically inspired learning.
comment: PEER-REVIEWED VERSION PUBLISHED ON "SCIENTIFIC REPORTS" (2025) DOI: 10.1038/s41598-025-26235-2
♻ ☆ Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. \texttt{UniWorld-V2}, trained with this framework, achieves \textbf{state-of-the-art} results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available to support further research.
♻ ☆ OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data
Existing benchmarks for multimodal learning in Earth science offer limited, siloed coverage of Earth's spheres and their cross-sphere interactions, typically restricting evaluation to the human-activity sphere of atmosphere and to at most 16 tasks. These limitations: \textit{narrow-source heterogeneity (single/few data sources), constrained scientific granularity, and limited-sphere extensibility}. Therefore, we introduce \textbf{OmniEarth-Bench}, the first multimodal benchmark that systematically spans all six spheres: atmosphere, lithosphere, oceanosphere, cryosphere, biosphere, and human-activity sphere, and cross-spheres. Built with a scalable, modular-topology data inference framework and native multi-observation sources and expert-in-the-loop curation, OmniEarth-Bench produces 29,855 standardized, expert-curated annotations. All annotations are organized into a four-level hierarchy (Sphere, Scenario, Ability, Task), encompassing 109 expert-curated evaluation tasks. Experiments on 9 state-of-the-art MLLMs reveal that even the most advanced models struggle with our benchmarks, where none of them reach 35\% accuracy, revealing systematic gaps in Earth-system cognitive ability. The dataset and evaluation code were released at OmniEarth-Bench (https://anonymous.4open.science/r/OmniEarth-Bench-B1BD).
♻ ☆ Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.
comment: https://genie-envisioner.github.io/
♻ ☆ A Quantitative Evaluation Framework for Explainable AI in Semantic Segmentation
Ensuring transparency and trust in artificial intelligence (AI) models is essential as they are increasingly deployed in safety-critical and high-stakes domains. Explainable AI (XAI) has emerged as a promising approach to address this challenge; however, the rigorous evaluation of XAI methods remains vital for balancing the trade-offs between model complexity, predictive performance, and interpretability. While substantial progress has been made in evaluating XAI for classification tasks, strategies tailored to semantic segmentation remain limited. Moreover, objectively assessing XAI approaches is difficult, since qualitative visual explanations provide only preliminary insights. Such qualitative methods are inherently subjective and cannot ensure the accuracy or stability of explanations. To address these limitations, this work introduces a comprehensive quantitative evaluation framework for assessing XAI in semantic segmentation, accounting for both spatial and contextual task complexities. The framework systematically integrates pixel-level evaluation strategies with carefully designed metrics to yield fine-grained interpretability insights. Simulation results using recently adapted class activation mapping (CAM)-based XAI schemes demonstrate the efficiency, robustness, and reliability of the proposed methodology. These findings advance the development of transparent, trustworthy, and accountable semantic segmentation models.
♻ ☆ Breaking Down Monocular Ambiguity: Exploiting Temporal Evolution for 3D Lane Detection
Monocular 3D lane detection aims to estimate the 3D position of lanes from frontal-view (FV) images. However, existing methods are fundamentally constrained by the inherent ambiguity of single-frame input, which leads to inaccurate geometric predictions and poor lane integrity, especially for distant lanes.To overcome this, we propose to unlock the rich information embedded in the temporal evolution of the scene as the vehicle moves. Our proposed Geometry-aware Temporal Aggregation Network (GTA-Net) systematically leverages the temporal information from complementary perspectives.First, Temporal Geometry Enhancement Module (TGEM) learns geometric consistency across consecutive frames, effectively recovering depth information from motion to build a reliable 3D scene representation.Second, to enhance lane integrity, Temporal Instance-aware Query Generation (TIQG) module aggregates instance cues from past and present frames. Crucially, for lanes that are ambiguous in the current view, TIQG innovatively synthesizes a pseudo future perspective to generate queries that reveal lanes which would otherwise be missed.The experiments demonstrate that GTA-Net achieves new SoTA results, significantly outperforming existing monocular 3D lane detection solutions.
♻ ☆ Cross-modal Diffusion Modelling for Super-resolved Spatial Transcriptomics
The recent advancement of spatial transcriptomics (ST) allows to characterize spatial gene expression within tissue for discovery research. However, current ST platforms suffer from low resolution, hindering in-depth understanding of spatial gene expression. Super-resolution approaches promise to enhance ST maps by integrating histology images with gene expressions of profiled tissue spots. However, current super-resolution methods are limited by restoration uncertainty and mode collapse. Although diffusion models have shown promise in capturing complex interactions between multi-modal conditions, it remains a challenge to integrate histology images and gene expression for super-resolved ST maps. This paper proposes a cross-modal conditional diffusion model for super-resolving ST maps with the guidance of histology images. Specifically, we design a multi-modal disentangling network with cross-modal adaptive modulation to utilize complementary information from histology images and spatial gene expression. Moreover, we propose a dynamic cross-attention modelling strategy to extract hierarchical cell-to-tissue information from histology images. Lastly, we propose a co-expression-based gene-correlation graph network to model the co-expression relationship of multiple genes. Experiments show that our method outperforms other state-of-the-art methods in ST super-resolution on three public datasets.
♻ ☆ Progressive Growing of Patch Size: Curriculum Learning for Accelerated and Improved Medical Image Segmentation MICCAI2024
In this work, we introduce Progressive Growing of Patch Size, an automatic curriculum learning approach for 3D medical image segmentation. Our approach progressively increases the patch size during model training, resulting in an improved class balance for smaller patch sizes and accelerated convergence of the training process. We evaluate our curriculum approach in two settings: a resource-efficient mode and a performance mode, both regarding Dice score performance and computational costs across 15 diverse and popular 3D medical image segmentation tasks. The resource-efficient mode matches the Dice score performance of the conventional constant patch size sampling baseline with a notable reduction in training time to only 44%. The performance mode improves upon constant patch size segmentation results, achieving a statistically significant relative mean performance gain of 1.28% in Dice Score. Remarkably, across all 15 tasks, our proposed performance mode manages to surpass the constant patch size baseline in Dice Score performance, while simultaneously reducing training time to only 89%. The benefits are particularly pronounced for highly imbalanced tasks such as lesion segmentation tasks. Rigorous experiments demonstrate that our performance mode not only improves mean segmentation performance but also reduces performance variance, yielding more trustworthy model comparison. Furthermore, our findings reveal that the proposed curriculum sampling is not tied to a specific architecture but represents a broadly applicable strategy that consistently boosts performance across diverse segmentation models, including UNet, UNETR, and SwinUNETR. In summary, we show that this simple yet elegant transformation on input data substantially improves both Dice Score performance and training runtime, while being compatible across diverse segmentation backbones.
comment: Journal Extension of "Progressive Growing of Patch Size: Resource-Efficient Curriculum Learning for Dense Prediction Tasks" (MICCAI2024) submitted to MedIA
♻ ☆ CGF-DETR: Cross-Gated Fusion DETR for Enhanced Pneumonia Detection in Chest X-rays
Pneumonia remains a leading cause of morbidity and mortality worldwide, necessitating accurate and efficient automated detection systems. While recent transformer-based detectors like RT-DETR have shown promise in object detection tasks, their application to medical imaging, particularly pneumonia detection in chest X-rays, remains underexplored. This paper presents CGF-DETR, an enhanced real-time detection transformer specifically designed for pneumonia detection. We introduce XFABlock in the backbone to improve multi-scale feature extraction through convolutional attention mechanisms integrated with CSP architecture. To achieve efficient feature aggregation, we propose SPGA module that replaces standard multi-head attention with dynamic gating mechanisms and single-head self-attention. Additionally, GCFC3 is designed for the neck to enhance feature representation through multi-path convolution fusion while maintaining real-time performance via structural re-parameterization. Extensive experiments on the RSNA Pneumonia Detection dataset demonstrate that CGF-DETR achieves 82.2% mAP@0.5, outperforming the baseline RT-DETR-l by 3.7% while maintaining comparable inference speed at 48.1 FPS. Our ablation studies confirm that each proposed module contributes meaningfully to the overall performance improvement, with the complete model achieving 50.4% mAP@[0.5:0.95]
♻ ☆ Improving Generalization in MRI-Based Deep Learning Models for Total Knee Replacement Prediction
Knee osteoarthritis (KOA) is a common joint disease that causes pain and mobility issues. While MRI-based deep learning models have demonstrated superior performance in predicting total knee replacement (TKR) and disease progression, their generalizability remains challenging, particularly when applied to imaging data from different sources. In this study, we show that replacing batch normalization with instance normalization, using data augmentation, and applying contrastive loss improves generalization. For training and evaluation, we used MRI data from the Osteoarthritis Initiative (OAI) database, considering sagittal fat-suppressed intermediate-weighted turbo spin-echo (FS-IW-TSE) images as the source domain and sagittal fat-suppressed three-dimensional (3D) dual-echo in steady state (DESS) images as the target domain. The results demonstrated a statistically significant improvement in classification metrics across both domains by replacing batch normalization with instance normalization in the baseline model, generating augmented input views using the Global Intensity Non-linear (GIN) augmentation method, and incorporating a supervised contrastive loss alongside the classification loss to align representations of samples with the same label. The GIN method with contrastive loss performed better than all evaluated single-source domain generalization methods when using 3D instance normalization. Comparing GIN with and without contrastive loss (for both normalization types) showed that adding contrastive loss consistently led to better performance.
♻ ☆ Research on Expressway Congestion Warning Technology Based on YOLOv11-DIoU and GRU-Attention
Expressway traffic congestion severely reduces travel efficiency and hinders regional connectivity. Existing "detection-prediction" systems have critical flaws: low vehicle perception accuracy under occlusion and loss of long-sequence dependencies in congestion forecasting. This study proposes an integrated technical framework to resolve these issues.For traffic flow perception, two baseline algorithms were optimized. Traditional YOLOv11 was upgraded to YOLOv11-DIoU by replacing GIoU Loss with DIoU Loss, and DeepSort was improved by fusing Mahalanobis (motion) and cosine (appearance) distances. Experiments on Chang-Shen Expressway videos showed YOLOv11-DIoU achieved 95.7\% mAP (6.5 percentage points higher than baseline) with 5.3\% occlusion miss rate. DeepSort reached 93.8\% MOTA (11.3 percentage points higher than SORT) with only 4 ID switches. Using the Greenberg model (for 10-15 vehicles/km high-density scenarios), speed and density showed a strong negative correlation (r=-0.97), conforming to traffic flow theory. For congestion warning, a GRU-Attention model was built to capture congestion precursors. Trained 300 epochs with flow, density, and speed, it achieved 99.7\% test accuracy (7-9 percentage points higher than traditional GRU). In 10-minute advance warnings for 30-minute congestion, time error was $\leq$ 1 minute. Validation with an independent video showed 95\% warning accuracy, over 90\% spatial overlap of congestion points, and stable performance in high-flow ($>$5 vehicles/second) scenarios.This framework provides quantitative support for expressway congestion control, with promising intelligent transportation applications.
♻ ☆ Towards classification-based representation learning for place recognition on LiDAR scans
Place recognition is a crucial task in autonomous driving, allowing vehicles to determine their position using sensor data. While most existing methods rely on contrastive learning, we explore an alternative approach by framing place recognition as a multi-class classification problem. Our method assigns discrete location labels to LiDAR scans and trains an encoder-decoder model to classify each scan's position directly. We evaluate this approach on the NuScenes dataset and show that it achieves competitive performance compared to contrastive learning-based methods while offering advantages in training efficiency and stability.
♻ ☆ MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
Current video-to-audio (V2A) methods struggle in complex multi-event scenarios (video scenarios involving multiple sound sources, sound events, or transitions) due to two critical limitations. First, existing methods face challenges in precisely aligning intricate semantic information together with rapid dynamic features. Second, foundational training lacks quantitative preference optimization for semantic-temporal alignment and audio quality. As a result, it fails to enhance integrated generation quality in cluttered multi-event scenes. To address these core limitations, this study proposes a novel V2A framework: MultiSoundGen. It introduces direct preference optimization (DPO) into the V2A domain, leveraging audio-visual pretraining (AVP) to enhance performance in complex multi-event scenarios. Our contributions include two key innovations: the first is SlowFast Contrastive AVP (SF-CAVP), a pioneering AVP model with a unified dual-stream architecture. SF-CAVP explicitly aligns core semantic representations and rapid dynamic features of audio-visual data to handle multi-event complexity; second, we integrate the DPO method into V2A task and propose AVP-Ranked Preference Optimization (AVP-RPO). It uses SF-CAVP as a reward model to quantify and prioritize critical semantic-temporal matches while enhancing audio quality. Experiments demonstrate that MultiSoundGen achieves state-of-the-art (SOTA) performance in multi-event scenarios, delivering comprehensive gains across distribution matching, audio quality, semantic alignment, and temporal synchronization. Demos are available at https://v2aresearch.github.io/MultiSoundGen/.
♻ ☆ CWSSNet: Hyperspectral Image Classification Enhanced by Wavelet Domain Convolution
Hyperspectral remote sensing technology has significant application value in fields such as forestry ecology and precision agriculture, while also putting forward higher requirements for fine ground object classification. However, although hyperspectral images are rich in spectral information and can improve recognition accuracy, they tend to cause prominent feature redundancy due to their numerous bands, high dimensionality, and spectral mixing characteristics. To address this, this study used hyperspectral images from the ZY1F satellite as a data source and selected Yugan County, Shangrao City, Jiangxi Province as the research area to perform ground object classification research. A classification framework named CWSSNet was proposed, which integrates 3D spectral-spatial features and wavelet convolution. This framework integrates multimodal information us-ing a multiscale convolutional attention module and breaks through the classification performance bottleneck of traditional methods by introducing multi-band decomposition and convolution operations in the wavelet domain. The experiments showed that CWSSNet achieved 74.50\%, 82.73\%, and 84.94\% in mean Intersection over Union (mIoU), mean Accuracy (mAcc), and mean F1-score (mF1) respectively in Yugan County. It also obtained the highest Intersection over Union (IoU) in the classifica-tion of water bodies, vegetation, and bare land, demonstrating good robustness. Additionally, when the training set proportion was 70\%, the increase in training time was limited, and the classification effect was close to the optimal level, indicating that the model maintains reliable performance under small-sample training conditions.
♻ ☆ Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which extremely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.
comment: Accepted by IEEE Transactions on Neural Networks and Learning Systems
♻ ☆ WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging
Knowledge distillation (KD) has traditionally relied on a static teacher-student framework, where a large, well-trained teacher transfers knowledge to a single student model. However, these approaches often suffer from knowledge degradation, inefficient supervision, and reliance on either a very strong teacher model or large labeled datasets. To address these, we present the first-ever Weakly-supervised Chain-based KD network (WeCKD) that redefines knowledge transfer through a structured sequence of interconnected models. Unlike conventional KD, it forms a progressive distillation chain, where each model not only learns from its predecessor but also refines the knowledge before passing it forward. This structured knowledge transfer further enhances feature learning and addresses the limitations of one-step KD. Each model in the chain is trained on only a fraction of the dataset and shows that effective learning can be achieved with minimal supervision. Extensive evaluation on six imaging datasets across otoscopic, microscopic, and magnetic resonance imaging modalities shows that it generalizes and outperforms existing methods. Furthermore, the proposed distillation chain resulted in cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data, which highlights its potential for real-world adoption.
♻ ☆ Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
A deep understanding of kinematic structures and movable components is essential for enabling robots to manipulate objects and model their own articulated forms. Such understanding is captured through articulated objects, which are essential for tasks such as physical simulation, motion planning, and policy learning. However, creating these models, particularly for objects with high degrees of freedom (DoF), remains a significant challenge. Existing methods typically rely on motion sequences or strong assumptions from hand-curated datasets, which hinders scalability. In this paper, we introduce Kinematify, an automated framework that synthesizes articulated objects directly from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static geometry. To achieve this, we combine MCTS search for structural inference with geometry-driven optimization for joint reasoning, producing physically consistent and functionally valid descriptions. We evaluate Kinematify on diverse inputs from both synthetic and real-world environments, demonstrating improvements in registration and kinematic topology accuracy over prior work.
comment: project page: https://sites.google.com/deemos.com/kinematify
♻ ☆ SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion
With the rapid advancement of the digital society, the proliferation of satellites in the Satellite Internet of Things (Sat-IoT) has led to the continuous accumulation of large-scale multi-temporal and multi-source images across diverse application scenarios. However, existing methods fail to fully exploit the complementary information embedded in both temporal and source dimensions. For example, Multi-Image Super-Resolution (MISR) enhances reconstruction quality by leveraging temporal complementarity across multiple observations, yet the limited fine-grained texture details in input images constrain its performance. Conversely, pansharpening integrates multi-source images by injecting high-frequency spatial information from panchromatic data, but typically relies on pre-interpolated low-resolution inputs and assumes noise-free alignment, making it highly sensitive to noise and misregistration. To address these issues, we propose SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion. Specifically, SatFusion first employs a Multi-Temporal Image Fusion (MTIF) module to achieve deep feature alignment with the panchromatic image. Then, a Multi-Source Image Fusion (MSIF) module injects fine-grained texture information from the panchromatic data. Finally, a Fusion Composition module adaptively integrates the complementary advantages of both modalities while dynamically refining spectral consistency, supervised by a weighted combination of multiple loss functions. Extensive experiments on the WorldStrat, WV3, QB, and GF2 datasets demonstrate that SatFusion significantly improves fusion quality, robustness under challenging conditions, and generalizability to real-world Sat-IoT scenarios. The code is available at: https://github.com/dllgyufei/SatFusion.git.
♻ ☆ Deep Fourier-embedded Network for RGB and Thermal Salient Object Detection
The rapid development of deep learning has significantly improved salient object detection (SOD) combining both RGB and thermal (RGB-T) images. However, existing Transformer-based RGB-T SOD models with quadratic complexity are memory-intensive, limiting their application in high-resolution bimodal feature fusion. To overcome this limitation, we propose a purely Fourier Transform-based model, namely Deep Fourier-embedded Network (FreqSal), for accurate RGB-T SOD. Specifically, we leverage the efficiency of Fast Fourier Transform with linear complexity to design three key components: (1) To fuse RGB and thermal modalities, we propose Modal-coordinated Perception Attention, which aligns and enhances bimodal Fourier representation in multiple dimensions; (2) To clarify object edges and suppress noise, we design Frequency-decomposed Edge-aware Block, which deeply decomposes and filters Fourier components of low-level features; (3) To accurately decode features, we propose Fourier Residual Channel Attention Block, which prioritizes high-frequency information while aligning channel-wise global relationships. Additionally, even when converged, existing deep learning-based SOD models' predictions still exhibit frequency gaps relative to ground-truth. To address this problem, we propose Co-focus Frequency Loss, which dynamically weights hard frequencies during edge frequency reconstruction by cross-referencing bimodal edge information in the Fourier domain. Extensive experiments on ten bimodal SOD benchmark datasets demonstrate that FreqSal outperforms twenty-nine existing state-of-the-art bimodal SOD models. Comprehensive ablation studies further validate the value and effectiveness of our newly proposed components. The code is available at https://github.com/JoshuaLPF/FreqSal.
comment: Accepted by TCSVT2025
♻ ☆ Crucial-Diff: A Unified Diffusion Model for Crucial Image and Annotation Synthesis in Data-scarce Scenarios
The scarcity of data in various scenarios, such as medical, industry and autonomous driving, leads to model overfitting and dataset imbalance, thus hindering effective detection and segmentation performance. Existing studies employ the generative models to synthesize more training samples to mitigate data scarcity. However, these synthetic samples are repetitive or simplistic and fail to provide "crucial information" that targets the downstream model's weaknesses. Additionally, these methods typically require separate training for different objects, leading to computational inefficiencies. To address these issues, we propose Crucial-Diff, a domain-agnostic framework designed to synthesize crucial samples. Our method integrates two key modules. The Scene Agnostic Feature Extractor (SAFE) utilizes a unified feature extractor to capture target information. The Weakness Aware Sample Miner (WASM) generates hard-to-detect samples using feedback from the detection results of downstream model, which is then fused with the output of SAFE module. Together, our Crucial-Diff framework generates diverse, high-quality training data, achieving a pixel-level AP of 83.63% and an F1-MAX of 78.12% on MVTec. On polyp dataset, Crucial-Diff reaches an mIoU of 81.64% and an mDice of 87.69%. Code is publicly available at https://github.com/JJessicaYao/Crucial-diff.
comment: Accepted by IEEE Transactions on Image Processing (TIP), 2025
♻ ☆ InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmark confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.
♻ ☆ FitPro: A Zero-Shot Framework for Interactive Text-based Pedestrian Retrieval in Open World
Text-based Pedestrian Retrieval (TPR) deals with retrieving specific target pedestrians in visual scenes according to natural language descriptions. Although existing methods have achieved progress under constrained settings, interactive retrieval in the open-world scenario still suffers from limited model generalization and insufficient semantic understanding. To address these challenges, we propose FitPro, an open-world interactive zero-shot TPR framework with enhanced semantic comprehension and cross-scene adaptability. FitPro has three innovative components: Feature Contrastive Decoding (FCD), Incremental Semantic Mining (ISM), and Query-aware Hierarchical Retrieval (QHR). The FCD integrates prompt-guided contrastive decoding to generate high-quality structured pedestrian descriptions from denoised images, effectively alleviating semantic drift in zero-shot scenarios. The ISM constructs holistic pedestrian representations from multi-view observations to achieve global semantic modeling in multi-turn interactions, thereby improving robustness against viewpoint shifts and fine-grained variations in descriptions. The QHR dynamically optimizes the retrieval pipeline according to query types, enabling efficient adaptation to multi-modal and multi-view inputs. Extensive experiments on five public datasets and two evaluation protocols demonstrate that FitPro significantly overcomes the generalization limitations and semantic modeling constraints of existing methods in interactive retrieval, paving the way for practical deployment.
comment: 12pages,6 figures
♻ ☆ 3DViT-GAT: A Unified Atlas-Based 3D Vision Transformer and Graph Learning Framework for Major Depressive Disorder Detection Using Structural MRI Data
Major depressive disorder (MDD) is a prevalent mental health condition that negatively impacts both individual well-being and global public health. Automated detection of MDD using structural magnetic resonance imaging (sMRI) and deep learning (DL) methods holds increasing promise for improving diagnostic accuracy and enabling early intervention. Most existing methods employ either voxel-level features or handcrafted regional representations built from predefined brain atlases, limiting their ability to capture complex brain patterns. This paper develops a unified pipeline that utilizes Vision Transformers (ViTs) for extracting 3D region embeddings from sMRI data and Graph Neural Network (GNN) for classification. We explore two strategies for defining regions: (1) an atlas-based approach using predefined structural and functional brain atlases, and (2) an cube-based method by which ViTs are trained directly to identify regions from uniformly extracted 3D patches. Further, cosine similarity graphs are generated to model interregional relationships, and guide GNN-based classification. Extensive experiments were conducted using the REST-meta-MDD dataset to demonstrate the effectiveness of our model. With stratified 10-fold cross-validation, the best model obtained 81.51\% accuracy, 85.94\% sensitivity, 76.36\% specificity, 80.88\% precision, and 83.33\% F1-score. Further, atlas-based models consistently outperformed the cube-based approach, highlighting the importance of using domain-specific anatomical priors for MDD detection.
comment: 17 pages, 3 figure, 9 tables
♻ ☆ Parameterized Prompt for Incremental Object Detection
Recent studies have demonstrated that incorporating trainable prompts into pretrained models enables effective incremental learning. However, the application of prompts in incremental object detection (IOD) remains underexplored. Existing prompts pool based approaches assume disjoint class sets across incremental tasks, which are unsuitable for IOD as they overlook the inherent co-occurrence phenomenon in detection images. In co-occurring scenarios, unlabeled objects from previous tasks may appear in current task images, leading to confusion in prompts pool. In this paper, we hold that prompt structures should exhibit adaptive consolidation properties across tasks, with constrained updates to prevent catastrophic forgetting. Motivated by this, we introduce Parameterized Prompts for Incremental Object Detection (P$^2$IOD). Leveraging neural networks global evolution properties, P$^2$IOD employs networks as the parameterized prompts to adaptively consolidate knowledge across tasks. To constrain prompts structure updates, P$^2$IOD further engages a parameterized prompts fusion strategy. Extensive experiments on PASCAL VOC2007 and MS COCO datasets demonstrate that P$^2$IOD's effectiveness in IOD and achieves the state-of-the-art performance among existing baselines.
♻ ☆ Light Future: Multimodal Action Frame Prediction via InstructPix2Pix WACV 2026
Predicting future motion trajectories is a critical capability across domains such as robotics, autonomous systems, and human activity forecasting, enabling safer and more intelligent decision-making. This paper proposes a novel, efficient, and lightweight approach for robot action prediction, offering significantly reduced computational cost and inference latency compared to conventional video prediction models. Importantly, it pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks, extending its utility beyond static image editing. We implement a deep learning-based visual prediction framework that forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. We repurpose and fine-tune the InstructPix2Pix model to accept both visual and textual inputs, enabling multimodal future frame prediction. Experiments on the RoboTWin dataset (generated based on real-world scenarios) demonstrate that our method achieves superior SSIM and PSNR compared to state-of-the-art baselines in robot action prediction tasks. Unlike conventional video prediction models that require multiple input frames, heavy computation, and slow inference latency, our approach only needs a single image and a text prompt as input. This lightweight design enables faster inference, reduced GPU demands, and flexible multimodal control, particularly valuable for applications like robotics and sports motion trajectory analytics, where motion trajectory precision is prioritized over visual fidelity.
comment: 9 pages including appendix, 4 tables, 8 figures, to be submitted to WACV 2026
♻ ☆ Diffusion Transformer meets Multi-level Wavelet Spectrum for Single Image Super-Resolution ICCV 2025
Discrete Wavelet Transform (DWT) has been widely explored to enhance the performance of image superresolution (SR). Despite some DWT-based methods improving SR by capturing fine-grained frequency signals, most existing approaches neglect the interrelations among multiscale frequency sub-bands, resulting in inconsistencies and unnatural artifacts in the reconstructed images. To address this challenge, we propose a Diffusion Transformer model based on image Wavelet spectra for SR (DTWSR). DTWSR incorporates the superiority of diffusion models and transformers to capture the interrelations among multiscale frequency sub-bands, leading to a more consistence and realistic SR image. Specifically, we use a Multi-level Discrete Wavelet Transform to decompose images into wavelet spectra. A pyramid tokenization method is proposed which embeds the spectra into a sequence of tokens for transformer model, facilitating to capture features from both spatial and frequency domain. A dual-decoder is designed elaborately to handle the distinct variances in low-frequency and high-frequency sub-bands, without omitting their alignment in image generation. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method, with high performance on both perception quality and fidelity.
comment: ICCV 2025 Oral Paper
♻ ☆ Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization
Cross-view geo-localization (CVGL) between drone and satellite imagery remains challenging due to severe viewpoint gaps and the presence of hard negatives, which are visually similar but geographically mismatched samples. Existing mining or reweighting strategies often use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence. In this paper, we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy. At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. At the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a training-progress signal to attenuate noisy gradients during early optimization and progressively enhance hard-negative mining as training matures. Experiments on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness and robustness of the proposed DPHR, achieving consistent improvements over state-of-the-art methods.
comment: 5 pages, 3 figures
♻ ☆ Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras NeurIPS 2025
Event cameras offer microsecond-level latency and robustness to motion blur, making them ideal for understanding dynamic environments. Yet, connecting these asynchronous streams to human language remains an open challenge. We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, we provide over 30,000 validated referring expressions, each enriched with four grounding attributes -- appearance, status, relation to viewer, and relation to other objects -- bridging spatial, temporal, and relational reasoning. To fully exploit these cues, we propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). Our method adapts to different modalities and scene dynamics, achieving consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings. We hope our dataset and approach will establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy.
comment: NeurIPS 2025 Spotlight; 43 pages, 17 figures, 16 tables; Project Page at https://talk2event.github.io
♻ ☆ FractalForensics: Proactive Deepfake Detection and Localization via Fractal Watermarks
Proactive Deepfake detection via robust watermarks has seen interest ever since passive Deepfake detectors encountered challenges in identifying high-quality synthetic images. However, while demonstrating reasonable detection performance, they lack localization functionality and explainability in detection results. Additionally, the unstable robustness of watermarks can significantly affect the detection performance. In this study, we propose novel fractal watermarks for proactive Deepfake detection and localization, namely FractalForensics. Benefiting from the characteristics of fractals, we devise a parameter-driven watermark generation pipeline that derives fractal-based watermarks and performs one-way encryption of the selected parameters. Subsequently, we propose a semi-fragile watermarking framework for watermark embedding and recovery, trained to be robust against benign image processing operations and fragile when facing Deepfake manipulations in a black-box setting. Moreover, we introduce an entry-to-patch strategy that implicitly embeds the watermark matrix entries into image patches at corresponding positions, achieving localization of Deepfake manipulations. Extensive experiments demonstrate satisfactory robustness and fragility of our approach against common image processing operations and Deepfake manipulations, outperforming state-of-the-art semi-fragile watermarking algorithms and passive detectors for Deepfake detection. Furthermore, by highlighting the areas manipulated, our method provides explainability for the proactive Deepfake detection results.
comment: ACM Multimedia 2025 Oral
♻ ☆ Detection and Geographic Localization of Natural Objects in the Wild: A Case Study on Palms
Palms are ecologically and economically indicators of tropical forest health, biodiversity, and human impact that support local economies and global forest product supply chains. While palm detection in plantations is well-studied, efforts to map naturally occurring palms in dense forests remain limited by overlapping crowns, uneven shading, and heterogeneous landscapes. We develop PRISM (Processing, Inference, Segmentation, and Mapping), a flexible pipeline for detecting and localizing palms in dense tropical forests using large orthomosaic images. Orthomosaics are created from thousands of aerial images and spanning several to hundreds of gigabytes. Our contributions are threefold. First, we construct a large UAV-derived orthomosaic dataset collected across 21 ecologically diverse sites in western Ecuador, annotated with 8,830 bounding boxes and 5,026 palm center points. Second, we evaluate multiple state-of-the-art object detectors based on efficiency and performance, integrating zero-shot SAM 2 as the segmentation backbone, and refining the results for precise geographic mapping. Third, we apply calibration methods to align confidence scores with IoU and explore saliency maps for feature explainability. Though optimized for palms, PRISM is adaptable for identifying other natural objects, such as eastern white pines. Future work will explore transfer learning for lower-resolution datasets (0.5 to 1m).
comment: 15 pages, 8 figures, 4 tables
♻ ☆ Towards Predicting Any Human Trajectory In Context NeurIPS 2025
Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, the need to fine-tune for each new scenario is often impractical for deployment on edge devices. To address this challenge, we introduce TrajICL, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables adaptation without fine-tuning on the scenario-specific data at inference time without requiring weight updates. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. Project Page: https://fujiry0.github.io/TrajICL-project-page/.
comment: NeurIPS 2025
♻ ☆ Real-Time Neural Video Compression with Unified Intra and Inter Coding
Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 12.1% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.
comment: 10 pages
♻ ☆ ID-Composer: Multi-Subject Video Synthesis with Hierarchical Identity Preservation
Video generative models pretrained on large-scale datasets can produce high-quality videos, but are often conditioned on text or a single image, limiting controllability and applicability. We introduce ID-Composer, a novel framework that addresses this gap by tackling multi-subject video generation from a text prompt and reference images. This task is challenging as it requires preserving subject identities, integrating semantics across subjects and modalities, and maintaining temporal consistency. To faithfully preserve the subject consistency and textual information in synthesized videos, ID-Composer designs a hierarchical identity-preserving attention mechanism, which effectively aggregates features within and across subjects and modalities. To effectively allow for the semantic following of user intention, we introduce semantic understanding via pretrained vision-language model (VLM), leveraging VLM's superior semantic understanding to provide fine-grained guidance and capture complex interactions between multiple subjects. Considering that standard diffusion loss often fails in aligning the critical concepts like subject ID, we employ an online reinforcement learning phase to drive the overall training objective of ID-Composer into RLVR. Extensive experiments demonstrate that our model surpasses existing methods in identity preservation, temporal consistency, and video quality.
♻ ☆ Joint Lossless Compression and Steganography for Medical Images via Large Language Models
Recently, large language models (LLMs) have driven promising progress in lossless image compression. However, directly adopting existing paradigms for medical images suffers from an unsatisfactory trade-off between compression performance and efficiency. Moreover, existing LLM-based compressors often overlook the security of the compression process, which is critical in modern medical scenarios. To this end, we propose a novel joint lossless compression and steganography framework. Inspired by bit plane slicing (BPS), we find it feasible to securely embed privacy messages into medical images in an invisible manner. Based on this insight, an adaptive modalities decomposition strategy is first devised to partition the entire image into two segments, providing global and local modalities for subsequent dual-path lossless compression. During this dual-path stage, we innovatively propose a segmented message steganography algorithm within the local modality path to ensure the security of the compression process. Coupled with the proposed anatomical priors-based low-rank adaptation (A-LoRA) fine-tuning strategy, extensive experimental results demonstrate the superiority of our proposed method in terms of compression ratios, efficiency, and security. The source code will be made publicly available.
♻ ☆ Learning with Category-Equivariant Architectures for Human Activity Recognition
We propose CatEquiv, a category-equivariant neural network for Human Activity Recognition (HAR) from inertial sensors that systematically encodes temporal, amplitude, and structural symmetries. We introduce a symmetry category that jointly represents cyclic time shifts, positive gain scalings, and the sensor-hierarchy poset, capturing the categorical symmetry structure of the data. CatEquiv achieves equivariance with respect to the categorical symmetry product. On UCI-HAR under out-of-distribution perturbations, CatEquiv attains markedly higher robustness compared with circularly padded CNNs and plain CNNs. These results demonstrate that enforcing categorical symmetries yields strong invariance and generalization without additional model capacity.
♻ ☆ WXSOD: A Benchmark for Robust Salient Object Detection in Adverse Weather Conditions
Salient object detection (SOD) in complex environments remains a challenging research topic. Most existing methods perform well in natural scenes with negligible noise, and tend to leverage multi-modal information (e.g., depth and infrared) to enhance accuracy. However, few studies are concerned with the damage of weather noise on SOD performance due to the lack of dataset with pixel-wise annotations. To bridge this gap, this paper introduces a novel Weather-eXtended Salient Object Detection (WXSOD) dataset. It consists of 14,945 RGB images with diverse weather noise, along with the corresponding ground truth annotations and weather labels. To verify algorithm generalization, WXSOD contains two test sets, i.e., a synthesized test set and a real test set. The former is generated by adding weather noise to clean images, while the latter contains real-world weather noise. Based on WXSOD, we propose an efficient baseline, termed Weather-aware Feature Aggregation Network (WFANet), which adopts a fully supervised two-branch architecture. Specifically, the weather prediction branch mines weather-related deep features, while the saliency detection branch fuses semantic features extracted from the backbone with weather features for SOD. Comprehensive comparisons against 17 SOD methods shows that our WFANet achieves superior performance on WXSOD. The code and benchmark results will be made publicly available at https://github.com/C-water/WXSOD
comment: Under review
♻ ☆ CrossRay3D: Geometry and Distribution Guidance for Efficient Multimodal 3D Detection
The sparse cross-modality detector offers more advantages than its counterpart, the Bird's-Eye-View (BEV) detector, particularly in terms of adaptability for downstream tasks and computational cost savings. However, existing sparse detectors overlook the quality of token representation, leaving it with a sub-optimal foreground quality and limited performance. In this paper, we identify that the geometric structure preserved and the class distribution are the key to improving the performance of the sparse detector, and propose a Sparse Selector (SS). The core module of SS is Ray-Aware Supervision (RAS), which preserves rich geometric information during the training stage, and Class-Balanced Supervision, which adaptively reweights the salience of class semantics, ensuring that tokens associated with small objects are retained during token sampling. Thereby, outperforming other sparse multi-modal detectors in the representation of tokens. Additionally, we design Ray Positional Encoding (Ray PE) to address the distribution differences between the LiDAR modality and the image. Finally, we integrate the aforementioned module into an end-to-end sparse multi-modality detector, dubbed CrossRay3D. Experiments show that, on the challenging nuScenes benchmark, CrossRay3D achieves state-of-the-art performance with 72.4 mAP and 74.7 NDS, while running 1.84 faster than other leading methods. Moreover, CrossRay3D demonstrates strong robustness even in scenarios where LiDAR or camera data are partially or entirely missing.
comment: 13 pages
♻ ☆ DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but often fail to generalize to long-tail scenarios due to the lack of essential world knowledge to understand and reason about surrounding environments. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module directly generating semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. The experiment on the ICCV 2025 Autonomous Grand Challenge leaderboard shows that DiffVLA++ achieves EPDMS of 49.12.
♻ ☆ 3DBonsai: Structure-Aware Bonsai Modeling Using Conditioned 3D Gaussian Splatting
Recent advancements in text-to-3D generation have shown remarkable results by leveraging 3D priors in combination with 2D diffusion. However, previous methods utilize 3D priors that lack detailed and complex structural information, limiting them to generating simple objects and presenting challenges for creating intricate structures such as bonsai. In this paper, we propose 3DBonsai, a novel text-to-3D framework for generating 3D bonsai with complex structures. Technically, we first design a trainable 3D space colonization algorithm to produce bonsai structures, which are then enhanced through random sampling and point cloud augmentation to serve as the 3D Gaussian priors. We introduce two bonsai generation pipelines with distinct structural levels: fine structure conditioned generation, which initializes 3D Gaussians using a 3D structure prior to produce detailed and complex bonsai, and coarse structure conditioned generation, which employs a multi-view structure consistency module to align 2D and 3D structures. Moreover, we have compiled a unified 2D and 3D Chinese-style bonsai dataset. Our experimental results demonstrate that 3DBonsai significantly outperforms existing methods, providing a new benchmark for structure-aware 3D bonsai generation.
♻ ☆ DRIP: Dynamic patch Reduction via Interpretable Pooling
Recently, the advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, owing to the large-scale and hence expensive pretraining, the efficiency concern has discouraged researchers from attempting to pretrain a vision language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification/zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact into scientific domains.
comment: Need more refinement
♻ ☆ Weakly Supervised Object Segmentation by Background Conditional Divergence
As a computer vision task, automatic object segmentation remains challenging in specialized image domains without massive labeled data, such as synthetic aperture sonar images, remote sensing, biomedical imaging, etc. In any domain, obtaining pixel-wise segmentation masks is expensive. In this work, we propose a method for training a masking network to perform binary object segmentation using weak supervision in the form of image-wise presence or absence of an object of interest, which provides less information but may be obtained more quickly from manual or automatic labeling. A key step in our method is that the segmented objects can be placed into background-only images to create realistic images of the objects with counterfactual backgrounds. To create a contrast between the original and counterfactual background images, we propose to first cluster the background-only images and then, during learning, create counterfactual images that blend objects segmented from their original source backgrounds to backgrounds chosen from a targeted cluster. One term in the training loss is the divergence between these counterfactual images and the real object images with backgrounds of the target cluster. The other term is a supervised loss for background-only images. While an adversarial critic could provide the divergence, we use sample-based divergences. We conduct experiments on side-scan and synthetic aperture sonar in which our approach succeeds compared to previous unsupervised segmentation baselines that were only tested on natural images. Furthermore, to show generality we extend our experiments to natural images, obtaining reasonable performance with our method that avoids pretrained networks, generative networks, and adversarial critics. The code for this work can be found at \href{GitHub}{https://github.com/bakerhassan/WSOS}.
comment: Published in TMLR: https://openreview.net/forum?id=2JJZhfGvMW
♻ ☆ Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification NeurIPS 2025
Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59-without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.
comment: Published in NeurIPS 2025
♻ ☆ Visual Program Distillation with Template-Based Augmentation EMNLP
Adapting visual programming or prompting large language models (LLMs) to generate executable code for visual tasks like visual question answering (VQA) for specialized tasks or domains remains challenging due to high annotation and inference costs. We propose a low-cost visual program distillation method that can be used for models with at most 1 billion parameters and requires no human-generated program annotations. We achieve this through synthetic data augmentation based on decoupling programs into higher-level skills, called templates, and their corresponding arguments. Experimental results show that, with a relatively small amount of question/answer data, small language models can generate high-quality specialized visual programs with the added benefit of much faster inference
comment: EMNLP Camera Ready
Artificial Intelligence 10
☆ Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available. %We release an open-source implementation to support continued research on scalable and reliable omni-modal reasoning.
comment: 16 pages, 7 figures, 14 tables. Under Review
☆ Neurosymbolic Deep Learning Semantics
Artificial Intelligence (AI) is a powerful new language of science as evidenced by recent Nobel Prizes in chemistry and physics that recognized contributions to AI applied to those areas. Yet, this new language lacks semantics, which makes AI's scientific discoveries unsatisfactory at best. With the purpose of uncovering new facts but also improving our understanding of the world, AI-based science requires formalization through a framework capable of translating insight into comprehensible scientific knowledge. In this paper, we argue that logic offers an adequate framework. In particular, we use logic in a neurosymbolic framework to offer a much needed semantics for deep learning, the neural network-based technology of current AI. Deep learning and neurosymbolic AI lack a general set of conditions to ensure that desirable properties are satisfied. Instead, there is a plethora of encoding and knowledge extraction approaches designed for particular cases. To rectify this, we introduced a framework for semantic encoding, making explicit the mapping between neural networks and logic, and characterizing the common ingredients of the various existing approaches. In this paper, we describe succinctly and exemplify how logical semantics and neural networks are linked through this framework, we review some of the most prominent approaches and techniques developed for neural encoding and knowledge extraction, provide a formal definition of our framework, and discuss some of the difficulties of identifying a semantic encoding in practice in light of analogous problems in the philosophy of mind.
☆ Kosmos: An AI Scientist for Autonomous Discovery
Data-driven scientific discovery requires iterative cycles of literature search, hypothesis generation, and data analysis. Substantial progress has been made towards AI agents that can automate scientific research, but all such agents remain limited in the number of actions they can take before losing coherence, thus limiting the depth of their findings. Here we present Kosmos, an AI scientist that automates data-driven discovery. Given an open-ended objective and a dataset, Kosmos runs for up to 12 hours performing cycles of parallel data analysis, literature search, and hypothesis generation before synthesizing discoveries into scientific reports. Unlike prior systems, Kosmos uses a structured world model to share information between a data analysis agent and a literature search agent. The world model enables Kosmos to coherently pursue the specified objective over 200 agent rollouts, collectively executing an average of 42,000 lines of code and reading 1,500 papers per run. Kosmos cites all statements in its reports with code or primary literature, ensuring its reasoning is traceable. Independent scientists found 79.4% of statements in Kosmos reports to be accurate, and collaborators reported that a single 20-cycle Kosmos run performed the equivalent of 6 months of their own research time on average. Furthermore, collaborators reported that the number of valuable scientific findings generated scales linearly with Kosmos cycles (tested up to 20 cycles). We highlight seven discoveries made by Kosmos that span metabolomics, materials science, neuroscience, and statistical genetics. Three discoveries independently reproduce findings from preprinted or unpublished manuscripts that were not accessed by Kosmos at runtime, while four make novel contributions to the scientific literature.
☆ Optimizing AI Agent Attacks With Synthetic Data
As AI deployments become more complex and high-stakes, it becomes increasingly important to be able to estimate their risk. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies. This can be challenging in complex agentic environments where compute constraints leave us data-poor. In this work, we show how to optimize attack policies in SHADE-Arena, a dataset of diverse realistic control environments. We do this by decomposing attack capability into five constituent skills -- suspicion modeling, attack selection, plan synthesis, execution and subtlety -- and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters using this simulation, and then show that the results transfer to SHADE-Arena. This results in a substantial improvement in attack strength, reducing safety score from a baseline of 0.87 to 0.41 using our scaffold.
☆ Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning
Tabular data remain the predominant format for real-world applications. Yet, developing effective neural models for tabular data remains challenging due to heterogeneous feature types and complex interactions occurring at multiple scales. Recent advances in tabular in-context learning (ICL), such as TabPFN and TabICL, have achieved state-of-the-art performance comparable to gradient-boosted trees (GBTs) without task-specific fine-tuning. However, current architectures exhibit key limitations: (1) single-scale feature processing that overlooks hierarchical dependencies, (2) dense attention with quadratic scaling in table width, and (3) strictly sequential component processing that prevents iterative representation refinement and cross-component communication. To address these challenges, we introduce Orion-MSP, a tabular ICL architecture featuring three key innovations: (1) multi-scale processing to capture hierarchical feature interactions; (2) block-sparse attention combining windowed, global, and random patterns for scalable efficiency and long-range connectivity; and (3) a Perceiver-style memory enabling safe bidirectional information flow across components. Across diverse benchmarks, Orion-MSP matches or surpasses state-of-the-art performance while scaling effectively to high-dimensional tables, establishing a new standard for efficient tabular in-context learning. The model is publicly available at https://github.com/Lexsi-Labs/Orion-MSP .
☆ Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities
As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that might be performed with long context. We introduce Oolong, a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level, and then aggregating these analyses to answer distributional questions. Oolong is separated into two task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can easily ablate components of the reasoning problem; and Oolong-real, a downstream setting which requires reasoning over real-world conversational data. Oolong requires models to reason over large quantities of examples, to perform both classification and counting in-context, and to reason over temporal and user relations. Even frontier models struggle on Oolong, with GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy on both splits at 128K. We release the data and evaluation harness for Oolong to enable further development of models that can reason over large quantities of text.
comment: Preprint
☆ Assessing win strength in MLB win prediction models
In Major League Baseball, strategy and planning are major factors in determining the outcome of a game. Previous studies have aided this by building machine learning models for predicting the winning team of any given game. We extend this work by training a comprehensive set of machine learning models using a common dataset. In addition, we relate the win probabilities produced by these models to win strength as measured by score differential. In doing so we show that the most common machine learning models do indeed demonstrate a relationship between predicted win probability and the strength of the win. Finally, we analyze the results of using predicted win probabilities as a decision making mechanism on run-line betting. We demonstrate positive returns when utilizing appropriate betting strategies, and show that naive use of machine learning models for betting lead to significant loses.
♻ ☆ Imagine Beyond! Distributionally Robust Auto-Encoding for State Space Coverage in Online Reinforcement Learning
Goal-Conditioned Reinforcement Learning (GCRL) enables agents to autonomously acquire diverse behaviors, but faces major challenges in visual environments due to high-dimensional, semantically sparse observations. In the online setting, where agents learn representations while exploring, the latent space evolves with the agent's policy, to capture newly discovered areas of the environment. However, without incentivization to maximize state coverage in the representation, classical approaches based on auto-encoders may converge to latent spaces that over-represent a restricted set of states frequently visited by the agent. This is exacerbated in an intrinsic motivation setting, where the agent uses the distribution encoded in the latent space to sample the goals it learns to master. To address this issue, we propose to progressively enforce distributional shifts towards a uniform distribution over the full state space, to ensure a full coverage of skills that can be learned in the environment. We introduce DRAG (Distributionally Robust Auto-Encoding for GCRL), a method that combines the $\beta$-VAE framework with Distributionally Robust Optimization. DRAG leverages an adversarial neural weighter of training states of the VAE, to account for the mismatch between the current data distribution and unseen parts of the environment. This allows the agent to construct semantically meaningful latent spaces beyond its immediate experience. Our approach improves state space coverage and downstream control performance on hard exploration environments such as mazes and robotic control involving walls to bypass, without pre-training nor prior environment knowledge.
♻ ☆ ValueCompass: A Framework for Measuring Contextual Value Alignment Between Human and LLMs
As AI systems become more advanced, ensuring their alignment with a diverse range of individuals and societal values becomes increasingly critical. But how can we capture fundamental human values and assess the degree to which AI systems align with them? We introduce ValueCompass, a framework of fundamental values, grounded in psychological theory and a systematic review, to identify and evaluate human-AI alignment. We apply ValueCompass to measure the value alignment of humans and large language models (LLMs) across four real-world scenarios: collaborative writing, education, public sectors, and healthcare. Our findings reveal concerning misalignments between humans and LLMs, such as humans frequently endorse values like "National Security" which were largely rejected by LLMs. We also observe that values differ across scenarios, highlighting the need for context-aware AI alignment strategies. This work provides valuable insights into the design space of human-AI alignment, laying the foundations for developing AI systems that responsibly reflect societal values and ethics.
♻ ☆ Hybrid Quantum-Classical Recurrent Neural Networks
We present a hybrid quantum-classical recurrent neural network (QRNN) architecture in which the recurrent core is realized as a parametrized quantum circuit (PQC) controlled by a classical feedforward network. The hidden state is the quantum state of an $n$-qubit PQC in an exponentially large Hilbert space $\mathbb{C}^{2^n}$, which serves as a coherent recurrent quantum memory. The PQC is unitary by construction, making the hidden-state evolution norm-preserving without external constraints. At each timestep, mid-circuit Pauli expectation-value readouts are combined with the input embedding and processed by the feedforward network, which provides explicit classical nonlinearity. The outputs parametrize the PQC, which updates the hidden state via unitary dynamics. The QRNN is compact and physically consistent, and it unifies (i) unitary recurrence as a high-capacity memory, (ii) partial observation via mid-circuit readouts, and (iii) nonlinear classical control for input-conditioned parametrization. We evaluate the model in simulation with up to 14 qubits on sentiment analysis, MNIST, permuted MNIST, copying memory, and language modeling. For sequence-to-sequence learning, we further devise a soft attention mechanism over the mid-circuit readouts and show its effectiveness for machine translation. To our knowledge, this is the first model (RNN or otherwise) grounded in quantum operations to achieve competitive performance against strong classical baselines across a broad class of sequence-learning tasks.
comment: Clarified expectation-value-based readouts and made minor text edits
Machine Learning 222
☆ Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available. %We release an open-source implementation to support continued research on scalable and reliable omni-modal reasoning.
comment: 16 pages, 7 figures, 14 tables. Under Review
☆ In Good GRACEs: Principled Teacher Selection for Knowledge Distillation
Knowledge distillation is an efficient strategy to use data generated by large "teacher" language models to train smaller capable "student" models, but selecting the optimal teacher for a specific student-task combination requires expensive trial-and-error. We propose a lightweight score called GRACE to quantify how effective a teacher will be for post-training a student model. GRACE measures distributional properties of the student's gradients without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE connects to leave-one-out stability of gradient-based algorithms, which controls the generalization performance of the distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86% Spearman correlation) with the performance of the distilled LLaMA and OLMo students. In particular, training a student using the GRACE-selected teacher can improve the performance by up to 7.4% over naively using the best-performing teacher. Further, GRACE can provide guidance on crucial design choices in distillation, including (1) the best temperature to use when generating from the teacher, (2) the best teacher to use given a size constraint, and (3) the best teacher to use within a specific model family. Altogether, our findings demonstrate that GRACE can efficiently and effectively identify a strongly compatible teacher for a given student and provide fine-grained guidance on how to perform distillation.
☆ TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System
Large-scale data has driven breakthroughs in robotics, from language models to vision-language-action models in bimanual manipulation. However, humanoid robotics lacks equally effective data collection frameworks. Existing humanoid teleoperation systems either use decoupled control or depend on expensive motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid teleoperation and data collection system that preserves full whole-body control while advancing scalability. Our system leverages PICO4U VR for obtaining real-time whole-body human motions, with a custom 2-DoF robot neck (cost around $250) for egocentric vision, enabling holistic human-to-humanoid control. We demonstrate long-horizon dexterous and mobile humanoid skills and we can collect 100 demonstrations in 15 minutes with an almost 100% success rate. Building on this pipeline, we propose a hierarchical visuomotor policy framework that autonomously controls the full humanoid body based on egocentric vision. Our visuomotor policy successfully demonstrates whole-body dexterous manipulation and dynamic kicking tasks. The entire system is fully reproducible and open-sourced at https://yanjieze.com/TWIST2 . Our collected dataset is also open-sourced at https://twist-data.github.io .
comment: Website: https://yanjieze.com/TWIST2
☆ GeoCrossBench: Cross-Band Generalization for Remote Sensing
The number and diversity of remote sensing satellites grows over time, while the vast majority of labeled data comes from older satellites. As the foundation models for Earth observation scale up, the cost of (re-)training to support new satellites grows too, so the generalization capabilities of the models towards new satellites become increasingly important. In this work we introduce GeoCrossBench, an extension of the popular GeoBench benchmark with a new evaluation protocol: it tests the in-distribution performance; generalization to satellites with no band overlap; and generalization to satellites with additional bands with respect to the training set. We also develop a self-supervised extension of ChannelViT, ChiViT, to improve its cross-satellite performance. First, we show that even the best foundation models for remote sensing (DOFA, TerraFM) do not outperform general purpose models like DINOv3 in the in-distribution setting. Second, when generalizing to new satellites with no band overlap, all models suffer 2-4x drop in performance, and ChiViT significantly outperforms the runner-up DINOv3. Third, the performance of all tested models drops on average by 5-25\% when given additional bands during test time. Finally, we show that fine-tuning just the last linear layer of these models using oracle labels from all bands can get relatively consistent performance across all satellites, highlighting that the benchmark is far from being saturated. We publicly release the code and the datasets to encourage the development of more future-proof remote sensing models with stronger cross-satellite generalization.
☆ Accelerated Frank-Wolfe Algorithms: Complementarity Conditions and Sparsity
We develop new accelerated first-order algorithms in the Frank-Wolfe (FW) family for minimizing smooth convex functions over compact convex sets, with a focus on two prominent constraint classes: (1) polytopes and (2) matrix domains given by the spectrahedron and the unit nuclear-norm ball. A key technical ingredient is a complementarity condition that captures solution sparsity -- face dimension for polytopes and rank for matrices. We present two algorithms: (1) a purely linear optimization oracle (LOO) method for polytopes that has optimal worst-case first-order (FO) oracle complexity and, aside of a finite \emph{burn-in} phase and up to a logarithmic factor, has LOO complexity that scales with $r/\sqrt{\epsilon}$, where $\epsilon$ is the target accuracy and $r$ is the solution sparsity $r$ (independently of the ambient dimension), and (2) a hybrid scheme that combines FW with a sparse projection oracle (e.g., low-rank SVDs for matrix domains with low-rank solutions), which also has optimal FO oracle complexity, and after a finite burn-in phase, only requires $O(1/\sqrt{\epsilon})$ sparse projections and LOO calls (independently of both the ambient dimension and the rank of optimal solutions). Our results close a gap on how to accelerate recent advancements in linearly-converging FW algorithms for strongly convex optimization, without paying the price of the dimension.
☆ Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning
Tabular data remain the predominant format for real-world applications. Yet, developing effective neural models for tabular data remains challenging due to heterogeneous feature types and complex interactions occurring at multiple scales. Recent advances in tabular in-context learning (ICL), such as TabPFN and TabICL, have achieved state-of-the-art performance comparable to gradient-boosted trees (GBTs) without task-specific fine-tuning. However, current architectures exhibit key limitations: (1) single-scale feature processing that overlooks hierarchical dependencies, (2) dense attention with quadratic scaling in table width, and (3) strictly sequential component processing that prevents iterative representation refinement and cross-component communication. To address these challenges, we introduce Orion-MSP, a tabular ICL architecture featuring three key innovations: (1) multi-scale processing to capture hierarchical feature interactions; (2) block-sparse attention combining windowed, global, and random patterns for scalable efficiency and long-range connectivity; and (3) a Perceiver-style memory enabling safe bidirectional information flow across components. Across diverse benchmarks, Orion-MSP matches or surpasses state-of-the-art performance while scaling effectively to high-dimensional tables, establishing a new standard for efficient tabular in-context learning. The model is publicly available at https://github.com/Lexsi-Labs/Orion-MSP .
☆ Assessing win strength in MLB win prediction models
In Major League Baseball, strategy and planning are major factors in determining the outcome of a game. Previous studies have aided this by building machine learning models for predicting the winning team of any given game. We extend this work by training a comprehensive set of machine learning models using a common dataset. In addition, we relate the win probabilities produced by these models to win strength as measured by score differential. In doing so we show that the most common machine learning models do indeed demonstrate a relationship between predicted win probability and the strength of the win. Finally, we analyze the results of using predicted win probabilities as a decision making mechanism on run-line betting. We demonstrate positive returns when utilizing appropriate betting strategies, and show that naive use of machine learning models for betting lead to significant loses.
☆ TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models
Tabular foundation models represent a growing paradigm in structured data learning, extending the benefits of large-scale pretraining to tabular domains. However, their adoption remains limited due to heterogeneous preprocessing pipelines, fragmented APIs, inconsistent fine-tuning procedures, and the absence of standardized evaluation for deployment-oriented metrics such as calibration and fairness. We present TabTune, a unified library that standardizes the complete workflow for tabular foundation models through a single interface. TabTune provides consistent access to seven state-of-the-art models supporting multiple adaptation strategies, including zero-shot inference, meta-learning, supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). The framework automates model-aware preprocessing, manages architectural heterogeneity internally, and integrates evaluation modules for performance, calibration, and fairness. Designed for extensibility and reproducibility, TabTune enables consistent benchmarking of adaptation strategies of tabular foundation models. The library is open source and available at https://github.com/Lexsi-Labs/TabTune .
☆ Fast, Private, and Protected: Safeguarding Data Privacy and Defending Against Model Poisoning Attacks in Federated Learning
Federated Learning (FL) is a distributed training paradigm wherein participants collaborate to build a global model while ensuring the privacy of the involved data, which remains stored on participant devices. However, proposals aiming to ensure such privacy also make it challenging to protect against potential attackers seeking to compromise the training outcome. In this context, we present Fast, Private, and Protected (FPP), a novel approach that aims to safeguard federated training while enabling secure aggregation to preserve data privacy. This is accomplished by evaluating rounds using participants' assessments and enabling training recovery after an attack. FPP also employs a reputation-based mechanism to mitigate the participation of attackers. We created a dockerized environment to validate the performance of FPP compared to other approaches in the literature (FedAvg, Power-of-Choice, and aggregation via Trimmed Mean and Median). Our experiments demonstrate that FPP achieves a rapid convergence rate and can converge even in the presence of malicious participants performing model poisoning attacks.
☆ Can LLMs subtract numbers?
We present a systematic study of subtraction in large language models (LLMs). While prior benchmarks emphasize addition and multiplication, subtraction has received comparatively little attention despite being structurally distinct as a non-commutative operation. We evaluate eight pretrained LLMs spanning four families on addition and subtraction problems. Our experiments reveal that subtraction accuracy lags behind addition by a wide margin. We find that the errors for ($a-b$) are concentrated in cases where ($a
comment: Work-in-progress; MathNLP non-archival presentation
☆ Enhancing Federated Learning Privacy with QUBO
Federated learning (FL) is a widely used method for training machine learning (ML) models in a scalable way while preserving privacy (i.e., without centralizing raw data). Prior research shows that the risk of exposing sensitive data increases cumulatively as the number of iterations where a client's updates are included in the aggregated model increase. Attackers can launch membership inference attacks (MIA; deciding whether a sample or client participated), property inference attacks (PIA; inferring attributes of a client's data), and model inversion attacks (MI; reconstructing inputs), thereby inferring client-specific attributes and, in some cases, reconstructing inputs. In this paper, we mitigate risk by substantially reducing per client exposure using a quantum computing-inspired quadratic unconstrained binary optimization (QUBO) formulation that selects a small subset of client updates most relevant for each training round. In this work, we focus on two threat vectors: (i) information leakage by clients during training and (ii) adversaries who can query or obtain the global model. We assume a trusted central server and do not model server compromise. This method also assumes that the server has access to a validation/test set with global data distribution. Experiments on the MNIST dataset with 300 clients in 20 rounds showed a 95.2% per-round and 49% cumulative privacy exposure reduction, with 147 clients' updates never being used during training while maintaining in general the full-aggregation accuracy or even better. The method proved to be efficient at lower scale and more complex model as well. A CINIC-10 dataset-based experiment with 30 clients resulted in 82% per-round privacy improvement and 33% cumulative privacy.
comment: 8 pages, 9 figures
☆ Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold
Despite the popularity of the Adam optimizer in practice, most theoretical analyses study Stochastic Gradient Descent (SGD) as a proxy for Adam, and little is known about how the solutions found by Adam differ. In this paper, we show that Adam implicitly reduces a unique form of sharpness measure shaped by its adaptive updates, leading to qualitatively different solutions from SGD. More specifically, when the training loss is small, Adam wanders around the manifold of minimizers and takes semi-gradients to minimize this sharpness measure in an adaptive manner, a behavior we rigorously characterize through a continuous-time approximation using stochastic differential equations. We further demonstrate how this behavior differs from that of SGD in a well-studied setting: when training overparameterized models with label noise, SGD has been shown to minimize the trace of the Hessian matrix, $\tr(\mH)$, whereas we prove that Adam minimizes $\tr(\Diag(\mH)^{1/2})$ instead. In solving sparse linear regression with diagonal linear networks, this distinction enables Adam to achieve better sparsity and generalization than SGD. Finally, our analysis framework extends beyond Adam to a broad class of adaptive gradient methods, including RMSProp, Adam-mini, Adalayer and Shampoo, and provides a unified perspective on how these adaptive optimizers reduce sharpness, which we hope will offer insights for future optimizer design.
☆ STAR-VAE: Latent Variable Transformers for Scalable and Controllable Molecular Generation
The chemical space of drug-like molecules is vast, motivating the development of generative models that must learn broad chemical distributions, enable conditional generation by capturing structure-property representations, and provide fast molecular generation. Meeting the objectives depends on modeling choices, including the probabilistic modeling approach, the conditional generative formulation, the architecture, and the molecular input representation. To address the challenges, we present STAR-VAE (Selfies-encoded, Transformer-based, AutoRegressive Variational Auto Encoder), a scalable latent-variable framework with a Transformer encoder and an autoregressive Transformer decoder. It is trained on 79 million drug-like molecules from PubChem, using SELFIES to guarantee syntactic validity. The latent-variable formulation enables conditional generation: a property predictor supplies a conditioning signal that is applied consistently to the latent prior, the inference network, and the decoder. Our contributions are: (i) a Transformer-based latent-variable encoder-decoder model trained on SELFIES representations; (ii) a principled conditional latent-variable formulation for property-guided generation; and (iii) efficient finetuning with low-rank adapters (LoRA) in both encoder and decoder, enabling fast adaptation with limited property and activity data. On the GuacaMol and MOSES benchmarks, our approach matches or exceeds baselines, and latent-space analyses reveal smooth, semantically structured representations that support both unconditional exploration and property-aware generation. On the Tartarus benchmarks, the conditional model shifts docking-score distributions toward stronger predicted binding. These results suggest that a modernized, scale-appropriate VAE remains competitive for molecular generation when paired with principled conditioning and parameter-efficient finetuning.
comment: 16 pages, 3 figures, 2 tables
☆ VecComp: Vector Computing via MIMO Digital Over-the-Air Computation
Recently, the ChannelComp framework has proposed digital over-the-air computation by designing digital modulations that enable the computation of arbitrary functions. Unlike traditional analog over-the-air computation, which is restricted to nomographic functions, ChannelComp enables a broader range of computational tasks while maintaining compatibility with digital communication systems. This framework is intended for applications that favor local information processing over the mere acquisition of data. However, ChannelComp is currently designed for scalar function computation, while numerous data-centric applications necessitate vector-based computations, and it is susceptible to channel fading. In this work, we introduce a generalization of the ChannelComp framework, called VecComp, by integrating ChannelComp with multiple-antenna technology. This generalization not only enables vector function computation but also ensures scalability in the computational complexity, which increases only linearly with the vector dimension. As such, VecComp remains computationally efficient and robust against channel impairments, making it suitable for high-dimensional, data-centric applications. We establish a non-asymptotic upper bound on the mean squared error of VecComp, affirming its computation efficiency under fading channel conditions. Numerical experiments show the effectiveness of VecComp in improving the computation of vector functions and fading compensation over noisy and fading multiple-access channels.
☆ From Solo to Symphony: Orchestrating Multi-Agent Collaboration with Single-Agent Demos
Training a team of agents from scratch in multi-agent reinforcement learning (MARL) is highly inefficient, much like asking beginners to play a symphony together without first practicing solo. Existing methods, such as offline or transferable MARL, can ease this burden, but they still rely on costly multi-agent data, which often becomes the bottleneck. In contrast, solo experiences are far easier to obtain in many important scenarios, e.g., collaborative coding, household cooperation, and search-and-rescue. To unlock their potential, we propose Solo-to-Collaborative RL (SoCo), a framework that transfers solo knowledge into cooperative learning. SoCo first pretrains a shared solo policy from solo demonstrations, then adapts it for cooperation during multi-agent training through a policy fusion mechanism that combines an MoE-like gating selector and an action editor. Experiments across diverse cooperative tasks show that SoCo significantly boosts the training efficiency and performance of backbone algorithms. These results demonstrate that solo demonstrations provide a scalable and effective complement to multi-agent data, making cooperative learning more practical and broadly applicable.
☆ ConMeZO: Adaptive Descent-Direction Sampling for Gradient-Free Finetuning of Large Language Models
Zeroth-order or derivative-free optimization (MeZO) is an attractive strategy for finetuning large language models (LLMs) because it eliminates the memory overhead of backpropagation. However, it converges slowly due to the inherent curse of dimensionality when searching for descent directions in the high-dimensional parameter space of billion-scale LLMs. We propose ConMeZO, a novel zeroth-order optimizer that accelerates convergence by adaptive directional sampling. Instead of drawing the direction uniformly at random, ConMeZO restricts the sampling to a cone centered around a momentum estimate. This concentrates the search in directions where the true gradient is more likely to lie and thus reduces the effect of high dimensions. We prove that ConMeZO achieves the same worst-case convergence rate as MeZO. Empirically, when finetuning LLMs on natural language tasks, ConMeZO is up to 2X faster than MeZO while retaining the low-memory footprint of zeroth-order methods.
☆ DANIEL: A Distributed and Scalable Approach for Global Representation Learning with EHR Applications
Classical probabilistic graphical models face fundamental challenges in modern data environments, which are characterized by high dimensionality, source heterogeneity, and stringent data-sharing constraints. In this work, we revisit the Ising model, a well-established member of the Markov Random Field (MRF) family, and develop a distributed framework that enables scalable and privacy-preserving representation learning from large-scale binary data with inherent low-rank structure. Our approach optimizes a non-convex surrogate loss function via bi-factored gradient descent, offering substantial computational and communication advantages over conventional convex approaches. We evaluate our algorithm on multi-institutional electronic health record (EHR) datasets from 58,248 patients across the University of Pittsburgh Medical Center (UPMC) and Mass General Brigham (MGB), demonstrating superior performance in global representation learning and downstream clinical tasks, including relationship detection, patient phenotyping, and patient clustering. These results highlight a broader potential for statistical inference in federated, high-dimensional settings while addressing the practical challenges of data complexity and multi-institutional integration.
☆ Agentic World Modeling for 6G: Near-Real-Time Generative State-Space Reasoning
We argue that sixth-generation (6G) intelligence is not fluent token prediction but the capacity to imagine and choose -- to simulate future scenarios, weigh trade-offs, and act with calibrated uncertainty. We reframe open radio access network (O-RAN) near-real-time (Near-RT) control via counterfactual dynamics and a world modeling (WM) paradigm that learns an action-conditioned generative state space. This enables quantitative "what-if" forecasting beyond large language models (LLMs) as the primary modeling primitive. Actions such as physical resource blocks (PRBs) are treated as first-class control inputs in a causal world model, and both aleatoric and epistemic uncertainty are modeled for prediction and what-if analysis. An agentic, model predictive control (MPC)-based cross-entropy method (CEM) planner operates over short horizons, using prior-mean rollouts within data-driven PRB bounds to maximize a deterministic reward. The model couples multi-scale structured state-space mixtures (MS3M) with a compact stochastic latent to form WM-MS3M, summarizing key performance indicators (KPIs) histories and predicting next-step KPIs under hypothetical PRB sequences. On realistic O-RAN traces, WM-MS3M cuts mean absolute error (MAE) by 1.69% versus MS3M with 32% fewer parameters and similar latency, and achieves 35-80% lower root mean squared error (RMSE) than attention/hybrid baselines with 2.3-4.1x faster inference, enabling rare-event simulation and offline policy screening.
comment: 13 Pages, 3 Figures, 4 Tables
☆ Calibration improves detection of mislabeled examples
Mislabeled data is a pervasive issue that undermines the performance of machine learning systems in real-world applications. An effective approach to mitigate this problem is to detect mislabeled instances and subject them to special treatment, such as filtering or relabeling. Automatic mislabeling detection methods typically rely on training a base machine learning model and then probing it for each instance to obtain a trust score that each provided label is genuine or incorrect. The properties of this base model are thus of paramount importance. In this paper, we investigate the impact of calibrating this model. Our empirical results show that using calibration methods improves the accuracy and robustness of mislabeled instance detection, providing a practical and effective solution for industrial applications.
☆ Does Interpretability of Knowledge Tracing Models Support Teacher Decision Making?
Knowledge tracing (KT) models are a crucial basis for pedagogical decision-making, namely which task to select next for a learner and when to stop teaching a particular skill. Given the high stakes of pedagogical decisions, KT models are typically required to be interpretable, in the sense that they should implement an explicit model of human learning and provide explicit estimates of learners' abilities. However, to our knowledge, no study to date has investigated whether the interpretability of KT models actually helps human teachers to make teaching decisions. We address this gap. First, we perform a simulation study to show that, indeed, decisions based on interpretable KT models achieve mastery faster compared to decisions based on a non-interpretable model. Second, we repeat the study but ask $N=12$ human teachers to make the teaching decisions based on the information provided by KT models. As expected, teachers rate interpretable KT models higher in terms of usability and trustworthiness. However, the number of tasks needed until mastery hardly differs between KT models. This suggests that the relationship between model interpretability and teacher decisions is not straightforward: teachers do not solely rely on KT models to make decisions and further research is needed to investigate how learners and teachers actually understand and use KT models.
comment: in press at the Workshop on Epistemics and Decision-Making in AI-Supported Education, AIED 2025
☆ Optimizing Kernel Discrepancies via Subset Selection
Kernel discrepancies are a powerful tool for analyzing worst-case errors in quasi-Monte Carlo (QMC) methods. Building on recent advances in optimizing such discrepancy measures, we extend the subset selection problem to the setting of kernel discrepancies, selecting an m-element subset from a large population of size $n \gg m$. We introduce a novel subset selection algorithm applicable to general kernel discrepancies to efficiently generate low-discrepancy samples from both the uniform distribution on the unit hypercube, the traditional setting of classical QMC, and from more general distributions $F$ with known density functions by employing the kernel Stein discrepancy. We also explore the relationship between the classical $L_2$ star discrepancy and its $L_\infty$ counterpart.
☆ Curriculum Design for Trajectory-Constrained Agent: Compressing Chain-of-Thought Tokens in LLMs NeurIPS'25
Training agents to operate under strict constraints during deployment, such as limited resource budgets or stringent safety requirements, presents significant challenges, especially when these constraints render the task complex. In this work, we propose a curriculum learning strategy that gradually tightens constraints during training, enabling the agent to incrementally master the deployment requirements. Inspired by self-paced learning techniques in unconstrained reinforcement learning (RL), our approach facilitates a smoother transition to challenging environments by initially training on simplified versions of the constraints and progressively introducing the full deployment conditions. We provide a theoretical analysis using an RL agent in a binary-tree Markov Decision Process (MDP) to demonstrate that our curriculum strategy can accelerate training relative to a baseline approach that imposes the trajectory constraints from the outset. Moreover, we empirically validate the effectiveness and generality of our method across both RL and large language model (LLM) agents in diverse settings, including a binary-tree MDP, a multi-task navigation domain, and a math reasoning task with two benchmarks. These results highlight the potential of curriculum design in enhancing the efficiency and performance of agents operating under complex trajectory constraints during deployment. Moreover, when applied to LLMs, our strategy enables compression of output chain-of-thought tokens, achieving a substantial inference speedup on consumer hardware, demonstrating its effectiveness for resource-constrained deployment.
comment: NeurIPS'25 paper
☆ Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes
Large language models (LLMs) are increasingly prevalent across diverse applications. However, their enormous size limits storage and processing capabilities to a few well-resourced stakeholders. As a result, most applications rely on pre-trained LLMs, fine-tuned for specific tasks. However, even storing the fine-tuned versions of these models remains a significant challenge due to the wide range of tasks they address. Recently, studies show that fine-tuning these models primarily affects a small fraction of parameters, highlighting the need for more efficient storage of fine-tuned models. This paper focuses on efficient storage of parameter updates in pre-trained models after fine-tuning. To address this challenge, we leverage the observation that fine-tuning updates are both low-rank and sparse, which can be utilized for storage efficiency. However, using only low-rank approximation or sparsification may discard critical singular components that enhance model expressivity. We first observe that given the same memory budget, sparsified low-rank approximations with larger ranks outperform standard low-rank approximations with smaller ranks. Building on this, we propose our method, optimal singular damage, that selectively sparsifies low-rank approximated updates by leveraging the interleaved importance of singular vectors, ensuring that the most impactful components are retained. We demonstrate through extensive experiments that our proposed methods lead to significant storage efficiency and superior accuracy within the same memory budget compared to employing the low-rank approximation or sparsification individually.
☆ RL-Aided Cognitive ISAC: Robust Detection and Sensing-Communication Trade-offs
This paper proposes a reinforcement learning (RL)-aided cognitive framework for massive MIMO-based integrated sensing and communication (ISAC) systems employing a uniform planar array (UPA). The focus is on enhancing radar sensing performance in environments with unknown and dynamic disturbance characteristics. A Wald-type detector is employed for robust target detection under non-Gaussian clutter, while a SARSA-based RL algorithm enables adaptive estimation of target positions without prior environmental knowledge. Based on the RL-derived sensing information, a joint waveform optimization strategy is formulated to balance radar sensing accuracy and downlink communication throughput. The resulting design provides an adaptive trade-off between detection performance and achievable sum rate through an analytically derived closed-form solution. Monte Carlo simulations demonstrate that the proposed cognitive ISAC framework achieves significantly improved detection probability compared to orthogonal and non-learning adaptive baselines, while maintaining competitive communication performance. These results underline the potential of RL-assisted sensing for robust and spectrum-efficient ISAC in next-generation wireless networks.
comment: 29 pages, 14 figures. Invited paper, submitted to the EURASIP Journal on Wireless Communications and Networking (JWCN)
☆ Scalable Evaluation and Neural Models for Compositional Generalization NeurIPS
Compositional generalization-a key open challenge in modern machine learning-requires models to predict unknown combinations of known concepts. However, assessing compositional generalization remains a fundamental challenge due to the lack of standardized evaluation protocols and the limitations of current benchmarks, which often favor efficiency over rigor. At the same time, general-purpose vision architectures lack the necessary inductive biases, and existing approaches to endow them compromise scalability. As a remedy, this paper introduces: 1) a rigorous evaluation framework that unifies and extends previous approaches while reducing computational requirements from combinatorial to constant; 2) an extensive and modern evaluation on the status of compositional generalization in supervised vision backbones, training more than 5000 models; 3) Attribute Invariant Networks, a class of models establishing a new Pareto frontier in compositional generalization, achieving a 23.43% accuracy improvement over baselines while reducing parameter overhead from 600% to 16% compared to fully disentangled counterparts.
comment: Accepted at the Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025
☆ In Situ Training of Implicit Neural Compressors for Scientific Simulations via Sketch-Based Regularization
Focusing on implicit neural representations, we present a novel in situ training protocol that employs limited memory buffers of full and sketched data samples, where the sketched data are leveraged to prevent catastrophic forgetting. The theoretical motivation for our use of sketching as a regularizer is presented via a simple Johnson-Lindenstrauss-informed result. While our methods may be of wider interest in the field of continual learning, we specifically target in situ neural compression using implicit neural representation-based hypernetworks. We evaluate our method on a variety of complex simulation data in two and three dimensions, over long time horizons, and across unstructured grids and non-Cartesian geometries. On these tasks, we show strong reconstruction performance at high compression rates. Most importantly, we demonstrate that sketching enables the presented in situ scheme to approximately match the performance of the equivalent offline method.
comment: 17 pages, 8 figures, 4 tables
☆ Nesterov-Accelerated Robust Federated Learning Over Byzantine Adversaries
We investigate robust federated learning, where a group of workers collaboratively train a shared model under the orchestration of a central server in the presence of Byzantine adversaries capable of arbitrary and potentially malicious behaviors. To simultaneously enhance communication efficiency and robustness against such adversaries, we propose a Byzantine-resilient Nesterov-Accelerated Federated Learning (Byrd-NAFL) algorithm. Byrd-NAFL seamlessly integrates Nesterov's momentum into the federated learning process alongside Byzantine-resilient aggregation rules to achieve fast and safeguarding convergence against gradient corruption. We establish a finite-time convergence guarantee for Byrd-NAFL under non-convex and smooth loss functions with relaxed assumption on the aggregated gradients. Extensive numerical experiments validate the effectiveness of Byrd-NAFL and demonstrate the superiority over existing benchmarks in terms of convergence speed, accuracy, and resilience to diverse Byzantine attack strategies.
☆ Apriel-H1: Towards Efficient Enterprise Reasoning Models
Large Language Models (LLMs) achieve remarkable reasoning capabilities through transformer architectures with attention mechanisms. However, transformers suffer from quadratic time and memory complexity in the attention module (MHA) and require caching key-value states during inference, which severely limits throughput and scalability. High inference throughput is critical for agentic tasks, long-context reasoning, efficient deployment under high request loads, and more efficient test-time compute scaling. State Space Models (SSMs) such as Mamba offer a promising alternative with linear inference complexity and a constant memory footprint via recurrent computation with fixed-size hidden states. In this technical report we introduce the Apriel-H1 family of hybrid LLMs that combine transformer attention and SSM sequence mixers for efficient reasoning at 15B model size. These models are obtained through incremental distillation from a pretrained reasoning transformer, Apriel-Nemotron-15B-Thinker, progressively replacing less critical attention layers with linear Mamba blocks. We release multiple post-distillation variants of Apriel-H1-15B-Thinker with different SSM-to-MHA ratios and analyse how reasoning performance degrades as more Mamba layers replace MHA. Additionally, we release a 30/50 hybrid variant of Apriel-H1, further fine-tuned on a supervised dataset of reasoning traces, achieving over 2x higher inference throughput when deployed in the production-ready vLLM environment, with minimal degradation in reasoning performance. This shows that distilled hybrid SSM-Transformer architectures can deliver substantial efficiency gains over the pretrained transformer equivalent without substantially compromising the reasoning quality.
☆ Federated Attention: A Distributed Paradigm for Collaborative LLM Inference over Edge Networks
Large language models (LLMs) are proliferating rapidly at the edge, delivering intelligent capabilities across diverse application scenarios. However, their practical deployment in collaborative scenarios confronts fundamental challenges: privacy vulnerabilities, communication overhead, and computational bottlenecks. To address these, we propose Federated Attention (FedAttn), which integrates the federated paradigm into the self-attention mechanism, creating a new distributed LLM inference framework that simultaneously achieves privacy protection, communication efficiency, and computational efficiency. FedAttn enables participants to perform local self-attention over their own token representations while periodically exchanging and aggregating Key-Value (KV) matrices across multiple Transformer blocks, collaboratively generating LLM responses without exposing private prompts. Further, we identify a structural duality between contextual representation refinement in FedAttn and parameter optimization in FL across private data, local computation, and global aggregation. This key insight provides a principled foundation for systematically porting federated optimization techniques to collaborative LLM inference. Building on this framework, we theoretically analyze how local self-attention computation within participants and heterogeneous token relevance among participants shape error propagation dynamics across Transformer blocks. Moreover, we characterize the fundamental trade-off between response quality and communication/computation efficiency, which is governed by the synchronization interval and the number of participants. Experimental results validate our theoretical analysis, and reveal significant optimization opportunities through sparse attention and adaptive KV aggregation, highlighting FedAttn's potential to deliver scalability and efficiency in real-world edge deployments.
☆ Natural-gas storage modelling by deep reinforcement learning
We introduce GasRL, a simulator that couples a calibrated representation of the natural gas market with a model of storage-operator policies trained with deep reinforcement learning (RL). We use it to analyse how optimal stockpile management affects equilibrium prices and the dynamics of demand and supply. We test various RL algorithms and find that Soft Actor Critic (SAC) exhibits superior performance in the GasRL environment: multiple objectives of storage operators - including profitability, robust market clearing and price stabilisation - are successfully achieved. Moreover, the equilibrium price dynamics induced by SAC-derived optimal policies have characteristics, such as volatility and seasonality, that closely match those of real-world prices. Remarkably, this adherence to the historical distribution of prices is obtained without explicitly calibrating the model to price data. We show how the simulator can be used to assess the effects of EU-mandated minimum storage thresholds. We find that such thresholds have a positive effect on market resilience against unanticipated shifts in the distribution of supply shocks. For example, with unusually large shocks, market disruptions are averted more often if a threshold is in place.
comment: 8 pages, 5 figures, published on
☆ Recursively Enumerably Representable Classes and Computable Versions of the Fundamental Theorem of Statistical Learning
We study computable probably approximately correct (CPAC) learning, where learners are required to be computable functions. It had been previously observed that the Fundamental Theorem of Statistical Learning, which characterizes PAC learnability by finiteness of the Vapnik-Chervonenkis (VC-)dimension, no longer holds in this framework. Recent works recovered analogs of the Fundamental Theorem in the computable setting, for instance by introducing an effective VC-dimension. Guided by this, we investigate the connection between CPAC learning and recursively enumerable representable (RER) classes, whose members can be algorithmically listed. Our results show that the effective VC-dimensions can take arbitrary values above the traditional one, even for RER classes, which creates a whole family of (non-)examples for various notions of CPAC learning. Yet the two dimensions coincide for classes satisfying sufficiently strong notions of CPAC learning. We then observe that CPAC learnability can also be characterized via containment of RER classes that realize the same samples. Furthermore, it is shown that CPAC learnable classes satisfying a unique identification property are necessarily RER. Finally, we establish that agnostic learnability can be guaranteed for RER classes, by considering the relaxed notion of nonuniform CPAC learning.
☆ The stability of shallow neural networks on spheres: A sharp spectral analysis
We present an estimation of the condition numbers of the \emph{mass} and \emph{stiffness} matrices arising from shallow ReLU$^k$ neural networks defined on the unit sphere~$\mathbb{S}^d$. In particular, when $\{\theta_j^*\}_{j=1}^n \subset \mathbb{S}^d$ is \emph{antipodally quasi-uniform}, the condition number is sharp. Indeed, in this case, we obtain sharp asymptotic estimates for the full spectrum of eigenvalues and characterize the structure of the corresponding eigenspaces, showing that the smallest eigenvalues are associated with an eigenbasis of low-degree polynomials while the largest eigenvalues are linked to high-degree polynomials. This spectral analysis establishes a precise correspondence between the approximation power of the network and its numerical stability.
☆ Verifying LLM Inference to Prevent Model Weight Exfiltration
As large AI models become increasingly valuable assets, the risk of model weight exfiltration from inference servers grows accordingly. An attacker controlling an inference server may exfiltrate model weights by hiding them within ordinary model outputs, a strategy known as steganography. This work investigates how to verify model responses to defend against such attacks and, more broadly, to detect anomalous or buggy behavior during inference. We formalize model exfiltration as a security game, propose a verification framework that can provably mitigate steganographic exfiltration, and specify the trust assumptions associated with our scheme. To enable verification, we characterize valid sources of non-determinism in large language model inference and introduce two practical estimators for them. We evaluate our detection framework on several open-weight models ranging from 3B to 30B parameters. On MOE-Qwen-30B, our detector reduces exfiltratable information to <0.5% with false-positive rate of 0.01%, corresponding to a >200x slowdown for adversaries. Overall, this work further establishes a foundation for defending against model weight exfiltration and demonstrates that strong protection can be achieved with minimal additional cost to inference providers.
☆ A Non-Adversarial Approach to Idempotent Generative Modelling
Idempotent Generative Networks (IGNs) are deep generative models that also function as local data manifold projectors, mapping arbitrary inputs back onto the manifold. They are trained to act as identity operators on the data and as idempotent operators off the data manifold. However, IGNs suffer from mode collapse, mode dropping, and training instability due to their objectives, which contain adversarial components and can cause the model to cover the data manifold only partially -- an issue shared with generative adversarial networks. We introduce Non-Adversarial Idempotent Generative Networks (NAIGNs) to address these issues. Our loss function combines reconstruction with the non-adversarial generative objective of Implicit Maximum Likelihood Estimation (IMLE). This improves on IGN's ability to restore corrupted data and generate new samples that closely match the data distribution. We moreover demonstrate that NAIGNs implicitly learn the distance field to the data manifold, as well as an energy-based model.
☆ Neural Network Interoperability Across Platforms
The development of smart systems (i.e., systems enhanced with AI components) has thrived thanks to the rapid advancements in neural networks (NNs). A wide range of libraries and frameworks have consequently emerged to support NN design and implementation. The choice depends on factors such as available functionalities, ease of use, documentation and community support. After adopting a given NN framework, organizations might later choose to switch to another if performance declines, requirements evolve, or new features are introduced. Unfortunately, migrating NN implementations across libraries is challenging due to the lack of migration approaches specifically tailored for NNs. This leads to increased time and effort to modernize NNs, as manual updates are necessary to avoid relying on outdated implementations and ensure compatibility with new features. In this paper, we propose an approach to automatically migrate neural network code across deep learning frameworks. Our method makes use of a pivot NN model to create an abstraction of the NN prior to migration. We validate our approach using two popular NN frameworks, namely PyTorch and TensorFlow. We also discuss the challenges of migrating code between the two frameworks and how they were approached in our method. Experimental evaluation on five NNs shows that our approach successfully migrates their code and produces NNs that are functionally equivalent to the originals. Artefacts from our work are available online.
☆ A Large Language Model for Corporate Credit Scoring
We introduce Omega^2, a Large Language Model-driven framework for corporate credit scoring that combines structured financial data with advanced machine learning to improve predictive reliability and interpretability. Our study evaluates Omega^2 on a multi-agency dataset of 7,800 corporate credit ratings drawn from Moody's, Standard & Poor's, Fitch, and Egan-Jones, each containing detailed firm-level financial indicators such as leverage, profitability, and liquidity ratios. The system integrates CatBoost, LightGBM, and XGBoost models optimized through Bayesian search under temporal validation to ensure forward-looking and reproducible results. Omega^2 achieved a mean test AUC above 0.93 across agencies, confirming its ability to generalize across rating systems and maintain temporal consistency. These results show that combining language-based reasoning with quantitative learning creates a transparent and institution-grade foundation for reliable corporate credit-risk assessment.
☆ Redundancy Maximization as a Principle of Associative Memory Learning
Associative memory, traditionally modeled by Hopfield networks, enables the retrieval of previously stored patterns from partial or noisy cues. Yet, the local computational principles which are required to enable this function remain incompletely understood. To formally characterize the local information processing in such systems, we employ a recent extension of information theory - Partial Information Decomposition (PID). PID decomposes the contribution of different inputs to an output into unique information from each input, redundant information across inputs, and synergistic information that emerges from combining different inputs. Applying this framework to individual neurons in classical Hopfield networks we find that below the memory capacity, the information in a neuron's activity is characterized by high redundancy between the external pattern input and the internal recurrent input, while synergy and unique information are close to zero until the memory capacity is surpassed and performance drops steeply. Inspired by this observation, we use redundancy as an information-theoretic learning goal, which is directly optimized for each neuron, dramatically increasing the network's memory capacity to 1.59, a more than tenfold improvement over the 0.14 capacity of classical Hopfield networks and even outperforming recent state-of-the-art implementations of Hopfield networks. Ultimately, this work establishes redundancy maximization as a new design principle for associative memories and opens pathways for new associative memory models based on information-theoretic goals.
comment: 21 pages, 8 figures
☆ TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.
comment: 13 pages, 8 figures, 3 tables. The first two authors contributed equally. Project Page: https://iyatomilab.github.io/TAUE
☆ Directional-Clamp PPO
Proximal Policy Optimization (PPO) is widely regarded as one of the most successful deep reinforcement learning algorithms, known for its robustness and effectiveness across a range of problems. The PPO objective encourages the importance ratio between the current and behavior policies to move to the "right" direction -- starting from importance sampling ratios equal to 1, increasing the ratios for actions with positive advantages and decreasing those with negative advantages. A clipping function is introduced to prevent over-optimization when updating the importance ratio in these "right" direction regions. Many PPO variants have been proposed to extend its success, most of which modify the objective's behavior by altering the clipping in the "right" direction regions. However, due to randomness in the rollouts and stochasticity of the policy optimization, we observe that the ratios frequently move to the "wrong" direction during the PPO optimization. This is a key factor hindering the improvement of PPO, but it has been largely overlooked. To address this, we propose the Directional-Clamp PPO algorithm (DClamp-PPO), which further penalizes the actions going to the strict "wrong" direction regions, where the advantage is positive (negative) and importance ratio falls below (above) $1 - \beta$ ($1+\beta$), for a tunable parameter $\beta \in (0, 1)$. The penalty is by enforcing a steeper loss slope, i.e., a clamp, in those regions. We demonstrate that DClamp-PPO consistently outperforms PPO, as well as its variants, by focusing on modifying the objective's behavior in the "right" direction, across various MuJoCo environments, using different random seeds. The proposed method is shown, both theoretically and empirically, to better avoid "wrong" direction updates while keeping the importance ratio closer to 1.
☆ RIS-Assisted 3D Spherical Splatting for Object Composition Visualization using Detection Transformers
The pursuit of immersive and structurally aware multimedia experiences has intensified interest in sensing modalities that reconstruct objects beyond the limits of visible light. Conventional optical pipelines degrade under occlusion or low illumination, motivating the use of radio-frequency (RF) sensing, whose electromagnetic waves penetrate materials and encode both geometric and compositional information. Yet, uncontrolled multipath propagation restricts reconstruction accuracy. Recent advances in Programmable Wireless Environments (PWEs) mitigate this limitation by enabling software-defined manipulation of propagation through Reconfigurable Intelligent Surfaces (RISs), thereby providing controllable illumination diversity. Building on this capability, this work introduces a PWE-driven RF framework for three-dimensional object reconstruction using material-aware spherical primitives. The proposed approach combines RIS-enabled field synthesis with a Detection Transformer (DETR) that infers spatial and material parameters directly from extracted RF features. Simulation results confirm the framework's ability to approximate object geometries and classify material composition with an overall accuracy of 79.35%, marking an initial step toward programmable and physically grounded RF-based 3D object composition visualization.
comment: Submitted to IEEE ICC 2026
☆ Dynamic Priors in Bayesian Optimization for Hyperparameter Optimization
Hyperparameter optimization (HPO), for example, based on Bayesian optimization (BO), supports users in designing models well-suited for a given dataset. HPO has proven its effectiveness on several applications, ranging from classical machine learning for tabular data to deep neural networks for computer vision and transformers for natural language processing. However, HPO still sometimes lacks acceptance by machine learning experts due to its black-box nature and limited user control. Addressing this, first approaches have been proposed to initialize BO methods with expert knowledge. However, these approaches do not allow for online steering during the optimization process. In this paper, we introduce a novel method that enables repeated interventions to steer BO via user input, specifying expert knowledge and user preferences at runtime of the HPO process in the form of prior distributions. To this end, we generalize an existing method, $\pi$BO, preserving theoretical guarantees. We also introduce a misleading prior detection scheme, which allows protection against harmful user inputs. In our experimental evaluation, we demonstrate that our method can effectively incorporate multiple priors, leveraging informative priors, whereas misleading priors are reliably rejected or overcome. Thereby, we achieve competitiveness to unperturbed BO.
comment: 9 pages
☆ Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning NeurIPS 2025
Offline reinforcement learning (RL) suffers from extrapolation errors induced by out-of-distribution (OOD) actions. To address this, offline RL algorithms typically impose constraints on action selection, which can be systematically categorized into density, support, and sample constraints. However, we show that each category has inherent limitations: density and sample constraints tend to be overly conservative in many scenarios, while the support constraint, though least restrictive, faces challenges in accurately modeling the behavior policy. To overcome these limitations, we propose a new neighborhood constraint that restricts action selection in the Bellman target to the union of neighborhoods of dataset actions. Theoretically, the constraint not only bounds extrapolation errors and distribution shift under certain conditions, but also approximates the support constraint without requiring behavior policy modeling. Moreover, it retains substantial flexibility and enables pointwise conservatism by adapting the neighborhood radius for each data point. In practice, we employ data quality as the adaptation criterion and design an adaptive neighborhood constraint. Building on an efficient bilevel optimization framework, we develop a simple yet effective algorithm, Adaptive Neighborhood-constrained Q learning (ANQ), to perform Q learning with target actions satisfying this constraint. Empirically, ANQ achieves state-of-the-art performance on standard offline RL benchmarks and exhibits strong robustness in scenarios with noisy or limited data.
comment: Accepted to NeurIPS 2025 (Spotlight)
☆ Forecasting Future Anatomies: Longitudianl Brain Mri-to-Mri Prediction
Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer's disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant's entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.
☆ Theoretical Guarantees for Causal Discovery on Large Random Graphs
We investigate theoretical guarantees for the false-negative rate (FNR) -- the fraction of true causal edges whose orientation is not recovered, under single-variable random interventions and an $\epsilon$-interventional faithfulness assumption that accommodates latent confounding. For sparse Erd\H{o}s--R\'enyi directed acyclic graphs, where the edge probability scales as $p_e = \Theta(1/d)$, we show that the FNR concentrates around its mean at rate $O(\frac{\log d}{\sqrt d})$, implying that large deviations above the expected error become exponentially unlikely as dimensionality increases. This concentration ensures that derived upper bounds hold with high probability in large-scale settings. Extending the analysis to generalized Barab\'asi--Albert graphs reveals an even stronger phenomenon: when the degree exponent satisfies $\gamma > 3$, the deviation width scales as $O(d^{\beta - \frac{1}{2}})$ with $\beta = 1/(\gamma - 1) < \frac{1}{2}$, and hence vanishes in the limit. This demonstrates that realistic scale-free topologies intrinsically regularize causal discovery, reducing variability in orientation error. These finite-dimension results provide the first dimension-adaptive, faithfulness-robust guarantees for causal structure recovery, and challenge the intuition that high dimensionality and network heterogeneity necessarily hinder accurate discovery. Our simulation results corroborate these theoretical predictions, showing that the FNR indeed concentrates and often vanishes in practice as dimensionality grows.
☆ Rawlsian many-to-one matching with non-linear utility
We study a many-to-one matching problem, such as the college admission problem, where each college can admit multiple students. Unlike classical models, colleges evaluate sets of students through non-linear utility functions that capture diversity between them. In this setting, we show that classical stable matchings may fail to exist. To address this, we propose alternative solution concepts based on Rawlsian fairness, aiming to maximize the minimum utility across colleges. We design both deterministic and stochastic algorithms that iteratively improve the outcome of the worst-off college, offering a practical approach to fair allocation when stability cannot be guaranteed.
comment: 17 pages, 7 figures
☆ Agentic AI for Mobile Network RAN Management and Optimization
Agentic AI represents a new paradigm for automating complex systems by using Large AI Models (LAMs) to provide human-level cognitive abilities with multimodal perception, planning, memory, and reasoning capabilities. This will lead to a new generation of AI systems that autonomously decompose goals, retain context over time, learn continuously, operate across tools and environments, and adapt dynamically. The complexity of 5G and upcoming 6G networks renders manual optimization ineffective, pointing to Agentic AI as a method for automating decisions in dynamic RAN environments. However, despite its rapid advances, there is no established framework outlining the foundational components and operational principles of Agentic AI systems nor a universally accepted definition. This paper contributes to ongoing research on Agentic AI in 5G and 6G networks by outlining its core concepts and then proposing a practical use case that applies Agentic principles to RAN optimization. We first introduce Agentic AI, tracing its evolution from classical agents and discussing the progress from workflows and simple AI agents to Agentic AI. Core design patterns-reflection, planning, tool use, and multi-agent collaboration-are then described to illustrate how intelligent behaviors are orchestrated. These theorical concepts are grounded in the context of mobile networks, with a focus on RAN management and optimization. A practical 5G RAN case study shows how time-series analytics and LAM-driven agents collaborate for KPI-based autonomous decision-making.
☆ Causal Graph Neural Networks for Healthcare
Healthcare artificial intelligence systems routinely fail when deployed across institutions, with documented performance drops and perpetuation of discriminatory patterns embedded in historical data. This brittleness stems, in part, from learning statistical associations rather than causal mechanisms. Causal graph neural networks address this triple crisis of distribution shift, discrimination, and inscrutability by combining graph-based representations of biomedical data with causal inference principles to learn invariant mechanisms rather than spurious correlations. This Review examines methodological foundations spanning structural causal models, disentangled causal representation learning, and techniques for interventional prediction and counterfactual reasoning on graphs. We analyse applications demonstrating clinical value across psychiatric diagnosis through brain network analysis, cancer subtyping via multi-omics causal integration, continuous physiological monitoring with mechanistic interpretation, and drug recommendation correcting prescription bias. These advances establish foundations for patient-specific Causal Digital Twins, enabling in silico clinical experimentation, with integration of large language models for hypothesis generation and causal graph neural networks for mechanistic validation. Substantial barriers remain, including computational requirements precluding real-time deployment, validation challenges demanding multi-modal evidence triangulation beyond cross-validation, and risks of causal-washing where methods employ causal terminology without rigorous evidentiary support. We propose tiered frameworks distinguishing causally-inspired architectures from causally-validated discoveries and identify critical research priorities making causal rather than purely associational claims.
☆ Many-vs-Many Missile Guidance via Virtual Targets
This paper presents a novel approach to many-vs-many missile guidance using virtual targets (VTs) generated by a Normalizing Flows-based trajectory predictor. Rather than assigning n interceptors directly to m physical targets through conventional weapon target assignment algorithms, we propose a centralized strategy that constructs n VT trajectories representing probabilistic predictions of maneuvering target behavior. Each interceptor is guided toward its assigned VT using Zero-Effort-Miss guidance during midcourse flight, transitioning to Proportional Navigation guidance for terminal interception. This approach treats many-vs-many engagements as many-vs-distribution scenarios, exploiting numerical superiority (n > m) by distributing interceptors across diverse trajectory hypotheses rather than pursuing identical deterministic predictions. Monte Carlo simulations across various target-interceptor configurations (1-6 targets, 1-8 interceptors) demonstrate that the VT method matches or exceeds baseline straight-line prediction performance by 0-4.1% when n = m, with improvements increasing to 5.8-14.4% when n > m. The results confirm that probabilistic VTs enable effective exploitation of numerical superiority, significantly increasing interception probability in many-vs-many scenarios.
comment: will be submitted to Journal of Guidance, Control, and Dynamics as Technical Note
☆ An End-to-End Learning Approach for Solving Capacitated Location-Routing Problems
The capacitated location-routing problems (CLRPs) are classical problems in combinatorial optimization, which require simultaneously making location and routing decisions. In CLRPs, the complex constraints and the intricate relationships between various decisions make the problem challenging to solve. With the emergence of deep reinforcement learning (DRL), it has been extensively applied to address the vehicle routing problem and its variants, while the research related to CLRPs still needs to be explored. In this paper, we propose the DRL with heterogeneous query (DRLHQ) to solve CLRP and open CLRP (OCLRP), respectively. We are the first to propose an end-to-end learning approach for CLRPs, following the encoder-decoder structure. In particular, we reformulate the CLRPs as a markov decision process tailored to various decisions, a general modeling framework that can be adapted to other DRL-based methods. To better handle the interdependency across location and routing decisions, we also introduce a novel heterogeneous querying attention mechanism designed to adapt dynamically to various decision-making stages. Experimental results on both synthetic and benchmark datasets demonstrate superior solution quality and better generalization performance of our proposed approach over representative traditional and DRL-based baselines in solving both CLRP and OCLRP.
☆ Variational Geometric Information Bottleneck: Learning the Shape of Understanding
We propose a unified information-geometric framework that formalizes understanding in learning as a trade-off between informativeness and geometric simplicity. An encoder phi is evaluated by U(phi) = I(phi(X); Y) - beta * C(phi), where C(phi) penalizes curvature and intrinsic dimensionality, enforcing smooth, low-complexity manifolds. Under mild manifold and regularity assumptions, we derive non-asymptotic bounds showing that generalization error scales with intrinsic dimension while curvature controls approximation stability, directly linking geometry to sample efficiency. To operationalize this theory, we introduce the Variational Geometric Information Bottleneck (V-GIB), a variational estimator that unifies mutual-information compression and curvature regularization through tractable geometric proxies such as the Hutchinson trace, Jacobian norms, and local PCA. Experiments across synthetic manifolds, few-shot settings, and real-world datasets (Fashion-MNIST, CIFAR-10) reveal a robust information-geometry Pareto frontier, stable estimators, and substantial gains in interpretive efficiency. Fractional-data experiments on CIFAR-10 confirm that curvature-aware encoders maintain predictive power under data scarcity, validating the predicted efficiency-curvature law. Overall, V-GIB provides a principled and measurable route to representations that are geometrically coherent, data-efficient, and aligned with human-understandable structure.
☆ BRAINS: A Retrieval-Augmented System for Alzheimer's Detection and Monitoring ICML
As the global burden of Alzheimer's disease (AD) continues to grow, early and accurate detection has become increasingly critical, especially in regions with limited access to advanced diagnostic tools. We propose BRAINS (Biomedical Retrieval-Augmented Intelligence for Neurodegeneration Screening) to address this challenge. This novel system harnesses the powerful reasoning capabilities of Large Language Models (LLMs) for Alzheimer's detection and monitoring. BRAINS features a dual-module architecture: a cognitive diagnostic module and a case-retrieval module. The Diagnostic Module utilizes LLMs fine-tuned on cognitive and neuroimaging datasets -- including MMSE, CDR scores, and brain volume metrics -- to perform structured assessments of Alzheimer's risk. Meanwhile, the Case Retrieval Module encodes patient profiles into latent representations and retrieves similar cases from a curated knowledge base. These auxiliary cases are fused with the input profile via a Case Fusion Layer to enhance contextual understanding. The combined representation is then processed with clinical prompts for inference. Evaluations on real-world datasets demonstrate BRAINS effectiveness in classifying disease severity and identifying early signs of cognitive decline. This system not only shows strong potential as an assistive tool for scalable, explainable, and early-stage Alzheimer's disease detection, but also offers hope for future applications in the field.
comment: Accepted for publication in ICMLA 2025
☆ Learning CNF formulas from uniform random solutions in the local lemma regime
We study the problem of learning a $n$-variables $k$-CNF formula $\Phi$ from its i.i.d. uniform random solutions, which is equivalent to learning a Boolean Markov random field (MRF) with $k$-wise hard constraints. Revisiting Valiant's algorithm (Commun. ACM'84), we show that it can exactly learn (1) $k$-CNFs with bounded clause intersection size under Lov\'asz local lemma type conditions, from $O(\log n)$ samples; and (2) random $k$-CNFs near the satisfiability threshold, from $\widetilde{O}(n^{\exp(-\sqrt{k})})$ samples. These results significantly improve the previous $O(n^k)$ sample complexity. We further establish new information-theoretic lower bounds on sample complexity for both exact and approximate learning from i.i.d. uniform random solutions.
☆ NOWS: Neural Operator Warm Starts for Accelerating Iterative Solvers
Partial differential equations (PDEs) underpin quantitative descriptions across the physical sciences and engineering, yet high-fidelity simulation remains a major computational bottleneck for many-query, real-time, and design tasks. Data-driven surrogates can be strikingly fast but are often unreliable when applied outside their training distribution. Here we introduce Neural Operator Warm Starts (NOWS), a hybrid strategy that harnesses learned solution operators to accelerate classical iterative solvers by producing high-quality initial guesses for Krylov methods such as conjugate gradient and GMRES. NOWS leaves existing discretizations and solver infrastructures intact, integrating seamlessly with finite-difference, finite-element, isogeometric analysis, finite volume method, etc. Across our benchmarks, the learned initialization consistently reduces iteration counts and end-to-end runtime, resulting in a reduction of the computational time of up to 90 %, while preserving the stability and convergence guarantees of the underlying numerical algorithms. By combining the rapid inference of neural operators with the rigor of traditional solvers, NOWS provides a practical and trustworthy approach to accelerate high-fidelity PDE simulations.
☆ SKGE: Spherical Knowledge Graph Embedding with Geometric Regularization
Knowledge graph embedding (KGE) has become a fundamental technique for representation learning on multi-relational data. Many seminal models, such as TransE, operate in an unbounded Euclidean space, which presents inherent limitations in modeling complex relations and can lead to inefficient training. In this paper, we propose Spherical Knowledge Graph Embedding (SKGE), a model that challenges this paradigm by constraining entity representations to a compact manifold: a hypersphere. SKGE employs a learnable, non-linear Spherization Layer to map entities onto the sphere and interprets relations as a hybrid translate-then-project transformation. Through extensive experiments on three benchmark datasets, FB15k-237, CoDEx-S, and CoDEx-M, we demonstrate that SKGE consistently and significantly outperforms its strong Euclidean counterpart, TransE, particularly on large-scale benchmarks such as FB15k-237 and CoDEx-M, demonstrating the efficacy of the spherical geometric prior. We provide an in-depth analysis to reveal the sources of this advantage, showing that this geometric constraint acts as a powerful regularizer, leading to comprehensive performance gains across all relation types. More fundamentally, we prove that the spherical geometry creates an "inherently hard negative sampling" environment, naturally eliminating trivial negatives and forcing the model to learn more robust and semantically coherent representations. Our findings compellingly demonstrate that the choice of manifold is not merely an implementation detail but a fundamental design principle, advocating for geometric priors as a cornerstone for designing the next generation of powerful and stable KGE models.
☆ Accounting for Underspecification in Statistical Claims of Model Superiority
Machine learning methods are increasingly applied in medical imaging, yet many reported improvements lack statistical robustness: recent works have highlighted that small but significant performance gains are highly likely to be false positives. However, these analyses do not take \emph{underspecification} into account -- the fact that models achieving similar validation scores may behave differently on unseen data due to random initialization or training dynamics. Here, we extend a recent statistical framework modeling false outperformance claims to include underspecification as an additional variance component. Our simulations demonstrate that even modest seed variability ($\sim1\%$) substantially increases the evidence required to support superiority claims. Our findings underscore the need for explicit modeling of training variance when validating medical imaging systems.
comment: Medical Imaging meets EurIPS Workshop: MedEurIPS 2025
☆ An Adaptive Sampling Framework for Detecting Localized Concept Drift under Label Scarcity
Concept drift and label scarcity are two critical challenges limiting the robustness of predictive models in dynamic industrial environments. Existing drift detection methods often assume global shifts and rely on dense supervision, making them ill-suited for regression tasks with local drifts and limited labels. This paper proposes an adaptive sampling framework that combines residual-based exploration and exploitation with EWMA monitoring to efficiently detect local concept drift under labeling budget constraints. Empirical results on synthetic benchmarks and a case study on electricity market demonstrate superior performance in label efficiency and drift detection accuracy.
☆ Improving Unlearning with Model Updates Probably Aligned with Gradients
We formulate the machine unlearning problem as a general constrained optimization problem. It unifies the first-order methods from the approximate machine unlearning literature. This paper then introduces the concept of feasible updates as the model's parameter update directions that help with unlearning while not degrading the utility of the initial model. Our design of feasible updates is based on masking, \ie\ a careful selection of the model's parameters worth updating. It also takes into account the estimation noise of the gradients when processing each batch of data to offer a statistical guarantee to derive locally feasible updates. The technique can be plugged in, as an add-on, to any first-order approximate unlearning methods. Experiments with computer vision classifiers validate this approach.
comment: Accepted to AISec'25 co-located with the 32nd ACM Conference on Computer and Communications Security
☆ Arithmetic Circuits and Neural Networks for Regular Matroids
We prove that there exist uniform $(+,\times,/)$-circuits of size $O(n^3)$ to compute the basis generating polynomial of regular matroids on $n$ elements. By tropicalization, this implies that there exist uniform $(\max,+,-)$-circuits and ReLU neural networks of the same size for weighted basis maximization of regular matroids. As a consequence in linear programming theory, we obtain a first example where taking the difference of two extended formulations can be more efficient than the best known individual extended formulation of size $O(n^6)$ by Aprile and Fiorini. Such differences have recently been introduced as virtual extended formulations. The proof of our main result relies on a fine-tuned version of Seymour's decomposition of regular matroids which allows us to identify and maintain graphic substructures to which we can apply a local version of the star-mesh transformation.
☆ MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization
The development of clinically reliable artificial intelligence (AI) systems for mammography is hindered by profound heterogeneity in data quality, metadata standards, and population distributions across public datasets. This heterogeneity introduces dataset-specific biases that severely compromise the generalizability of the model, a fundamental barrier to clinical deployment. We present MammoClean, a public framework for standardization and bias quantification in mammography datasets. MammoClean standardizes case selection, image processing (including laterality and intensity correction), and unifies metadata into a consistent multi-view structure. We provide a comprehensive review of breast anatomy, imaging characteristics, and public mammography datasets to systematically identify key sources of bias. Applying MammoClean to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify substantial distributional shifts in breast density and abnormality prevalence. Critically, we demonstrate the direct impact of data corruption: AI models trained on corrupted datasets exhibit significant performance degradation compared to their curated counterparts. By using MammoClean to identify and mitigate bias sources, researchers can construct unified multi-dataset training corpora that enable development of robust models with superior cross-domain generalization. MammoClean provides an essential, reproducible pipeline for bias-aware AI development in mammography, facilitating fairer comparisons and advancing the creation of safe, effective systems that perform equitably across diverse patient populations and clinical settings. The open-source code is publicly available from: https://github.com/Minds-R-Lab/MammoClean.
☆ A Spatially Informed Gaussian Process UCB Method for Decentralized Coverage Control
We present a novel decentralized algorithm for coverage control in unknown spatial environments modeled by Gaussian Processes (GPs). To trade-off between exploration and exploitation, each agent autonomously determines its trajectory by minimizing a local cost function. Inspired by the GP-UCB (Upper Confidence Bound for GPs) acquisition function, the proposed cost combines the expected locational cost with a variance-based exploration term, guiding agents toward regions that are both high in predicted density and model uncertainty. Compared to previous work, our algorithm operates in a fully decentralized fashion, relying only on local observations and communication with neighboring agents. In particular, agents periodically update their inducing points using a greedy selection strategy, enabling scalable online GP updates. We demonstrate the effectiveness of our algorithm in simulation.
Self-Supervised Moving Object Segmentation of Sparse and Noisy Radar Point Clouds SC 2025
Moving object segmentation is a crucial task for safe and reliable autonomous mobile systems like self-driving cars, improving the reliability and robustness of subsequent tasks like SLAM or path planning. While the segmentation of camera or LiDAR data is widely researched and achieves great results, it often introduces an increased latency by requiring the accumulation of temporal sequences to gain the necessary temporal context. Radar sensors overcome this problem with their ability to provide a direct measurement of a point's Doppler velocity, which can be exploited for single-scan moving object segmentation. However, radar point clouds are often sparse and noisy, making data annotation for use in supervised learning very tedious, time-consuming, and cost-intensive. To overcome this problem, we address the task of self-supervised moving object segmentation of sparse and noisy radar point clouds. We follow a two-step approach of contrastive self-supervised representation learning with subsequent supervised fine-tuning using limited amounts of annotated data. We propose a novel clustering-based contrastive loss function with cluster refinement based on dynamic points removal to pretrain the network to produce motion-aware representations of the radar data. Our method improves label efficiency after fine-tuning, effectively boosting state-of-the-art performance by self-supervised pretraining.
comment: Accepted for publication at IEEE International Conference on Intelligent Transportation Systems (ITSC 2025), 8 pages, 3 figures
☆ H-Infinity Filter Enhanced CNN-LSTM for Arrhythmia Detection from Heart Sound Recordings ICSE
Early detection of heart arrhythmia can prevent severe future complications in cardiac patients. While manual diagnosis still remains the clinical standard, it relies heavily on visual interpretation and is inherently subjective. In recent years, deep learning has emerged as a powerful tool to automate arrhythmia detection, offering improved accuracy, consistency, and efficiency. Several variants of convolutional and recurrent neural network architectures have been widely explored to capture spatial and temporal patterns in physiological signals. However, despite these advancements, current models often struggle to generalize well in real-world scenarios, especially when dealing with small or noisy datasets, which are common challenges in biomedical applications. In this paper, a novel CNN-H-Infinity-LSTM architecture is proposed to identify arrhythmic heart signals from heart sound recordings. This architecture introduces trainable parameters inspired by the H-Infinity filter from control theory, enhancing robustness and generalization. Extensive experimentation on the PhysioNet CinC Challenge 2016 dataset, a public benchmark of heart audio recordings, demonstrates that the proposed model achieves stable convergence and outperforms existing benchmarks, with a test accuracy of 99.42% and an F1 score of 98.85%.
comment: This is a preprint of a paper to appear at the 15th IEEE International Conference on Systems Engineering and Technology (ICSET 2025)
☆ AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to 95% attack success rate on Llama-3.1-8B within six turns a 24 percent improvement over single turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.
☆ A new class of Markov random fields enabling lightweight sampling
This work addresses the problem of efficient sampling of Markov random fields (MRF). The sampling of Potts or Ising MRF is most often based on Gibbs sampling, and is thus computationally expensive. We consider in this work how to circumvent this bottleneck through a link with Gaussian Markov Random fields. The latter can be sampled in several cost-effective ways, and we introduce a mapping from real-valued GMRF to discrete-valued MRF. The resulting new class of MRF benefits from a few theoretical properties that validate the new model. Numerical results show the drastic performance gain in terms of computational efficiency, as we sample at least 35x faster than Gibbs sampling using at least 37x less energy, all the while exhibiting empirical properties close to classical MRFs.
☆ LUMA-RAG: Lifelong Multimodal Agents with Provably Stable Streaming Alignment
Retrieval-Augmented Generation (RAG) has emerged as the dominant paradigm for grounding large language model outputs in verifiable evidence. However, as modern AI agents transition from static knowledge bases to continuous multimodal streams encompassing text, images, video, and audio, two critical challenges arise: maintaining index freshness without prohibitive re-indexing costs, and preserving cross-modal semantic consistency across heterogeneous embedding spaces. We present LUMA-RAG, a lifelong multimodal agent architecture featuring three key innovations: (i) a streaming, multi-tier memory system that dynamically spills embeddings from a hot HNSW tier to a compressed IVFPQ tier under strict memory budgets; (ii) a streaming CLAP->CLIP alignment bridge that maintains cross-modal consistency through incremental orthogonal Procrustes updates; and (iii) stability-aware retrieval telemetry providing Safe@k guarantees by jointly bounding alignment drift and quantization error. Experiments demonstrate robust text-to-image retrieval (Recall@10 = 0.94), graceful performance degradation under product quantization offloading, and provably stable audio-to-image rankings (Safe@1 = 1.0), establishing LUMA-RAG as a practical framework for production multimodal RAG systems.
☆ Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation CIKM 2025
Query augmentation makes queries more meaningful by appending further information to the queries to find relevant documents. Current studies have proposed Large Language Model (LLM)-based embedders, which learn representation for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLM. During inference, these jointly trained embedders have conducted query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency and query augmentation can be detrimental to performance for some queries. Also, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level. One includes queries that require augmentation and the other includes queries that do not. Then, we introduces a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for others. Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, providing much faster embedding latency.
comment: Accepted to MMGenSR Workshop (CIKM 2025)
☆ An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks
The widespread deployment of Large Language Models (LLMs) as public-facing web services and APIs has made their security a core concern for the web ecosystem. Jailbreak attacks, as one of the significant threats to LLMs, have recently attracted extensive research. In this paper, we reveal a jailbreak strategy which can effectively evade current defense strategies. It can extract valuable information from failed or partially successful attack attempts and contains self-evolution from attack interactions, resulting in sufficient strategy diversity and adaptability. Inspired by continuous learning and modular design principles, we propose ASTRA, a jailbreak framework that autonomously discovers, retrieves, and evolves attack strategies to achieve more efficient and adaptive attacks. To enable this autonomous evolution, we design a closed-loop "attack-evaluate-distill-reuse" core mechanism that not only generates attack prompts but also automatically distills and generalizes reusable attack strategies from every interaction. To systematically accumulate and apply this attack knowledge, we introduce a three-tier strategy library that categorizes strategies into Effective, Promising, and Ineffective based on their performance scores. The strategy library not only provides precise guidance for attack generation but also possesses exceptional extensibility and transferability. We conduct extensive experiments under a black-box setting, and the results show that ASTRA achieves an average Attack Success Rate (ASR) of 82.7%, significantly outperforming baselines.
☆ Evolving Graph Learning for Out-of-Distribution Generalization in Non-stationary Environments
Graph neural networks have shown remarkable success in exploiting the spatial and temporal patterns on dynamic graphs. However, existing GNNs exhibit poor generalization ability under distribution shifts, which is inevitable in dynamic scenarios. As dynamic graph generation progresses amid evolving latent non-stationary environments, it is imperative to explore their effects on out-of-distribution (OOD) generalization. This paper proposes a novel Evolving Graph Learning framework for OOD generalization (EvoOOD) by environment-aware invariant pattern recognition. Specifically, we first design an environment sequential variational auto-encoder to model environment evolution and infer the underlying environment distribution. Then, we introduce a mechanism for environment-aware invariant pattern recognition, tailored to address environmental diversification through inferred distributions. Finally, we conduct fine-grained causal interventions on individual nodes using a mixture of instantiated environment samples. This approach helps to distinguish spatio-temporal invariant patterns for OOD prediction, especially in non-stationary environments. Experimental results demonstrate the superiority of EvoGOOD on both real-world and synthetic dynamic datasets under distribution shifts. To the best of our knowledge, it is the first attempt to study the dynamic graph OOD generalization problem from the environment evolution perspective.
☆ Human-Machine Ritual: Synergic Performance through Real-Time Motion Recognition NeurIPS 2025
We introduce a lightweight, real-time motion recognition system that enables synergic human-machine performance through wearable IMU sensor data, MiniRocket time-series classification, and responsive multimedia control. By mapping dancer-specific movement to sound through somatic memory and association, we propose an alternative approach to human-machine collaboration, one that preserves the expressive depth of the performing body while leveraging machine learning for attentive observation and responsiveness. We demonstrate that this human-centered design reliably supports high accuracy classification (<50 ms latency), offering a replicable framework to integrate dance-literate machines into creative, educational, and live performance contexts.
comment: 8 pages, 5 figures. Camera-ready manuscript for the Creative AI Track of NeurIPS 2025
☆ Reducing normalizing flow complexity for MCMC preconditioning
Preconditioning is a key component of MCMC algorithms that improves sampling efficiency by facilitating exploration of geometrically complex target distributions through an invertible map. While linear preconditioners are often sufficient for moderately complex target distributions, recent work has explored nonlinear preconditioning with invertible neural networks as components of normalizing flows (NFs). However, empirical and theoretical studies show that overparameterized NF preconditioners can degrade sampling efficiency and fit quality. Moreover, existing NF-based approaches do not adapt their architectures to the target distribution. Related work outside of MCMC similarly finds that suitably parameterized NFs can achieve comparable or superior performance with substantially less training time or data. We propose a factorized preconditioning architecture that reduces NF complexity by combining a linear component with a conditional NF, improving adaptability to target geometry. The linear preconditioner is applied to dimensions that are approximately Gaussian, as estimated from warmup samples, while the conditional NF models more complex dimensions. Our method yields significantly better tail samples on two complex synthetic distributions and consistently better performance on a sparse logistic regression posterior across varying likelihood and prior strengths. It also achieves higher effective sample sizes on hierarchical Bayesian model posteriors with weak likelihoods and strong funnel geometries. This approach is particularly relevant for hierarchical Bayesian model analyses with limited data and could inform current theoretical and software strides in neural MCMC design.
comment: 22 pages, 6 figures
☆ Learning A Universal Crime Predictor with Knowledge-guided Hypernetworks ECAI 2025
Predicting crimes in urban environments is crucial for public safety, yet existing prediction methods often struggle to align the knowledge across diverse cities that vary dramatically in data availability of specific crime types. We propose HYpernetwork-enhanced Spatial Temporal Learning (HYSTL), a framework that can effectively train a unified, stronger crime predictor without assuming identical crime types in different cities' records. In HYSTL, instead of parameterising a dedicated predictor per crime type, a hypernetwork is designed to dynamically generate parameters for the prediction function conditioned on the crime type of interest. To bridge the semantic gap between different crime types, a structured crime knowledge graph is built, where the learned representations of crimes are used as the input to the hypernetwork to facilitate parameter generation. As such, when making predictions for each crime type, the predictor is additionally guided by its intricate association with other relevant crime types. Extensive experiments are performed on two cities with non-overlapping crime types, and the results demonstrate HYSTL outperforms state-of-the-art baselines.
comment: Accepted by ECAI 2025
☆ RoME: Domain-Robust Mixture-of-Experts for MILP Solution Prediction across Domains
Mixed-Integer Linear Programming (MILP) is a fundamental and powerful framework for modeling complex optimization problems across diverse domains. Recently, learning-based methods have shown great promise in accelerating MILP solvers by predicting high-quality solutions. However, most existing approaches are developed and evaluated in single-domain settings, limiting their ability to generalize to unseen problem distributions. This limitation poses a major obstacle to building scalable and general-purpose learning-based solvers. To address this challenge, we introduce RoME, a domain-Robust Mixture-of-Experts framework for predicting MILP solutions across domains. RoME dynamically routes problem instances to specialized experts based on learned task embeddings. The model is trained using a two-level distributionally robust optimization strategy: inter-domain to mitigate global shifts across domains, and intra-domain to enhance local robustness by introducing perturbations on task embeddings. We reveal that cross-domain training not only enhances the model's generalization capability to unseen domains but also improves performance within each individual domain by encouraging the model to capture more general intrinsic combinatorial patterns. Specifically, a single RoME model trained on three domains achieves an average improvement of 67.7% then evaluated on five diverse domains. We further test the pretrained model on MIPLIB in a zero-shot setting, demonstrating its ability to deliver measurable performance gains on challenging real-world instances where existing learning-based approaches often struggle to generalize.
☆ Large-scale automatic carbon ion treatment planning for head and neck cancers via parallel multi-agent reinforcement learning
Head-and-neck cancer (HNC) planning is difficult because multiple critical organs-at-risk (OARs) are close to complex targets. Intensity-modulated carbon-ion therapy (IMCT) offers superior dose conformity and OAR sparing but remains slow due to relative biological effectiveness (RBE) modeling, leading to laborious, experience-based, and often suboptimal tuning of many treatment-planning parameters (TPPs). Recent deep learning (DL) methods are limited by data bias and plan feasibility, while reinforcement learning (RL) struggles to efficiently explore the exponentially large TPP search space. We propose a scalable multi-agent RL (MARL) framework for parallel tuning of 45 TPPs in IMCT. It uses a centralized-training decentralized-execution (CTDE) QMIX backbone with Double DQN, Dueling DQN, and recurrent encoding (DRQN) for stable learning in a high-dimensional, non-stationary environment. To enhance efficiency, we (1) use compact historical DVH vectors as state inputs, (2) apply a linear action-to-value transform mapping small discrete actions to uniform parameter adjustments, and (3) design an absolute, clinically informed piecewise reward aligned with plan scores. A synchronous multi-process worker system interfaces with the PHOENIX TPS for parallel optimization and accelerated data collection. On a head-and-neck dataset (10 training, 10 testing), the method tuned 45 parameters simultaneously and produced plans comparable to or better than expert manual ones (relative plan score: RL $85.93\pm7.85%$ vs Manual $85.02\pm6.92%$), with significant (p-value $<$ 0.05) improvements for five OARs. The framework efficiently explores high-dimensional TPP spaces and generates clinically competitive IMCT plans through direct TPS interaction, notably improving OAR sparing.
☆ The Sequential Edge: Inverse-Entropy Voting Beats Parallel Self-Consistency at Matched Compute
We revisit test-time scaling for language model reasoning and ask a fundamental question: at equal token budget and compute, is it better to run multiple independent chains in parallel, or to run fewer chains that iteratively refine through sequential steps? Through comprehensive evaluation across 5 state-of-the-art open source models and 3 challenging reasoning benchmarks, we find that sequential scaling where chains explicitly build upon previous attempts consistently outperforms the dominant parallel self-consistency paradigm in 95.6% of configurations with gains in accuracy upto 46.7%. Further, we introduce inverse-entropy weighted voting, a novel training-free method to further boost the accuracy of sequential scaling. By weighing answers in proportion to the inverse entropy of their reasoning chains, we increase our success rate over parallel majority and establish it as the optimal test-time scaling strategy. Our findings fundamentally challenge the parallel reasoning orthodoxy that has dominated test-time scaling since Wang et al.'s self-consistency decoding (Wang et al., 2022), positioning sequential refinement as the robust default for modern LLM reasoning and necessitating a paradigm shift in how we approach inference-time optimization.
☆ Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning
We study the problem of learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks enables the decomposition of complex tasks into simpler sub-tasks that can be assigned to agents. However, existing approaches remain sample-inefficient and are limited to the single-task case. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify the main challenges to ACC-MARL's feasibility in practice, propose solutions, and prove the correctness of our approach. We further show that the value functions of learned policies can be used to assign tasks optimally at test time. Experiments show emergent task-aware, multi-step coordination among agents, e.g., pressing a button to unlock a door, holding the door, and short-circuiting tasks.
☆ FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error
Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computation and reduce memory footprint, existing implementations still rely on BF16-dominated dataflows with frequent quantize-dequantize (Q/DQ) conversions. These redundant casts erode much of FP8's theoretical efficiency. However, naively removing these casts by keeping dataflows entirely in FP8 introduces double quantization error: tensors quantized along different dimensions accumulate inconsistent scaling factors, degrading numerical stability. We propose FP8-Flow-MoE, an FP8 training recipe featuring a quantization-consistent FP8-centric dataflow with a scaling-aware transpose and fused FP8 operators that streamline computation and eliminate explicit cast operations from 12 to 2. Evaluations on a 671B-parameter MoE model demonstrate up to 21\% higher throughput and 16.5 GB lower memory usage per GPU compared to BF16 and na\"ive FP8 baselines, while maintaining stable convergence. We provide a plug-and-play FP8 recipe compatible with TransformerEngine and Megatron-LM, which will be open-sourced soon.
☆ Federated Quantum Kernel Learning for Anomaly Detection in Multivariate IoT Time-Series
The rapid growth of industrial Internet of Things (IIoT) systems has created new challenges for anomaly detection in high-dimensional, multivariate time-series, where privacy, scalability, and communication efficiency are critical. Classical federated learning approaches mitigate privacy concerns by enabling decentralized training, but they often struggle with highly non-linear decision boundaries and imbalanced anomaly distributions. To address this gap, we propose a Federated Quantum Kernel Learning (FQKL) framework that integrates quantum feature maps with federated aggregation to enable distributed, privacy-preserving anomaly detection across heterogeneous IoT networks. In our design, quantum edge nodes locally compute compressed kernel statistics using parameterized quantum circuits and share only these summaries with a central server, which constructs a global Gram matrix and trains a decision function (e.g., Fed-QSVM). Experimental results on synthetic IIoT benchmarks demonstrate that FQKL achieves superior generalization in capturing complex temporal correlations compared to classical federated baselines, while significantly reducing communication overhead. This work highlights the promise of quantum kernels in federated settings, advancing the path toward scalable, robust, and quantum-enhanced intelligence for next-generation IoT infrastructures.
☆ Reinforcement learning based data assimilation for unknown state model
Data assimilation (DA) has increasingly emerged as a critical tool for state estimation across a wide range of applications. It is signiffcantly challenging when the governing equations of the underlying dynamics are unknown. To this end, various machine learning approaches have been employed to construct a surrogate state transition model in a supervised learning framework, which relies on pre-computed training datasets. However, it is often infeasible to obtain noise-free ground-truth state sequences in practice. To address this challenge, we propose a novel method that integrates reinforcement learning with ensemble-based Bayesian ffltering methods, enabling the learning of surrogate state transition model for unknown dynamics directly from noisy observations, without using true state trajectories. Speciffcally, we treat the process for computing maximum likelihood estimation of surrogate model parameters as a sequential decision-making problem, which can be formulated as a discretetime Markov decision process (MDP). Under this formulation, learning the surrogate transition model is equivalent to ffnding an optimal policy of the MDP, which can be effectively addressed using reinforcement learning techniques. Once the model is trained offfine, state estimation can be performed in the online stage using ffltering methods based on the learned dynamics. The proposed framework accommodates a wide range of observation scenarios, including nonlinear and partially observed measurement models. A few numerical examples demonstrate that the proposed method achieves superior accuracy and robustness in high-dimensional settings.
☆ Gradient-Variation Online Adaptivity for Accelerated Optimization with Hölder Smoothness NeurIPS 2025
Smoothness is known to be crucial for acceleration in offline optimization, and for gradient-variation regret minimization in online learning. Interestingly, these two problems are actually closely connected -- accelerated optimization can be understood through the lens of gradient-variation online learning. In this paper, we investigate online learning with H\"older smooth functions, a general class encompassing both smooth and non-smooth (Lipschitz) functions, and explore its implications for offline optimization. For (strongly) convex online functions, we design the corresponding gradient-variation online learning algorithm whose regret smoothly interpolates between the optimal guarantees in smooth and non-smooth regimes. Notably, our algorithms do not require prior knowledge of the H\"older smoothness parameter, exhibiting strong adaptivity over existing methods. Through online-to-batch conversion, this gradient-variation online adaptivity yields an optimal universal method for stochastic convex optimization under H\"older smoothness. However, achieving universality in offline strongly convex optimization is more challenging. We address this by integrating online adaptivity with a detection-based guess-and-check procedure, which, for the first time, yields a universal offline method that achieves accelerated convergence in the smooth regime while maintaining near-optimal convergence in the non-smooth one.
comment: NeurIPS 2025
☆ Probabilistic Graph Cuts
Probabilistic relaxations of graph cuts offer a differentiable alternative to spectral clustering, enabling end-to-end and online learning without eigendecompositions, yet prior work centered on RatioCut and lacked general guarantees and principled gradients. We present a unified probabilistic framework that covers a wide class of cuts, including Normalized Cut. Our framework provides tight analytic upper bounds on expected discrete cuts via integral representations and Gauss hypergeometric functions with closed-form forward and backward. Together, these results deliver a rigorous, numerically stable foundation for scalable, differentiable graph partitioning covering a wide range of clustering and contrastive learning objectives.
comment: 23 pages
☆ Limit Theorems for Stochastic Gradient Descent in High-Dimensional Single-Layer Networks
This paper studies the high-dimensional scaling limits of online stochastic gradient descent (SGD) for single-layer networks. Building on the seminal work of Saad and Solla, which analyzed the deterministic (ballistic) scaling limits of SGD corresponding to the gradient flow of the population loss, we focus on the critical scaling regime of the step size. Below this critical scale, the effective dynamics are governed by ballistic (ODE) limits, but at the critical scale, new correction term appears that changes the phase diagram. In this regime, near the fixed points, the corresponding diffusive (SDE) limits of the effective dynamics reduces to an Ornstein-Uhlenbeck process under certain conditions. These results highlight how the information exponent controls sample complexity and illustrates the limitations of deterministic scaling limit in capturing the stochastic fluctuations of high-dimensional learning dynamics.
☆ From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models
Serving large generative models such as LLMs and multi- modal transformers requires balancing user-facing SLOs (e.g., time-to-first-token, time-between-tokens) with provider goals of efficiency and cost reduction. Existing solutions rely on static provisioning or model-level autoscaling, both of which treat the model as a monolith. This coarse-grained resource management leads to degraded performance or significant resource underutilization due to poor adaptability to dynamic inference traffic that is common online. The root cause of this inefficiency lies in the internal structure of generative models: they are executed as graphs of interconnected operators. Through detailed characterization and systematic analysis, we find that operators are heterogeneous in their compute and memory footprints and exhibit diverse sensitivity to workload and resource factors such as batch size, sequence length, and traffic rate. This heterogeneity suggests that the operator, rather than the entire model, is the right granularity for scaling decisions. We propose an operator-level autoscaling framework, which allocates resources at finer (operator)-granularity, optimizing the scaling, batching, and placement based on individual operator profiles. Evaluated on production-scale traces, our approach preserves SLOs with up to 40% fewer GPUs and 35% less energy, or under fixed resources achieves 1.6x higher throughput with 5% less energy. These results show that the operator, rather than the model, is fundamentally a more effective unit for scaling large generative workloads.
comment: 16 pages, 13 figures
☆ Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results
Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use-cases of LLMs. However, chatbots used in medical contexts must provide consistent advice in situations where non-medical factors are involved, such as when demographic information is present. In order to understand the conditions under which medical chatbots fail to perform as expected, we develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions that we subsequently use to prompt LLMs. In 2), our evaluation pipeline provides hallucination and omission detection using LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge treatment category detectors. As a baseline study, we perform two case studies on inter-LLM agreement and the impact of varying the answering and evaluation LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen's Kappa $\kappa=0.118$), and only specific (answering, evaluation) LLM pairs yield statistically significant differences across writing styles, genders, and races. We recommend that studies using LLM evaluation use multiple LLMs as evaluators in order to avoid arriving at statistically significant but non-generalizable results, particularly in the absence of ground-truth data. We also suggest publishing inter-LLM agreement metrics for transparency. Our code and dataset are available here: https://github.com/BBN-E/medic-neurips-2025-demo.
☆ Neural network initialization with nonlinear characteristics and information on spectral bias
Initialization of neural network parameters, such as weights and biases, has a crucial impact on learning performance; if chosen well, we can even avoid the need for additional training with backpropagation. For example, algorithms based on the ridgelet transform or the SWIM (sampling where it matters) concept have been proposed for initialization. On the other hand, it is well-known that neural networks tend to learn coarse information in the earlier layers. The feature is called spectral bias. In this work, we investigate the effects of utilizing information on the spectral bias in the initialization of neural networks. Hence, we propose a framework that adjusts the scale factors in the SWIM algorithm to capture low-frequency components in the early-stage hidden layers and to represent high-frequency components in the late-stage hidden layers. Numerical experiments on a one-dimensional regression task and the MNIST classification task demonstrate that the proposed method outperforms the conventional initialization algorithms. This work clarifies the importance of intrinsic spectral properties in learning neural networks, and the finding yields an effective parameter initialization strategy that enhances their training performance.
comment: 7 pages, 7 figures
☆ Structural Plasticity as Active Inference: A Biologically-Inspired Architecture for Homeostatic Control
Traditional neural networks, while powerful, rely on biologically implausible learning mechanisms such as global backpropagation. This paper introduces the Structurally Adaptive Predictive Inference Network (SAPIN), a novel computational model inspired by the principles of active inference and the morphological plasticity observed in biological neural cultures. SAPIN operates on a 2D grid where processing units, or cells, learn by minimizing local prediction errors. The model features two primary, concurrent learning mechanisms: a local, Hebbian-like synaptic plasticity rule based on the temporal difference between a cell's actual activation and its learned expectation, and a structural plasticity mechanism where cells physically migrate across the grid to optimize their information-receptive fields. This dual approach allows the network to learn both how to process information (synaptic weights) and also where to position its computational resources (network topology). We validated the SAPIN model on the classic Cart Pole reinforcement learning benchmark. Our results demonstrate that the architecture can successfully solve the CartPole task, achieving robust performance. The network's intrinsic drive to minimize prediction error and maintain homeostasis was sufficient to discover a stable balancing policy. We also found that while continual learning led to instability, locking the network's parameters after achieving success resulted in a stable policy. When evaluated for 100 episodes post-locking (repeated over 100 successful agents), the locked networks maintained an average 82% success rate.
☆ Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining
An increasing number of LLMs employ Mixture-of-Experts (MoE) architectures where the feed-forward layer is replaced by a pool of experts and each token only activates a small subset of them. During autoregressive generation, these models often enter a memory-bound regime even for moderate batch sizes because the average expert load grows more slowly than in an equivalent dense feedforward layer. Consequently, MoE latency is governed by the number of activated experts. We introduce a framework for dynamically re-routing token-to-expert mapping to lower this number (and thus, the decode latency) while preserving a comparable quality. Our best results use a batch-aware routing that works by having tokens piggyback experts that have already been loaded into memory due to being crucial to other tokens within the same batch. Empirically, we evaluate our method on the Qwen3-30B and Qwen3-235B models with a batch size of $16$. Without any statistically significant loss in accuracy, our approach achieves latency reductions of $39\%$ and $15\%$ in the MoE layer decode latency, respectively.
comment: 18 pages, 9 figures, 10 tables
☆ Learning Interactive World Model for Object-Centric Reinforcement Learning NeurIPS 2025
Agents that understand objects and their interactions can learn policies that are more robust and transferable. However, most object-centric RL methods factor state by individual objects while leaving interactions implicit. We introduce the Factored Interactive Object-Centric World Model (FIOC-WM), a unified framework that learns structured representations of both objects and their interactions within a world model. FIOC-WM captures environment dynamics with disentangled and modular representations of object interactions, improving sample efficiency and generalization for policy learning. Concretely, FIOC-WM first learns object-centric latents and an interaction structure directly from pixels, leveraging pre-trained vision encoders. The learned world model then decomposes tasks into composable interaction primitives, and a hierarchical policy is trained on top: a high level selects the type and order of interactions, while a low level executes them. On simulated robotic and embodied-AI benchmarks, FIOC-WM improves policy-learning sample efficiency and generalization over world-model baselines, indicating that explicit, modular interaction learning is crucial for robust control.
comment: NeurIPS 2025
☆ Optimizing Multi-Lane Intersection Performance in Mixed Autonomy Environments
One of the main challenges in managing traffic at multilane intersections is ensuring smooth coordination between human-driven vehicles (HDVs) and connected autonomous vehicles (CAVs). This paper presents a novel traffic signal control framework that combines Graph Attention Networks (GAT) with Soft Actor-Critic (SAC) reinforcement learning to address this challenge. GATs are used to model the dynamic graph- structured nature of traffic flow to capture spatial and temporal dependencies between lanes and signal phases. The proposed SAC is a robust off-policy reinforcement learning algorithm that enables adaptive signal control through entropy-optimized decision making. This design allows the system to coordinate the signal timing and vehicle movement simultaneously with objectives focused on minimizing travel time, enhancing performance, ensuring safety, and improving fairness between HDVs and CAVs. The model is evaluated using a SUMO-based simulation of a four-way intersection and incorporating different traffic densities and CAV penetration rates. The experimental results demonstrate the effectiveness of the GAT-SAC approach by achieving a 24.1% reduction in average delay and up to 29.2% fewer traffic violations compared to traditional methods. Additionally, the fairness ratio between HDVs and CAVs improved to 1.59, indicating more equitable treatment across vehicle types. These findings suggest that the GAT-SAC framework holds significant promise for real-world deployment in mixed-autonomy traffic systems.
☆ Training Proactive and Personalized LLM Agents
While existing work focuses primarily on task success, we argue that effective real-world agents require optimizing three dimensions: productivity (task completion), proactivity (asking essential questions), and personalization (adapting to diverse user preferences). We introduce UserVille, an interactive environment with LLM-based user simulators enabling diverse, configurable user preferences. Leveraging UserVille, we introduce PPP, a multi-objective reinforcement learning approach that jointly optimizes all three dimensions: Productivity, Proactivity, and Personalization. Experiments on software engineering and deep research tasks show that agents trained with PPP achieve substantial improvements over strong baselines such as GPT-5 (+21.6 on average), demonstrating the ability to ask strategic clarifying questions, adapt to unseen user preferences, and improve task success through better interaction. This work demonstrates that explicitly optimizing for user-centered interaction is critical for building practical and effective AI agents.
☆ OmniField: Conditioned Neural Fields for Robust Multimodal Spatiotemporal Learning
Multimodal spatiotemporal learning on real-world experimental data is constrained by two challenges: within-modality measurements are sparse, irregular, and noisy (QA/QC artifacts) but cross-modally correlated; the set of available modalities varies across space and time, shrinking the usable record unless models can adapt to arbitrary subsets at train and test time. We propose OmniField, a continuity-aware framework that learns a continuous neural field conditioned on available modalities and iteratively fuses cross-modal context. A multimodal crosstalk block architecture paired with iterative cross-modal refinement aligns signals prior to the decoder, enabling unified reconstruction, interpolation, forecasting, and cross-modal prediction without gridding or surrogate preprocessing. Extensive evaluations show that OmniField consistently outperforms eight strong multimodal spatiotemporal baselines. Under heavy simulated sensor noise, performance remains close to clean-input levels, highlighting robustness to corrupted measurements.
comment: 25 pages, 12 figures, 8 tables
☆ Personalized Decision Modeling: Utility Optimization or Textualized-Symbolic Reasoning
Decision-making models for individuals, particularly in high-stakes scenarios like vaccine uptake, often diverge from population optimal predictions. This gap arises from the uniqueness of the individual decision-making process, shaped by numerical attributes (e.g., cost, time) and linguistic influences (e.g., personal preferences and constraints). Developing upon Utility Theory and leveraging the textual-reasoning capabilities of Large Language Models (LLMs), this paper proposes an Adaptive Textual-symbolic Human-centric Reasoning framework (ATHENA) to address the optimal information integration. ATHENA uniquely integrates two stages: First, it discovers robust, group-level symbolic utility functions via LLM-augmented symbolic discovery; Second, it implements individual-level semantic adaptation, creating personalized semantic templates guided by the optimal utility to model personalized choices. Validated on real-world travel mode and vaccine choice tasks, ATHENA consistently outperforms utility-based, machine learning, and other LLM-based models, lifting F1 score by at least 6.5% over the strongest cutting-edge models. Further, ablation studies confirm that both stages of ATHENA are critical and complementary, as removing either clearly degrades overall predictive performance. By organically integrating symbolic utility modeling and semantic adaptation, ATHENA provides a new scheme for modeling human-centric decisions. The project page can be found at https://yibozh.github.io/Athena.
☆ PrivGNN: High-Performance Secure Inference for Cryptographic Graph Neural Networks
Graph neural networks (GNNs) are powerful tools for analyzing and learning from graph-structured (GS) data, facilitating a wide range of services. Deploying such services in privacy-critical cloud environments necessitates the development of secure inference (SI) protocols that safeguard sensitive GS data. However, existing SI solutions largely focus on convolutional models for image and text data, leaving the challenge of securing GNNs and GS data relatively underexplored. In this work, we design, implement, and evaluate $\sysname$, a lightweight cryptographic scheme for graph-centric inference in the cloud. By hybridizing additive and function secret sharings within secure two-party computation (2PC), $\sysname$ is carefully designed based on a series of novel 2PC interactive protocols that achieve $1.5\times \sim 1.7\times$ speedups for linear layers and $2\times \sim 15\times$ for non-linear layers over state-of-the-art (SotA) solutions. A thorough theoretical analysis is provided to prove $\sysname$'s correctness, security, and lightweight nature. Extensive experiments across four datasets demonstrate $\sysname$'s superior efficiency with $1.3\times \sim 4.7\times$ faster secure predictions while maintaining accuracy comparable to plaintext graph property inference.
comment: Accepted to FC'25
☆ Tackling Incomplete Data in Air Quality Prediction: A Bayesian Deep Learning Framework for Uncertainty Quantification
Accurate air quality forecasts are vital for public health alerts, exposure assessment, and emissions control. In practice, observational data are often missing in varying proportions and patterns due to collection and transmission issues. These incomplete spatiotemporal records impede reliable inference and risk assessment and can lead to overconfident extrapolation. To address these challenges, we propose an end to end framework, the channel gated learning unit based spatiotemporal bayesian neural field (CGLUBNF). It uses Fourier features with a graph attention encoder to capture multiscale spatial dependencies and seasonal temporal dynamics. A channel gated learning unit, equipped with learnable activations and gated residual connections, adaptively filters and amplifies informative features. Bayesian inference jointly optimizes predictive distributions and parameter uncertainty, producing point estimates and calibrated prediction intervals. We conduct a systematic evaluation on two real world datasets, covering four typical missing data patterns and comparing against five state of the art baselines. CGLUBNF achieves superior prediction accuracy and sharper confidence intervals. In addition, we further validate robustness across multiple prediction horizons and analysis the contribution of extraneous variables. This research lays a foundation for reliable deep learning based spatio-temporal forecasting with incomplete observations in emerging sensing paradigms, such as real world vehicle borne mobile monitoring.
☆ Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs
As large language models (LLMs) continue to scale, their workloads increasingly rely on distributed execution across multiple GPUs. However, the conventional bulk synchronous parallel~(BSP) model used in such settings introduces significant performance inefficiencies. To characterize these bottlenecks, we introduce the ''Three Taxes'' (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. By exploiting libraries like Iris for Triton, we gain access to in-kernel communication primitives that enable the design of novel fine-grained programming patterns, offering greater flexibility and performance than traditional BSP-based approaches. These patterns systematically eliminate the three taxes by creating direct, tile-level producer-consumer pipelines and replacing global barriers with fine-grained dataflow synchronization. Applying this methodology to critical kernels, from the foundational All-Gather + general matrix multiplication operation to the complex Flash Decode algorithm, we observe a 10-20% speedup in end-to-end latency over BSP-based approaches, establishing a more programmable and efficient paradigm for distributed LLM workloads.
☆ ScenicProver: A Framework for Compositional Probabilistic Verification of Learning-Enabled Systems
Full verification of learning-enabled cyber-physical systems (CPS) has long been intractable due to challenges including black-box components and complex real-world environments. Existing tools either provide formal guarantees for limited types of systems or test the system as a monolith, but no general framework exists for compositional analysis of learning-enabled CPS using varied verification techniques over complex real-world environments. This paper introduces ScenicProver, a verification framework that aims to fill this gap. Built upon the Scenic probabilistic programming language, the framework supports: (1) compositional system description with clear component interfaces, ranging from interpretable code to black boxes; (2) assume-guarantee contracts over those components using an extension of Linear Temporal Logic containing arbitrary Scenic expressions; (3) evidence generation through testing, formal proofs via Lean 4 integration, and importing external assumptions; (4) systematic combination of generated evidence using contract operators; and (5) automatic generation of assurance cases tracking the provenance of system-level guarantees. We demonstrate the framework's effectiveness through a case study on an autonomous vehicle's automatic emergency braking system with sensor fusion. By leveraging manufacturer guarantees for radar and laser sensors and focusing testing efforts on uncertain conditions, our approach enables stronger probabilistic guarantees than monolithic testing with the same computational budget.
comment: 26 pages, 4 figures. Full version (including appendices) of a paper submitted to TACAS 2026
☆ Near Optimal Convergence to Coarse Correlated Equilibrium in General-Sum Markov Games
No-regret learning dynamics play a central role in game theory, enabling decentralized convergence to equilibrium for concepts such as Coarse Correlated Equilibrium (CCE) or Correlated Equilibrium (CE). In this work, we improve the convergence rate to CCE in general-sum Markov games, reducing it from the previously best-known rate of $\mathcal{O}(\log^5 T / T)$ to a sharper $\mathcal{O}(\log T / T)$. This matches the best known convergence rate for CE in terms of $T$, number of iterations, while also improving the dependence on the action set size from polynomial to polylogarithmic-yielding exponential gains in high-dimensional settings. Our approach builds on recent advances in adaptive step-size techniques for no-regret algorithms in normal-form games, and extends them to the Markovian setting via a stage-wise scheme that adjusts learning rates based on real-time feedback. We frame policy updates as an instance of Optimistic Follow-the-Regularized-Leader (OFTRL), customized for value-iteration-based learning. The resulting self-play algorithm achieves, to our knowledge, the fastest known convergence rate to CCE in Markov games.
☆ ProtoTSNet: Interpretable Multivariate Time Series Classification With Prototypical Parts
Time series data is one of the most popular data modalities in critical domains such as industry and medicine. The demand for algorithms that not only exhibit high accuracy but also offer interpretability is crucial in such fields, as decisions made there bear significant consequences. In this paper, we present ProtoTSNet, a novel approach to interpretable classification of multivariate time series data, through substantial enhancements to the ProtoPNet architecture. Our method is tailored to overcome the unique challenges of time series analysis, including capturing dynamic patterns and handling varying feature significance. Central to our innovation is a modified convolutional encoder utilizing group convolutions, pre-trainable as part of an autoencoder and designed to preserve and quantify feature importance. We evaluated our model on 30 multivariate time series datasets from the UEA archive, comparing our approach with existing explainable methods as well as non-explainable baselines. Through comprehensive evaluation and ablation studies, we demonstrate that our approach achieves the best performance among ante-hoc explainable methods while maintaining competitive performance with non-explainable and post-hoc explainable approaches, providing interpretable results accessible to domain experts.
comment: 30 pages, 10 figures
☆ CFL: On the Use of Characteristic Function Loss for Domain Alignment in Machine Learning
Machine Learning (ML) models are extensively used in various applications due to their significant advantages over traditional learning methods. However, the developed ML models often underperform when deployed in the real world due to the well-known distribution shift problem. This problem can lead to a catastrophic outcomes when these decision-making systems have to operate in high-risk applications. Many researchers have previously studied this problem in ML, known as distribution shift problem, using statistical techniques (such as Kullback-Leibler, Kolmogorov-Smirnov Test, Wasserstein distance, etc.) to quantify the distribution shift. In this letter, we show that using Characteristic Function (CF) as a frequency domain approach is a powerful alternative for measuring the distribution shift in high-dimensional space and for domain adaptation.
☆ Disentangling Causal Substructures for Interpretable and Generalizable Drug Synergy Prediction
Drug synergy prediction is a critical task in the development of effective combination therapies for complex diseases, including cancer. Although existing methods have shown promising results, they often operate as black-box predictors that rely predominantly on statistical correlations between drug characteristics and results. To address this limitation, we propose CausalDDS, a novel framework that disentangles drug molecules into causal and spurious substructures, utilizing the causal substructure representations for predicting drug synergy. By focusing on causal sub-structures, CausalDDS effectively mitigates the impact of redundant features introduced by spurious substructures, enhancing the accuracy and interpretability of the model. In addition, CausalDDS employs a conditional intervention mechanism, where interventions are conditioned on paired molecular structures, and introduces a novel optimization objective guided by the principles of sufficiency and independence. Extensive experiments demonstrate that our method outperforms baseline models, particularly in cold start and out-of-distribution settings. Besides, CausalDDS effectively identifies key substructures underlying drug synergy, providing clear insights into how drug combinations work at the molecular level. These results underscore the potential of CausalDDS as a practical tool for predicting drug synergy and facilitating drug discovery.
☆ QuPCG: Quantum Convolutional Neural Network for Detecting Abnormal Patterns in PCG Signals
Early identification of abnormal physiological patterns is essential for the timely detection of cardiac disease. This work introduces a hybrid quantum-classical convolutional neural network (QCNN) designed to classify S3 and murmur abnormalities in heart sound signals. The approach transforms one-dimensional phonocardiogram (PCG) signals into compact two-dimensional images through a combination of wavelet feature extraction and adaptive threshold compression methods. We compress the cardiac-sound patterns into an 8-pixel image so that only 8 qubits are needed for the quantum stage. Preliminary results on the HLS-CMDS dataset demonstrate 93.33% classification accuracy on the test set and 97.14% on the train set, suggesting that quantum models can efficiently capture temporal-spectral correlations in biomedical signals. To our knowledge, this is the first application of a QCNN algorithm for bioacoustic signal processing. The proposed method represents an early step toward quantum-enhanced diagnostic systems for resource-constrained healthcare environments.
☆ DoFlow: Causal Generative Flows for Interventional and Counterfactual Time-Series Prediction
Time-series forecasting increasingly demands not only accurate observational predictions but also causal forecasting under interventional and counterfactual queries in multivariate systems. We present DoFlow, a flow based generative model defined over a causal DAG that delivers coherent observational and interventional predictions, as well as counterfactuals through the natural encoding and decoding mechanism of continuous normalizing flows (CNFs). We also provide a supporting counterfactual recovery result under certain assumptions. Beyond forecasting, DoFlow provides explicit likelihoods of future trajectories, enabling principled anomaly detection. Experiments on synthetic datasets with various causal DAG and real world hydropower and cancer treatment time series show that DoFlow achieves accurate system-wide observational forecasting, enables causal forecasting over interventional and counterfactual queries, and effectively detects anomalies. This work contributes to the broader goal of unifying causal reasoning and generative modeling for complex dynamical systems.
♻ ☆ Why and When Deep is Better than Shallow: An Implementation-Agnostic State-Transition View of Depth Supremacy
Why and when is deep better than shallow? We answer this question in a framework that is agnostic to network implementation. We formulate a deep model as an abstract state-transition semigroup acting on a general metric space, and separate the implementation (e.g., ReLU nets, transformers, and chain-of-thought) from the abstract state transition. We prove a bias-variance decomposition in which the variance depends only on the abstract depth-$k$ network and not on the implementation (Theorem 1). We further split the bounds into output and hidden parts to tie the depth dependence of the variance to the metric entropy of the state-transition semigroup (Theorem 2). We then investigate implementation-free conditions under which the variance grow polynomially or logarithmically with depth (Section 4). Combining these with exponential or polynomial bias decay identifies four canonical bias-variance trade-off regimes (EL/EP/PL/PP) and produces explicit optimal depths $k^\ast$. Across regimes, $k^\ast>1$ typically holds, giving a rigorous form of depth supremacy. The lowest generalization error bound is achieved under the EL regime (exp-decay bias + log-growth variance), explaining why and when deep is better, especially for iterative or hierarchical concept classes such as neural ODEs, diffusion/score-matching models, and chain-of-thought reasoning.
♻ ☆ Imagine Beyond! Distributionally Robust Auto-Encoding for State Space Coverage in Online Reinforcement Learning
Goal-Conditioned Reinforcement Learning (GCRL) enables agents to autonomously acquire diverse behaviors, but faces major challenges in visual environments due to high-dimensional, semantically sparse observations. In the online setting, where agents learn representations while exploring, the latent space evolves with the agent's policy, to capture newly discovered areas of the environment. However, without incentivization to maximize state coverage in the representation, classical approaches based on auto-encoders may converge to latent spaces that over-represent a restricted set of states frequently visited by the agent. This is exacerbated in an intrinsic motivation setting, where the agent uses the distribution encoded in the latent space to sample the goals it learns to master. To address this issue, we propose to progressively enforce distributional shifts towards a uniform distribution over the full state space, to ensure a full coverage of skills that can be learned in the environment. We introduce DRAG (Distributionally Robust Auto-Encoding for GCRL), a method that combines the $\beta$-VAE framework with Distributionally Robust Optimization. DRAG leverages an adversarial neural weighter of training states of the VAE, to account for the mismatch between the current data distribution and unseen parts of the environment. This allows the agent to construct semantically meaningful latent spaces beyond its immediate experience. Our approach improves state space coverage and downstream control performance on hard exploration environments such as mazes and robotic control involving walls to bypass, without pre-training nor prior environment knowledge.
♻ ☆ Gene Regulatory Network Inference in the Presence of Selection Bias and Latent Confounders
Gene regulatory network inference (GRNI) aims to discover how genes causally regulate each other from gene expression data. It is well-known that statistical dependencies in observed data do not necessarily imply causation, as spurious dependencies may arise from latent confounders, such as non-coding RNAs. Numerous GRNI methods have thus been proposed to address this confounding issue. However, dependencies may also result from selection--only cells satisfying certain survival or inclusion criteria are observed--while these selection-induced spurious dependencies are frequently overlooked in gene expression data analyses. In this work, we show that such selection is ubiquitous and, when ignored or conflated with true regulations, can lead to flawed causal interpretation and misguided intervention recommendations. To address this challenge, a fundamental question arises: can we distinguish dependencies due to regulation, confounding, and crucially, selection? We show that gene perturbations offer a simple yet effective answer: selection-induced dependencies are symmetric under perturbation, while those from regulation or confounding are not. Building on this motivation, we propose GISL (Gene regulatory network Inference in the presence of Selection bias and Latent confounders), a principled algorithm that leverages perturbation data to uncover both true gene regulatory relations and non-regulatory mechanisms of selection and confounding up to the equivalence class. Experiments on synthetic and real-world gene expression data demonstrate the effectiveness of our method.
♻ ☆ Hybrid Quantum-Classical Recurrent Neural Networks
We present a hybrid quantum-classical recurrent neural network (QRNN) architecture in which the recurrent core is realized as a parametrized quantum circuit (PQC) controlled by a classical feedforward network. The hidden state is the quantum state of an $n$-qubit PQC in an exponentially large Hilbert space $\mathbb{C}^{2^n}$, which serves as a coherent recurrent quantum memory. The PQC is unitary by construction, making the hidden-state evolution norm-preserving without external constraints. At each timestep, mid-circuit Pauli expectation-value readouts are combined with the input embedding and processed by the feedforward network, which provides explicit classical nonlinearity. The outputs parametrize the PQC, which updates the hidden state via unitary dynamics. The QRNN is compact and physically consistent, and it unifies (i) unitary recurrence as a high-capacity memory, (ii) partial observation via mid-circuit readouts, and (iii) nonlinear classical control for input-conditioned parametrization. We evaluate the model in simulation with up to 14 qubits on sentiment analysis, MNIST, permuted MNIST, copying memory, and language modeling. For sequence-to-sequence learning, we further devise a soft attention mechanism over the mid-circuit readouts and show its effectiveness for machine translation. To our knowledge, this is the first model (RNN or otherwise) grounded in quantum operations to achieve competitive performance against strong classical baselines across a broad class of sequence-learning tasks.
comment: Clarified expectation-value-based readouts and made minor text edits
♻ ☆ DIsoN: Decentralized Isolation Networks for Out-of-Distribution Detection in Medical Imaging NeurIPS 2025
Safe deployment of machine learning (ML) models in safety-critical domains such as medical imaging requires detecting inputs with characteristics not seen during training, known as out-of-distribution (OOD) detection, to prevent unreliable predictions. Effective OOD detection after deployment could benefit from access to the training data, enabling direct comparison between test samples and the training data distribution to identify differences. State-of-the-art OOD detection methods, however, either discard the training data after deployment or assume that test samples and training data are centrally stored together, an assumption that rarely holds in real-world settings. This is because shipping the training data with the deployed model is usually impossible due to the size of training databases, as well as proprietary or privacy constraints. We introduce the Isolation Network, an OOD detection framework that quantifies the difficulty of separating a target test sample from the training data by solving a binary classification task. We then propose Decentralized Isolation Networks (DIsoN), which enables the comparison of training and test data when data-sharing is impossible, by exchanging only model parameters between the remote computational nodes of training and deployment. We further extend DIsoN with class-conditioning, comparing a target sample solely with training data of its predicted class. We evaluate DIsoN on four medical imaging datasets (dermatology, chest X-ray, breast ultrasound, histopathology) across 12 OOD detection tasks. DIsoN performs favorably against existing methods while respecting data-privacy. This decentralized OOD detection framework opens the way for a new type of service that ML developers could provide along with their models: providing remote, secure utilization of their training data for OOD detection services. Code: https://github.com/FelixWag/DIsoN
comment: Accepted at NeurIPS 2025
♻ ☆ Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive scaling paradigm, enabled by the principle of emergent semantics in Transformers with frozen, non-semantic input embeddings. We posit that because high-level meaning is a compositional property of a Transformer's deep layers, not its input vectors, the embedding layer and trained lower layers can serve as a fixed foundation. This liberates backpropagation to focus solely on newly added components, making incremental growth viable. We operationalize this with a layer-wise constructive methodology that combines strict layer freezing in early stages with efficient, holistic fine-tuning of the entire model stack via low-rank adaptation (LoRA) as complexity increases. This method not only demonstrates stable convergence but also reveals a direct correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD, which are absent in shallower models. In a controlled study, our constructively grown model rivals the performance of a monolithically trained baseline of the same size, validating the efficiency and efficacy of the approach. Our findings suggest a path towards a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development. This opens a path for more resource-efficient scaling, continual learning, and a more modular approach to building powerful AI systems. We release all code and models to facilitate further research.
comment: Controlled Comparative Study added
♻ ☆ Inducing Riesz and orthonormal bases in $L^2$ via composition operators
Let $C_h$ be a composition operator mapping $L^2(\Omega_1)$ into $L^2(\Omega_2)$ for some open sets $\Omega_1, \Omega_2 \subseteq \mathbb{R}^n$. We characterize the mappings $h$ that transform Riesz bases of $L^2(\Omega_1)$ into Riesz bases of $L^2(\Omega_2)$. Restricting our analysis to differentiable mappings, we demonstrate that mappings $h$ that preserve Riesz bases have Jacobian determinants that are bounded away from zero and infinity. We discuss implications of these results for approximation theory, highlighting the potential of using bijective neural networks to construct Riesz bases with favorable approximation properties.
♻ ☆ Efficient Learning of Quantum States Prepared With Few Non-Clifford Gates
We give a pair of algorithms that efficiently learn a quantum state prepared by Clifford gates and $O(\log n)$ non-Clifford gates. Specifically, for an $n$-qubit state $|\psi\rangle$ prepared with at most $t$ non-Clifford gates, our algorithms use $\mathsf{poly}(n,2^t,1/\varepsilon)$ time and copies of $|\psi\rangle$ to learn $|\psi\rangle$ to trace distance at most $\varepsilon$. The first algorithm for this task is more efficient, but requires entangled measurements across two copies of $|\psi\rangle$. The second algorithm uses only single-copy measurements at the cost of polynomial factors in runtime and sample complexity. Our algorithms more generally learn any state with sufficiently large stabilizer dimension, where a quantum state has stabilizer dimension $k$ if it is stabilized by an abelian group of $2^k$ Pauli operators. We also develop an efficient property testing algorithm for stabilizer dimension, which may be of independent interest.
comment: 54 pages. v4 to be published in Quantum journal. Merged v3 with arXiv:2308.07175. This version now subsumes arXiv:2308.07175
♻ ☆ Gradient GA: Gradient Genetic Algorithm for Drug Molecular Design
Molecular discovery has brought great benefits to the chemical industry. Various molecule design techniques are developed to identify molecules with desirable properties. Traditional optimization methods, such as genetic algorithms, continue to achieve state-of-the-art results across multiple molecular design benchmarks. However, these techniques rely solely on random walk exploration, which hinders both the quality of the final solution and the convergence speed. To address this limitation, we propose a novel approach called Gradient Genetic Algorithm (Gradient GA), which incorporates gradient information from the objective function into genetic algorithms. Instead of random exploration, each proposed sample iteratively progresses toward an optimal solution by following the gradient direction. We achieve this by designing a differentiable objective function parameterized by a neural network and utilizing the Discrete Langevin Proposal to enable gradient guidance in discrete molecular spaces. Experimental results demonstrate that our method significantly improves both convergence speed and solution quality, outperforming cutting-edge techniques. For example, it achieves up to a 25% improvement in the top-10 score over the vanilla genetic algorithm. The code is publicly available at https://github.com/debadyuti23/GradientGA.
♻ ☆ Position: Bridge the Gaps between Machine Unlearning and AI Regulation NeurIPS 2025
The ''right to be forgotten'' and the data privacy laws that encode it have motivated machine unlearning since its earliest days. Now, some argue that an inbound wave of artificial intelligence regulations -- like the European Union's Artificial Intelligence Act (AIA) -- may offer important new use cases for machine unlearning. However, this position paper argues, this opportunity will only be realized if researchers proactively bridge the (sometimes sizable) gaps between machine unlearning's state of the art and its potential applications to AI regulation. To demonstrate this point, we use the AIA as our primary case study. Specifically, we deliver a ``state of the union'' as regards machine unlearning's current potential (or, in many cases, lack thereof) for aiding compliance with various provisions of the AIA. This starts with a precise cataloging of the potential applications of machine unlearning to AIA compliance. For each, we flag the technical gaps that exist between the potential application and the state of the art of machine unlearning. Finally, we end with a call to action: for machine learning researchers to solve the open technical questions that could unlock machine unlearning's potential to assist compliance with the AIA -- and other AI regulations like it.
comment: NeurIPS 2025 Position Paper Track Oral, https://openreview.net/forum?id=0ngi2StMwC
♻ ☆ Estimation of aboveground biomass in a tropical dry forest: An intercomparison of airborne, unmanned, and space laser scanning
According to the Paris Climate Change Agreement, all nations are required to submit reports on their greenhouse gas emissions and absorption every two years by 2024. Consequently, forests play a crucial role in reducing carbon emissions, which is essential for meeting these obligations. Recognizing the significance of forest conservation in the global battle against climate change, Article 5 of the Paris Agreement emphasizes the need for high-quality forest data. This study focuses on enhancing methods for mapping aboveground biomass in tropical dry forests. Tropical dry forests are considered one of the least understood tropical forest environments; therefore, there is a need for accurate approaches to estimate carbon pools. We employ a comparative analysis of AGB estimates, utilizing different discrete and full-waveform laser scanning datasets in conjunction with Ordinary Least Squares and Bayesian approaches SVM. Airborne Laser Scanning, Unmanned Laser Scanning, and Space Laser Scanning were used as independent variables for extracting forest metrics. Variable selection, SVM regression tuning, and cross-validation via a machine-learning approach were applied to account for overfitting and underfitting. The results indicate that six key variables primarily related to tree height: Elev\.minimum, Elev\.L3, lev\.MAD\.mode, Elev\.mode, Elev\.MAD\.median, and Elev\.skewness, are important for AGB estimation using ALSD and ULSD, while Leaf Area Index, canopy coverage and height, terrain elevation, and full-waveform signal energy emerged as the most vital variables. AGB values estimated from ten permanent tropical dry forest plots in Costa Rica Guanacaste province ranged from 26.02 Mg/ha to 175.43 Mg/ha. The SVM regressions demonstrated a 17.89 error across all laser scanning systems, with SLSF W exhibiting the lowest error 17.07 in estimating total biomass per plot.
comment: 32 pages, 17 figures, research paper
♻ ☆ A Comparative Analysis of LLM Adaptation: SFT, LoRA, and ICL in Data-Scarce Scenarios
The remarkable capabilities of Large Language Models (LLMs) often need to be tailored for specific applications, requiring the integration of new knowledge or the acquisition of new skills. While full fine-tuning is a powerful adaptation method, it is computationally expensive and can lead to a degradation of general reasoning abilities, a phenomenon known as catastrophic forgetting. A range of alternative techniques exists, each with its own trade-offs. In-Context Learning (ICL) is fast but limited by context length, while Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) offer a middle ground by minimizing parameter changes. However, the challenge of catastrophic forgetting persists, raising questions about the best adaptation strategy for a given task. This paper presents a comparative analysis of Supervised Finetuning (SFT), LoRA, and ICL in data-scarce scenarios. We find that LoRA provides the most effective balance, successfully instilling new skills with minimal impact on the base model's general knowledge. In contrast, while SFT excels at skill acquisition, it is highly susceptible to catastrophic forgetting. ICL is effective for incorporating factual knowledge but struggles with complex skills. Our findings offer a practical framework for selecting an LLM adaptation strategy. We highlight the critical distinction between skill acquisition and knowledge integration, clarify the trade-offs between task-specific performance and the preservation of general capabilities.
♻ ☆ Bayesian Optimization by Kernel Regression and Density-based Exploration
Bayesian optimization is highly effective for optimizing expensive-to-evaluate black-box functions, but it faces significant computational challenges due to the high computational complexity of Gaussian processes, which results in a total time complexity that is quartic with respect to the number of iterations. To address this limitation, we propose the Bayesian Optimization by Kernel regression and density-based Exploration (BOKE) algorithm. BOKE uses kernel regression for efficient function approximation, kernel density for exploration, and integrates them into the confidence bound criteria to guide the optimization process, thus reducing computational costs to quadratic. Our theoretical analysis rigorously establishes the global convergence of BOKE and ensures its robustness in noisy settings. Through extensive numerical experiments on both synthetic and real-world optimization tasks, we demonstrate that BOKE not only performs competitively compared to Gaussian process-based methods and several other baseline methods but also exhibits superior computational efficiency. These results highlight BOKE's effectiveness in resource-constrained environments, providing a practical approach for optimization problems in engineering applications.
♻ ☆ Realizable Circuit Complexity: Embedding Computation in Space-Time
Classical circuit complexity characterizes parallel computation in purely combinatorial terms, ignoring the physical constraints that govern real hardware. The standard classes $\mathbf{NC}$, $\mathbf{AC}$, and $\mathbf{TC}$ treat unlimited fan-in, free interconnection, and polynomial gate counts as feasible -- assumptions that conflict with geometric, energetic, and thermodynamic realities. We introduce the family of \textit{realizable circuit classes} $\mathbf{RC}_d$, which model computation embedded in physical $d$-dimensional space. Each circuit in $\mathbf{RC}_d$ obeys conservative realizability laws: volume scales as $\mathcal{O}(t^d)$, cross-boundary information flux is bounded by $\mathcal{O}(t^{d-1})$ per unit time, and growth occurs through local, physically constructible edits. These bounds apply to all causal systems, classical or quantum. Within this framework, we show that algorithms with runtime $\omega(n^{d/(d-1)})$ cannot scale to inputs of maximal entropy, and that any $d$-dimensional parallel implementation offers at most a polynomial speed-up of degree $(d-1)$ over its optimal sequential counterpart. In the limit $d\to\infty$, $\mathbf{RC}_\infty(\mathrm{polylog})=\mathbf{NC}$, recovering classical parallelism as a non-physical idealization. By unifying geometry, causality, and information flow, $\mathbf{RC}_d$ extends circuit complexity into the physical domain, revealing universal scaling laws for computation.
comment: 38 pages
♻ ☆ The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs
Coral reefs are declining worldwide due to climate change and local stressors. To inform effective conservation or restoration, monitoring at the highest possible spatial and temporal resolution is necessary. Conventional coral reef surveying methods are limited in scalability due to their reliance on expert labor time, motivating the use of computer vision tools to automate the identification and abundance estimation of live corals from images. However, the design and evaluation of such tools has been impeded by the lack of large high quality datasets. We release the Coralscapes dataset, the first general-purpose dense semantic segmentation dataset for coral reefs, covering 2075 images, 39 benthic classes, and 174k segmentation masks annotated by experts. Coralscapes has a similar scope and the same structure as the widely used Cityscapes dataset for urban scene segmentation, allowing benchmarking of semantic segmentation models in a new challenging domain which requires expert knowledge to annotate. We benchmark a wide range of semantic segmentation models, and find that transfer learning from Coralscapes to existing smaller datasets consistently leads to state-of-the-art performance. Coralscapes will catalyze research on efficient, scalable, and standardized coral reef surveying methods based on computer vision, and holds the potential to streamline the development of underwater ecological robotics.
♻ ☆ Image Super-Resolution with Guarantees via Conformalized Generative Models NeurIPS 2025
The increasing use of generative ML foundation models for image restoration tasks such as super-resolution calls for robust and interpretable uncertainty quantification methods. We address this need by presenting a novel approach based on conformal prediction techniques to create a 'confidence mask' capable of reliably and intuitively communicating where the generated image can be trusted. Our method is adaptable to any black-box generative model, including those locked behind an opaque API, requires only easily attainable data for calibration, and is highly customizable via the choice of a local image similarity metric. We prove strong theoretical guarantees for our method that span fidelity error control (according to our local image similarity metric), reconstruction quality, and robustness in the face of data leakage. Finally, we empirically evaluate these results and establish our method's solid performance.
comment: To appear at NeurIPS 2025. 17 pages, 7 figures
♻ ☆ Noise-based reward-modulated learning
The pursuit of energy-efficient and adaptive artificial intelligence (AI) has positioned neuromorphic computing as a promising alternative to conventional computing. However, achieving learning on these platforms requires techniques that prioritize local information while enabling effective credit assignment. Here, we propose noise-based reward-modulated learning (NRL), a novel synaptic plasticity rule that mathematically unifies reinforcement learning and gradient-based optimization with biologically-inspired local updates. NRL addresses the computational bottleneck of exact gradients by approximating them through stochastic neural activity, transforming the inherent noise of biological and neuromorphic substrates into a functional resource. Drawing inspiration from biological learning, our method uses reward prediction errors as its optimization target to generate increasingly advantageous behavior, and eligibility traces to facilitate retrospective credit assignment. Experimental validation on reinforcement tasks, featuring immediate and delayed rewards, shows that NRL achieves performance comparable to baselines optimized using backpropagation, although with slower convergence, while showing significantly superior performance and scalability in multi-layer networks compared to reward-modulated Hebbian learning (RMHL), the most prominent similar approach. While tested on simple architectures, the results highlight the potential of noise-driven, brain-inspired learning for low-power adaptive systems, particularly in computing substrates with locality constraints. NRL offers a theoretically grounded paradigm well-suited for the event-driven characteristics of next-generation neuromorphic AI.
♻ ☆ Efficient Latent Variable Causal Discovery: Combining Score Search and Targeted Testing
Learning causal structure from observational data is especially challenging when latent variables or selection bias are present. The Fast Causal Inference (FCI) algorithm addresses this setting but often performs exhaustive conditional independence tests across many subsets, leading to spurious independence claims, extra or missing edges, and unreliable orientations. We present a family of score-guided mixed-strategy causal search algorithms that build on this tradition. First, we introduce BOSS-FCI and GRaSP-FCI, straightforward variants of GFCI that substitute BOSS or GRaSP for FGES, thereby retaining correctness while incurring different scalability tradeoffs. Second, we develop FCI Targeted-testing (FCIT), a novel mixed-strategy method that improves upon these variants by replacing exhaustive all-subsets testing with targeted tests guided by BOSS, yielding well-formed PAGs with higher precision and efficiency. Finally, we propose a simple heuristic, LV-Dumb (also known as BOSS-POD), which bypasses latent-variable-specific reasoning and directly returns the PAG of the BOSS DAG. Although not strictly correct in the FCI sense, it scales better and often achieves superior accuracy in practice. Simulations and real-data analyses demonstrate that BOSS-FCI and GRaSP-FCI provide sound baselines, FCIT improves both efficiency and reliability, and LV-Dumb offers a practical heuristic with strong empirical performance. Together, these method highlight the value of score-guided and targeted strategies for scalable latent-variable causal discovery.
comment: 30 pages, 44 figures, 6 tables
♻ ☆ Linear-Time Demonstration Selection for In-Context Learning via Gradient Estimation EMNLP'25
This paper introduces an algorithm to select demonstration examples for in-context learning of a query set. Given a set of $n$ examples, how can we quickly select $k$ out of $n$ to best serve as the conditioning for downstream inference? This problem has broad applications in prompt tuning and chain-of-thought reasoning. Since model weights remain fixed during in-context learning, previous work has sought to design methods based on the similarity of token embeddings. This work proposes a new approach based on gradients of the output taken in the input embedding space. Our approach estimates model outputs through a first-order approximation using the gradients. Then, we apply this estimation to multiple randomly sampled subsets. Finally, we aggregate the sampled subset outcomes to form an influence score for each demonstration, and select $k$ most relevant examples. This procedure only requires pre-computing model outputs and gradients once, resulting in a linear-time algorithm relative to model and training set sizes. Extensive experiments across various models and datasets validate the efficiency of our approach. We show that the gradient estimation procedure yields approximations of full inference with less than ${1}\%$ error across six datasets. This allows us to scale up subset selection that would otherwise run full inference by up to ${37.7}\times$ on models with up to $34$ billion parameters, and outperform existing selection methods based on input embeddings by ${11}\%$ on average.
comment: 19 pages. EMNLP'25
♻ ☆ Universal Sequence Preconditioning
We study the problem of preconditioning in sequential prediction. From the theoretical lens of linear dynamical systems, we show that convolving the target sequence corresponds to applying a polynomial to the hidden transition matrix. Building on this insight, we propose a universal preconditioning method that convolves the target with coefficients from orthogonal polynomials such as Chebyshev or Legendre. We prove that this approach reduces regret for two distinct prediction algorithms and yields the first ever sublinear and hidden-dimension-independent regret bounds (up to logarithmic factors) that hold for systems with marginally table and asymmetric transition matrices. Finally, extensive synthetic and real-world experiments show that this simple preconditioning strategy improves the performance of a diverse range of algorithms, including recurrent neural networks, and generalizes to signals beyond linear dynamical systems.
comment: 35 pages, 3 tables, 5 figures
♻ ☆ Discrete and Continuous Difference of Submodular Minimization
Submodular functions, defined on continuous or discrete domains, arise in numerous applications. We study the minimization of the difference of two submodular (DS) functions, over both domains, extending prior work restricted to set functions. We show that all functions on discrete domains and all smooth functions on continuous domains are DS. For discrete domains, we observe that DS minimization is equivalent to minimizing the difference of two convex (DC) functions, as in the set function case. We propose a novel variant of the DC Algorithm (DCA) and apply it to the resulting DC Program, obtaining comparable theoretical guarantees as in the set function case. The algorithm can be applied to continuous domains via discretization. Experiments demonstrate that our method outperforms baselines in integer compressive sensing and integer least squares.
♻ ☆ Repetitions are not all alike: distinct mechanisms sustain repetition in language models
Large Language Models (LLMs) can sometimes degrade into repetitive loops, persistently generating identical word sequences. Because repetition is rare in natural human language, its frequent occurrence across diverse tasks and contexts in LLMs remains puzzling. Here we investigate whether behaviorally similar repetition patterns arise from distinct underlying mechanisms and how these mechanisms develop during model training. We contrast two conditions: repetitions elicited by natural text prompts with those induced by in-context learning (ICL) setups that explicitly require copying behavior. Our analyses reveal that ICL-induced repetition relies on a dedicated network of attention heads that progressively specialize over training, whereas naturally occurring repetition emerges early and lacks a defined circuitry. Attention inspection further shows that natural repetition focuses disproportionately on low-information tokens, suggesting a fallback behavior when relevant context cannot be retrieved. These results indicate that superficially similar repetition behaviors originate from qualitatively different internal processes, reflecting distinct modes of failure and adaptation in language models.
♻ ☆ Harnessing IoT and Generative AI for Weather-Adaptive Learning in Climate Resilience Education
This paper introduces the Future Atmospheric Conditions Training System (FACTS), a novel platform that advances climate resilience education through place-based, adaptive learning experiences. FACTS combines real-time atmospheric data collected by IoT sensors with curated resources from a Knowledge Base to dynamically generate localized learning challenges. Learner responses are analyzed by a Generative AI powered server, which delivers personalized feedback and adaptive support. Results from a user evaluation indicate that participants found the system both easy to use and effective for building knowledge related to climate resilience. These findings suggest that integrating IoT and Generative AI into atmospherically adaptive learning technologies holds significant promise for enhancing educational engagement and fostering climate awareness.
comment: Not enough evidence to prove the effectiveness of the system in the context of learning about climate change
♻ ☆ FORTALESA: Fault-Tolerant Reconfigurable Systolic Array for DNN Inference
The emergence of Deep Neural Networks (DNNs) in mission- and safety-critical applications brings their reliability to the front. High performance demands of DNNs require the use of specialized hardware accelerators. Systolic array architecture is widely used in DNN accelerators due to its parallelism and regular structure. This work presents a run-time reconfigurable systolic array architecture with three execution modes and four implementation options. All four implementations are evaluated in terms of resource utilization, throughput, and fault tolerance improvement. The proposed architecture is used for reliability enhancement of DNN inference on systolic array through heterogeneous mapping of different network layers to different execution modes. The approach is supported by a novel reliability assessment method based on fault propagation analysis. It is used for the exploration of the appropriate execution mode--layer mapping for DNN inference. The proposed architecture efficiently protects registers and MAC units of systolic array PEs from transient and permanent faults. The reconfigurability feature enables a speedup of up to $3\times$, depending on layer vulnerability. Furthermore, it requires $6\times$ fewer resources compared to static redundancy and $2.5\times$ fewer resources compared to the previously proposed solution for transient faults.
comment: 13 pages, 15 figures
♻ ☆ PyDPF: A Python Package for Differentiable Particle Filtering
State-space models (SSMs) are a widely used tool in time series analysis. In the complex systems that arise from real-world data, it is common to employ particle filtering (PF), an efficient Monte Carlo method for estimating the hidden state corresponding to a sequence of observations. Applying particle filtering requires specifying both the parametric form and the parameters of the system, which are often unknown and must be estimated. Gradient-based optimisation techniques cannot be applied directly to standard particle filters, as the filters themselves are not differentiable. However, several recently proposed methods modify the resampling step to make particle filtering differentiable. In this paper, we present an implementation of several such differentiable particle filters (DPFs) with a unified API built on the popular PyTorch framework. Our implementation makes these algorithms easily accessible to a broader research community and facilitates straightforward comparison between them. We validate our framework by reproducing experiments from several existing studies and demonstrate how DPFs can be applied to address several common challenges with state space modelling.
comment: 42 pages, 0 figures, under review at the Journal of Statistical Software, the python package can be found at https://pypi.org/project/pydpf/ , the full documentation at https://python-dpf.readthedocs.io/en/latest/#documentation-index , and the source code including experiments and replication material at https://github.com/John-JoB/pydpf
♻ ☆ AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench
AI research agents are demonstrating great potential to accelerate scientific progress by automating the design, implementation, and training of machine learning models. We focus on methods for improving agents' performance on MLE-bench, a challenging benchmark where agents compete in Kaggle competitions to solve real-world machine learning problems. We formalize AI research agents as search policies that navigate a space of candidate solutions, iteratively modifying them using operators. By designing and systematically varying different operator sets and search policies (Greedy, MCTS, Evolutionary), we show that their interplay is critical for achieving high performance. Our best pairing of search strategy and operator set achieves a state-of-the-art result on MLE-bench lite, increasing the success rate of achieving a Kaggle medal from 39.6% to 47.7%. Our investigation underscores the importance of jointly considering the search strategy, operator design, and evaluation methodology in advancing automated machine learning.
comment: Code: https://github.com/facebookresearch/aira-dojo
♻ ☆ Multiscale spatiotemporal heterogeneity analysis of bike-sharing system's self-loop phenomenon: Evidence from Shanghai
Bike-sharing is an environmentally friendly shared mobility mode, but its self-loop phenomenon, where bikes are returned to the same station after several time usage, significantly impacts equity in accessing its services. Therefore, this study conducts a multiscale analysis with a spatial autoregressive model and double machine learning framework to assess socioeconomic features and geospatial location's impact on the self-loop phenomenon at metro stations and street scales. The results reveal that bike-sharing self-loop intensity exhibits significant spatial lag effect at street scale and is positively associated with residential land use. Marginal treatment effects of residential land use is higher on streets with middle-aged residents, high fixed employment, and low car ownership. The multimodal public transit condition reveals significant positive marginal treatment effects at both scales. To enhance bike-sharing cooperation, we advocate augmenting bicycle availability in areas with high metro usage and low bus coverage, alongside implementing adaptable redistribution strategies.
comment: Critical OD data calibration errors in Sections 3.2/4.1 (invalidating indices, undermining conclusions), planning a revised validated version
♻ ☆ Contrastive Consolidation of Top-Down Modulations Achieves Sparsely Supervised Continual Learning NeurIPS 2025
Biological brains learn continually from a stream of unlabeled data, while integrating specialized information from sparsely labeled examples without compromising their ability to generalize. Meanwhile, machine learning methods are susceptible to catastrophic forgetting in this natural learning setting, as supervised specialist fine-tuning degrades performance on the original task. We introduce task-modulated contrastive learning (TMCL), which takes inspiration from the biophysical machinery in the neocortex, using predictive coding principles to integrate top-down information continually and without supervision. We follow the idea that these principles build a view-invariant representation space, and that this can be implemented using a contrastive loss. Then, whenever labeled samples of a new class occur, new affine modulations are learned that improve separation of the new class from all others, without affecting feedforward weights. By co-opting the view-invariance learning mechanism, we then train feedforward weights to match the unmodulated representation of a data sample to its modulated counterparts. This introduces modulation invariance into the representation space, and, by also using past modulations, stabilizes it. Our experiments show improvements in both class-incremental and transfer learning over state-of-the-art unsupervised approaches, as well as over comparable supervised approaches, using as few as 1% of available labels. Taken together, our work suggests that top-down modulations play a crucial role in balancing stability and plasticity.
comment: Accepted to NeurIPS 2025. Camera-ready version. 33 pages, 5 figures. Code available at: https://github.com/tran-khoa/tmcl
♻ ☆ Electrical Load Forecasting over Multihop Smart Metering Networks with Federated Learning
Electric load forecasting is essential for power management and stability in smart grids. This is mainly achieved via advanced metering infrastructure, where smart meters (SMs) record household energy data. Traditional machine learning (ML) methods are often employed for load forecasting, but require data sharing, which raises data privacy concerns. Federated learning (FL) can address this issue by running distributed ML models at local SMs without data exchange. However, current FL-based approaches struggle to achieve efficient load forecasting due to imbalanced data distribution across heterogeneous SMs. This paper presents a novel personalized federated learning (PFL) method for high-quality load forecasting in metering networks. A meta-learning-based strategy is developed to address data heterogeneity at local SMs in the collaborative training of local load forecasting models. Moreover, to minimize the load forecasting delays in our PFL model, we study a new latency optimization problem based on optimal resource allocation at SMs. A theoretical convergence analysis is also conducted to provide insights into FL design for federated load forecasting. Extensive simulations from real-world datasets show that our method outperforms existing approaches regarding better load forecasting and reduced operational latency costs.
♻ ☆ Lower-dimensional projections of cellular expression improves cell type classification from single-cell RNA sequencing
Single-cell RNA sequencing (scRNA-seq) enables the study of cellular diversity at single cell level. It provides a global view of cell-type specification during the onset of biological mechanisms such as developmental processes and human organogenesis. Various statistical, machine and deep learning-based methods have been proposed for cell-type classification. Most of the methods utilizes unsupervised lower dimensional projections obtained from for a large reference data. In this work, we proposed a reference-based method for cell type classification, called EnProCell. The EnProCell, first, computes lower dimensional projections that capture both the high variance and class separability through an ensemble of principle component analysis and multiple discriminant analysis. In the second phase, EnProCell trains a deep neural network on the lower dimensional representation of data to classify cell types. The proposed method outperformed the existing state-of-the-art methods when tested on four different data sets produced from different single-cell sequencing technologies. The EnProCell showed higher accuracy (98.91) and F1 score (98.64) than other methods for predicting reference from reference datasets. Similarly, EnProCell also showed better performance than existing methods in predicting cell types for data with unknown cell types (query) from reference datasets (accuracy:99.52; F1 score: 99.07). In addition to improved performance, the proposed methodology is simple and does not require more computational resources and time. the EnProCell is available at https://github.com/umar1196/EnProCell.
♻ ☆ Scaffolded Language Models with Language Supervision for Mixed-Autonomy: A Survey
This survey organizes the intricate literature on the design and optimization of emerging structures around post-trained LMs. We refer to this overarching structure as scaffolded LMs and focus on LMs that are integrated into multi-step processes with tools. We view scaffolded LMs as semi-parametric models wherein we train non-parametric variables, including the prompt, tools, and scaffold's code. In particular, they interpret instructions, use tools, and receive feedback all in language. Recent works use an LM as an optimizer to interpret language supervision and update non-parametric variables according to intricate objectives. In this survey, we refer to this paradigm as training of scaffolded LMs with language supervision. A key feature of non-parametric training is the ability to learn from language. Parametric training excels in learning from demonstration (supervised learning), exploration (reinforcement learning), or observations (unsupervised learning), using well-defined loss functions. Language-based optimization enables rich, interpretable, and expressive objectives, while mitigating issues like catastrophic forgetting and supporting compatibility with closed-source models. Furthermore, agents are increasingly deployed as co-workers in real-world applications such as Copilot in Office tools or software development. In these mixed-autonomy settings, where control and decision-making are shared between human and AI, users point out errors or suggest corrections. Accordingly, we discuss agents that continuously improve by learning from this real-time, language-based feedback and refer to this setting as streaming learning from language supervision.
♻ ☆ Training on Plausible Counterfactuals Removes Spurious Correlations
Plausible counterfactual explanations (p-CFEs) are perturbations that minimally modify inputs to change classifier decisions while remaining plausible under the data distribution. In this study, we demonstrate that classifiers can be trained on p-CFEs labeled with induced \emph{incorrect} target classes to classify unperturbed inputs with the original labels. While previous studies have shown that such learning is possible with adversarial perturbations, we extend this paradigm to p-CFEs. Interestingly, our experiments reveal that learning from p-CFEs is even more effective: the resulting classifiers achieve not only high in-distribution accuracy but also exhibit significantly reduced bias with respect to spurious correlations.
♻ ☆ From Uniform to Adaptive: General Skip-Block Mechanisms for Efficient PDE Neural Operators
In recent years, Neural Operators(NO) have gradually emerged as a popular approach for solving Partial Differential Equations (PDEs). However, their application to large-scale engineering tasks suffers from significant computational overhead. And the fact that current models impose a uniform computational cost while physical fields exhibit vastly different complexities constitutes a fundamental mismatch, which is the root of this inefficiency. For instance, in turbulence flows, intricate vortex regions require deeper network processing compared to stable flows. To address this, we introduce a framework: Skip-Block Routing (SBR), a general framework designed for Transformer-based neural operators, capable of being integrated into their multi-layer architectures. First, SBR uses a routing mechanism to learn the complexity and ranking of tokens, which is then applied during inference. Then, in later layers, it decides how many tokens are passed forward based on this ranking. This way, the model focuses more processing capacity on the tokens that are more complex. Experiments demonstrate that SBR is a general framework that seamlessly integrates into various neural operators. Our method reduces computational cost by approximately 50% in terms of Floating Point Operations (FLOPs), while still delivering up to 2x faster inference without sacrificing accuracy.
♻ ☆ ABS: Enforcing Constraint Satisfaction On Generated Sequences Via Automata-Guided Beam Search
Sequence generation and prediction form a cornerstone of modern machine learning, with applications spanning natural language processing, program synthesis, and time-series forecasting. These tasks are typically modeled in an autoregressive fashion, where each token is generated conditional on the preceding ones, and beam search is commonly used to balance exploration and fluency during decoding. While deep learning models and Large Language Models (LLMs) excel at capturing statistical patterns in this setting, they remain ill-equipped to guarantee compliance with formal constraints. In this paper, we introduce ABS: a general and model-agnostic inference-time algorithm that guarantees compliance with any constraint that can be compiled into a Deterministic Finite Automaton (DFA), without requiring retraining. ABS leverages the DFA to guide a constrained variant of beam search: at each decoding step, transitions leading to violations are masked, while remaining paths are dynamically re-ranked according to both the model's probabilities and the automaton's acceptance structure. We formally prove that the resulting sequences are guaranteed to satisfy the given constraints, and we empirically demonstrate that ABS also improves output quality. We validate our approach on three distinct tasks: constrained image-stream classification, controlled text generation, and text infilling. In all settings, ABS achieves perfect constraint satisfaction, while outperforming or matching state-of-the-art baselines on standard quality metrics and efficiency.
♻ ☆ Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment SP
In this work, we rethink the approach to video super-resolution by introducing a method based on the Diffusion Posterior Sampling framework, combined with an unconditional video diffusion transformer operating in latent space. The video generation model, a diffusion transformer, functions as a space-time model. We argue that a powerful model, which learns the physics of the real world, can easily handle various kinds of motion patterns as prior knowledge, thus eliminating the need for explicit estimation of optical flows or motion parameters for pixel alignment. Furthermore, a single instance of the proposed video diffusion transformer model can adapt to different sampling conditions without re-training. Empirical results on synthetic and real-world datasets illustrate the feasibility of diffusion-based, alignment-free video super-resolution.
comment: ICSPS 2025
♻ ☆ Constrained Optimal Fuel Consumption of HEVs under Observational Noise
In our prior work, we investigated the minimum fuel consumption of a hybrid electric vehicle (HEV) under a state-of-charge (SOC) balance constraint, assuming perfect SOC measurements and accurate reference speed profiles. The constrained optimal fuel consumption (COFC) problem was addressed using a constrained reinforcement learning (CRL) framework. However, in real-world scenarios, SOC readings are often corrupted by sensor noise, and reference speeds may deviate from actual driving conditions. To account for these imperfections, this study reformulates the COFC problem by explicitly incorporating observational noise in both SOC and reference speed. We adopt a robust CRL approach, where the noise is modeled as a uniform distribution, and employ a structured training procedure to ensure stability. The proposed method is evaluated through simulations on the Toyota Prius hybrid system (THS), using both the New European Driving Cycle (NEDC) and the Worldwide Harmonized Light Vehicles Test Cycle (WLTC). Results show that fuel consumption and SOC constraint satisfaction remain robust across varying noise levels. Furthermore, the analysis reveals that observational noise in SOC and speed can impact fuel consumption to different extents. To the best of our knowledge, this is the first study to explicitly examine how observational noise -- commonly encountered in dynamometer testing and predictive energy control (PEC) applications -- affects constrained optimal fuel consumption in HEVs.
comment: Minor text and figure adjustments; no substantive changes
♻ ☆ OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data
Existing benchmarks for multimodal learning in Earth science offer limited, siloed coverage of Earth's spheres and their cross-sphere interactions, typically restricting evaluation to the human-activity sphere of atmosphere and to at most 16 tasks. These limitations: \textit{narrow-source heterogeneity (single/few data sources), constrained scientific granularity, and limited-sphere extensibility}. Therefore, we introduce \textbf{OmniEarth-Bench}, the first multimodal benchmark that systematically spans all six spheres: atmosphere, lithosphere, oceanosphere, cryosphere, biosphere, and human-activity sphere, and cross-spheres. Built with a scalable, modular-topology data inference framework and native multi-observation sources and expert-in-the-loop curation, OmniEarth-Bench produces 29,855 standardized, expert-curated annotations. All annotations are organized into a four-level hierarchy (Sphere, Scenario, Ability, Task), encompassing 109 expert-curated evaluation tasks. Experiments on 9 state-of-the-art MLLMs reveal that even the most advanced models struggle with our benchmarks, where none of them reach 35\% accuracy, revealing systematic gaps in Earth-system cognitive ability. The dataset and evaluation code were released at OmniEarth-Bench (https://anonymous.4open.science/r/OmniEarth-Bench-B1BD).
♻ ☆ Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models NeurIPS 2025
In recent years, diffusion models trained on equilibrium molecular distributions have proven effective for sampling biomolecules. Beyond direct sampling, the score of such a model can also be used to derive the forces that act on molecular systems. However, while classical diffusion sampling usually recovers the training distribution, the corresponding energy-based interpretation of the learned score is often inconsistent with this distribution, even for low-dimensional toy systems. We trace this inconsistency to inaccuracies of the learned score at very small diffusion timesteps, where the model must capture the correct evolution of the data distribution. In this regime, diffusion models fail to satisfy the Fokker--Planck equation, which governs the evolution of the score. We interpret this deviation as one source of the observed inconsistencies and propose an energy-based diffusion model with a Fokker--Planck-derived regularization term to enforce consistency. We demonstrate our approach by sampling and simulating multiple biomolecular systems, including fast-folding proteins, and by introducing a state-of-the-art transferable Boltzmann emulator for dipeptides that supports simulation and achieves improved consistency and efficient sampling. Our code, model weights, and self-contained JAX and PyTorch notebooks are available at https://github.com/noegroup/ScoreMD.
comment: Accepted at Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Astromer 2
Foundational models have emerged as a powerful paradigm in deep learning field, leveraging their capacity to learn robust representations from large-scale datasets and effectively to diverse downstream applications such as classification. In this paper, we present Astromer 2 a foundational model specifically designed for extracting light curve embeddings. We introduce Astromer 2 as an enhanced iteration of our self-supervised model for light curve analysis. This paper highlights the advantages of its pre-trained embeddings, compares its performance with that of its predecessor, Astromer 1, and provides a detailed empirical analysis of its capabilities, offering deeper insights into the model's representations. Astromer 2 is pretrained on 1.5 million single-band light curves from the MACHO survey using a self-supervised learning task that predicts randomly masked observations within sequences. Fine-tuning on a smaller labeled dataset allows us to assess its performance in classification tasks. The quality of the embeddings is measured by the F1 score of an MLP classifier trained on Astromer-generated embeddings. Our results demonstrate that Astromer 2 significantly outperforms Astromer 1 across all evaluated scenarios, including limited datasets of 20, 100, and 500 samples per class. The use of weighted per-sample embeddings, which integrate intermediate representations from Astromer's attention blocks, is particularly impactful. Notably, Astromer 2 achieves a 15% improvement in F1 score on the ATLAS dataset compared to prior models, showcasing robust generalization to new datasets. This enhanced performance, especially with minimal labeled data, underscores the potential of Astromer 2 for more efficient and scalable light curve analysis.
comment: 10 pages, 17 figures
♻ ☆ Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization
Continuous control of non-stationary environments is a major challenge for deep reinforcement learning algorithms. The time-dependency of the state transition dynamics aggravates the notorious stability problems of model-free deep actor-critic architectures. We posit that two properties will play a key role in overcoming non-stationarity in transition dynamics: (i)~preserving the plasticity of the critic network and (ii) directed exploration for rapid adaptation to changing dynamics. We show that performing on-policy reinforcement learning with an evidential critic provides both. The evidential design ensures a fast and accurate approximation of the uncertainty around the state value, which maintains the plasticity of the critic network by detecting the distributional shifts caused by changes in dynamics. The probabilistic critic also makes the actor training objective a random variable, enabling the use of directed exploration approaches as a by-product. We name the resulting algorithm \emph{Evidential Proximal Policy Optimization (EPPO)} due to the integral role of evidential uncertainty quantification in both policy evaluation and policy improvement stages. Through experiments on non-stationary continuous control tasks, where the environment dynamics change at regular intervals, we demonstrate that our algorithm outperforms state-of-the-art on-policy reinforcement learning variants in both task-specific and overall return.
♻ ☆ Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning NeurIPS 2025
Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing their accuracy. Empirical results show that Twilight can adaptively prune at most 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per token latency in long context LLM decoding.
comment: To appear on NeurIPS 2025 (spotlight)
♻ ☆ Prior-Guided Flow Matching for Target-Aware Molecule Design with Learnable Atom Number NeurIPS 2025
Structure-based drug design (SBDD), aiming to generate 3D molecules with high binding affinity toward target proteins, is a vital approach in novel drug discovery. Although recent generative models have shown great potential, they suffer from unstable probability dynamics and mismatch between generated molecule size and the protein pockets geometry, resulting in inconsistent quality and off-target effects. We propose PAFlow, a novel target-aware molecular generation model featuring prior interaction guidance and a learnable atom number predictor. PAFlow adopts the efficient flow matching framework to model the generation process and constructs a new form of conditional flow matching for discrete atom types. A protein-ligand interaction predictor is incorporated to guide the vector field toward higher-affinity regions during generation, while an atom number predictor based on protein pocket information is designed to better align generated molecule size with target geometry. Extensive experiments on the CrossDocked2020 benchmark show that PAFlow achieves a new state-of-the-art in binding affinity (up to -8.31 Avg. Vina Score), simultaneously maintains favorable molecular properties.
comment: Accepted by NeurIPS 2025
♻ ☆ MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems
The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third-a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce sparsity-aware performance metrics-Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU)-to enable accurate performance benchmarking of MoE systems across diverse hardware platforms and deployment scenarios.
♻ ☆ LoLaFL: Low-Latency Federated Learning via Forward-only Propagation
Federated learning (FL) has emerged as a widely adopted paradigm for enabling edge learning with distributed data while ensuring data privacy. However, the traditional FL with deep neural networks trained via backpropagation can hardly meet the low-latency learning requirements in the sixth generation (6G) mobile networks. This challenge mainly arises from the high-dimensional model parameters to be transmitted and the numerous rounds of communication required for convergence due to the inherent randomness of the training process. To address this issue, we adopt the state-of-the-art principle of maximal coding rate reduction to learn linear discriminative features and extend the resultant white-box neural network into FL, yielding the novel framework of Low-Latency Federated Learning (LoLaFL) via forward-only propagation. LoLaFL enables layer-wise transmissions and aggregation with significantly fewer communication rounds, thereby considerably reducing latency. Additionally, we propose two \emph{nonlinear} aggregation schemes for LoLaFL. The first scheme is based on the proof that the optimal NN parameter aggregation in LoLaFL should be harmonic-mean-like. The second scheme further exploits the low-rank structures of the features and transmits the low-rank-approximated covariance matrices of features to achieve additional latency reduction. Theoretic analysis and experiments are conducted to evaluate the performance of LoLaFL. In comparison with traditional FL, the two nonlinear aggregation schemes for LoLaFL can achieve reductions in latency of over 87\% and 97\%, respectively, while maintaining comparable accuracies.
comment: 16 pages, 10 figures
♻ ☆ Cross-modal Diffusion Modelling for Super-resolved Spatial Transcriptomics
The recent advancement of spatial transcriptomics (ST) allows to characterize spatial gene expression within tissue for discovery research. However, current ST platforms suffer from low resolution, hindering in-depth understanding of spatial gene expression. Super-resolution approaches promise to enhance ST maps by integrating histology images with gene expressions of profiled tissue spots. However, current super-resolution methods are limited by restoration uncertainty and mode collapse. Although diffusion models have shown promise in capturing complex interactions between multi-modal conditions, it remains a challenge to integrate histology images and gene expression for super-resolved ST maps. This paper proposes a cross-modal conditional diffusion model for super-resolving ST maps with the guidance of histology images. Specifically, we design a multi-modal disentangling network with cross-modal adaptive modulation to utilize complementary information from histology images and spatial gene expression. Moreover, we propose a dynamic cross-attention modelling strategy to extract hierarchical cell-to-tissue information from histology images. Lastly, we propose a co-expression-based gene-correlation graph network to model the co-expression relationship of multiple genes. Experiments show that our method outperforms other state-of-the-art methods in ST super-resolution on three public datasets.
♻ ☆ Language-Agnostic Modeling of Source Reliability on Wikipedia
Over the last few years, verifying the credibility of information sources has become a fundamental need to combat disinformation. Here, we present a language-agnostic model designed to assess the reliability of web domains as sources in references across multiple language editions of Wikipedia. Utilizing editing activity data, the model evaluates domain reliability within different articles of varying controversiality, such as Climate Change, COVID-19, History, Media, and Biology topics. Crafting features that express domain usage across articles, the model effectively predicts domain reliability, achieving an F1 Macro score of approximately 0.80 for English and other high-resource languages. For mid-resource languages, we achieve 0.65, while the performance of low-resource languages varies. In all cases, the time the domain remains present in the articles (which we dub as permanence) is one of the most predictive features. We highlight the challenge of maintaining consistent model performance across languages of varying resource levels and demonstrate that adapting models from higher-resource languages can improve performance. We believe these findings can assist Wikipedia editors in their ongoing efforts to verify citations and may offer useful insights for other user-generated content communities.
♻ ☆ Improving Uncertainty Estimation through Semantically Diverse Language Generation ICLR 2025
Large language models (LLMs) can suffer from hallucinations when generating text. These hallucinations impede various applications in society and industry by making LLMs untrustworthy. Current LLMs generate text in an autoregressive fashion by predicting and appending text tokens. When an LLM is uncertain about the semantic meaning of the next tokens to generate, it is likely to start hallucinating. Thus, it has been suggested that predictive uncertainty is one of the main causes of hallucinations. We introduce Semantically Diverse Language Generation (SDLG) to quantify predictive uncertainty in LLMs. SDLG steers the LLM to generate semantically diverse yet likely alternatives for an initially generated text. This approach provides a precise measure of aleatoric semantic uncertainty, detecting whether the initial text is likely to be hallucinated. Experiments on question-answering tasks demonstrate that SDLG consistently outperforms existing methods while being the most computationally efficient, setting a new standard for uncertainty estimation in LLMs.
comment: ICLR 2025
♻ ☆ Autoencoding Random Forests NeurIPS 2025
We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally represents relationships in the data. We provide exact and approximate solutions to the decoding problem via constrained optimization, split relabeling, and nearest neighbors regression. These methods effectively invert the compression pipeline, establishing a map from the embedding space back to the input space using splits learned by the ensemble's constituent trees. The resulting decoders are universally consistent under common regularity assumptions. The procedure works with supervised or unsupervised models, providing a window into conditional or joint distributions. We demonstrate various applications of this autoencoder, including powerful new tools for visualization, compression, clustering, and denoising. Experiments illustrate the ease and utility of our method in a wide range of settings, including tabular, image, and genomic data.
comment: 10 pages main text, 27 pages total. 9 figures, 4 tables. To be published in proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ GRAM-DTI: adaptive multimodal representation learning for drug target interaction prediction
Drug target interaction (DTI) prediction is a cornerstone of computational drug discovery, enabling rational design, repurposing, and mechanistic insights. While deep learning has advanced DTI modeling, existing approaches primarily rely on SMILES protein pairs and fail to exploit the rich multimodal information available for small molecules and proteins. We introduce GRAMDTI, a pretraining framework that integrates multimodal molecular and protein inputs into unified representations. GRAMDTI extends volume based contrastive learning to four modalities, capturing higher-order semantic alignment beyond conventional pairwise approaches. To handle modality informativeness, we propose adaptive modality dropout, dynamically regulating each modality's contribution during pre-training. Additionally, IC50 activity measurements, when available, are incorporated as weak supervision to ground representations in biologically meaningful interaction strengths. Experiments on four publicly available datasets demonstrate that GRAMDTI consistently outperforms state of the art baselines. Our results highlight the benefits of higher order multimodal alignment, adaptive modality utilization, and auxiliary supervision for robust and generalizable DTI prediction.
♻ ☆ Evolutionary Machine Learning meets Self-Supervised Learning: a comprehensive survey
The number of studies that combine Evolutionary Machine Learning and self-supervised learning has been growing steadily in recent years. Evolutionary Machine Learning has been shown to help automate the design of machine learning algorithms and to lead to more reliable solutions. Self-supervised learning, on the other hand, has produced good results in learning useful features when labelled data is limited. This suggests that the combination of these two areas can help both in shaping evolutionary processes and in automating the design of deep neural networks, while also reducing the need for labelled data. Still, there are no detailed reviews that explain how Evolutionary Machine Learning and self-supervised learning can be used together. To help with this, we provide an overview of studies that bring these areas together. Based on this growing interest and the range of existing works, we suggest a new sub-area of research, which we call Evolutionary Self-Supervised Learning and introduce a taxonomy for it. Finally, we point out some of the main challenges and suggest directions for future research to help Evolutionary Self-Supervised Learning grow and mature as a field.
♻ ☆ AI-driven software for automated quantification of skeletal metastases and treatment response evaluation using Whole-Body Diffusion-Weighted MRI (WB-DWI) in Advanced Prostate Cancer
Quantitative assessment of treatment response in Advanced Prostate Cancer (APC) with bone metastases remains an unmet clinical need. Whole-Body Diffusion-Weighted MRI (WB-DWI) provides two response biomarkers: Total Diffusion Volume (TDV) and global Apparent Diffusion Coefficient (gADC). However, tracking post-treatment changes of TDV and gADC from manually delineated lesions is cumbersome and increases inter-reader variability. We developed a software to automate this process. Core technologies include: (i) a weakly-supervised Residual U-Net model generating a skeleton probability map to isolate bone; (ii) a statistical framework for WB-DWI intensity normalisation, obtaining a signal-normalised b=900s/mm^2 (b900) image; and (iii) a shallow convolutional neural network that processes outputs from (i) and (ii) to generate a mask of suspected bone lesions, characterised by higher b900 signal intensity due to restricted water diffusion. This mask is applied to the gADC map to extract TDV and gADC statistics. We tested the tool using expert-defined metastatic bone disease delineations on 66 datasets, assessed repeatability of imaging biomarkers (N=10), and compared software-based response assessment with a construct reference standard (N=118). Average dice score between manual and automated delineations was 0.6 for lesions within pelvis and spine, with an average surface distance of 2mm. Relative differences for log-transformed TDV (log-TDV) and median gADC were 8.8% and 5%, respectively. Repeatability analysis showed coefficients of variation of 4.6% for log-TDV and 3.5% for median gADC, with intraclass correlation coefficients of 0.94 or higher. The software achieved 80.5% accuracy, 84.3% sensitivity, and 85.7% specificity in assessing response to treatment. Average computation time was 90s per scan.
♻ ☆ Progressive Growing of Patch Size: Curriculum Learning for Accelerated and Improved Medical Image Segmentation MICCAI2024
In this work, we introduce Progressive Growing of Patch Size, an automatic curriculum learning approach for 3D medical image segmentation. Our approach progressively increases the patch size during model training, resulting in an improved class balance for smaller patch sizes and accelerated convergence of the training process. We evaluate our curriculum approach in two settings: a resource-efficient mode and a performance mode, both regarding Dice score performance and computational costs across 15 diverse and popular 3D medical image segmentation tasks. The resource-efficient mode matches the Dice score performance of the conventional constant patch size sampling baseline with a notable reduction in training time to only 44%. The performance mode improves upon constant patch size segmentation results, achieving a statistically significant relative mean performance gain of 1.28% in Dice Score. Remarkably, across all 15 tasks, our proposed performance mode manages to surpass the constant patch size baseline in Dice Score performance, while simultaneously reducing training time to only 89%. The benefits are particularly pronounced for highly imbalanced tasks such as lesion segmentation tasks. Rigorous experiments demonstrate that our performance mode not only improves mean segmentation performance but also reduces performance variance, yielding more trustworthy model comparison. Furthermore, our findings reveal that the proposed curriculum sampling is not tied to a specific architecture but represents a broadly applicable strategy that consistently boosts performance across diverse segmentation models, including UNet, UNETR, and SwinUNETR. In summary, we show that this simple yet elegant transformation on input data substantially improves both Dice Score performance and training runtime, while being compatible across diverse segmentation backbones.
comment: Journal Extension of "Progressive Growing of Patch Size: Resource-Efficient Curriculum Learning for Dense Prediction Tasks" (MICCAI2024) submitted to MedIA
♻ ☆ Interpreting Emergent Features in Deep Learning-based Side-channel Analysis
Side-channel analysis (SCA) poses a real-world threat by exploiting unintentional physical signals to extract secret information from secure devices. Evaluation labs also use the same techniques to certify device security. In recent years, deep learning has emerged as a prominent method for SCA, achieving state-of-the-art attack performance at the cost of interpretability. Understanding how neural networks extract secrets is crucial for security evaluators aiming to defend against such attacks, as only by understanding the attack can one propose better countermeasures. In this work, we apply mechanistic interpretability to neural networks trained for SCA, revealing \textit{how} models exploit \textit{what} leakage in side-channel traces. We focus on sudden jumps in performance to reverse engineer learned representations, ultimately recovering secret masks and moving the evaluation process from black-box to white-box. Our results show that mechanistic interpretability can scale to realistic SCA settings, even when relevant inputs are sparse, model accuracies are low, and side-channel protections prevent standard input interventions.
comment: 17 pages, 13 figures, 1 table
♻ ☆ Large Language Models are Unreliable for Cyber Threat Intelligence
Several recent works have argued that Large Language Models (LLMs) can be used to tame the data deluge in the cybersecurity field, by improving the automation of Cyber Threat Intelligence (CTI) tasks. This work presents an evaluation methodology that other than allowing to test LLMs on CTI tasks when using zero-shot learning, few-shot learning and fine-tuning, also allows to quantify their consistency and their confidence level. We run experiments with three state-of-the-art LLMs and a dataset of 350 threat intelligence reports and present new evidence of potential security risks in relying on LLMs for CTI. We show how LLMs cannot guarantee sufficient performance on real-size reports while also being inconsistent and overconfident. Few-shot learning and fine-tuning only partially improve the results, thus posing doubts about the possibility of using LLMs for CTI scenarios, where labelled datasets are lacking and where confidence is a fundamental factor.
♻ ☆ HyperHELM: Hyperbolic Hierarchy Encoding for mRNA Language Modeling
Language models are increasingly applied to biological sequences like proteins and mRNA, yet their default Euclidean geometry may mismatch the hierarchical structures inherent to biological data. While hyperbolic geometry provides a better alternative for accommodating hierarchical data, it has yet to find a way into language modeling for mRNA sequences. In this work, we introduce HyperHELM, a framework that implements masked language model pre-training in hyperbolic space for mRNA sequences. Using a hybrid design with hyperbolic layers atop Euclidean backbone, HyperHELM aligns learned representations with the biological hierarchy defined by the relationship between mRNA and amino acids. Across multiple multi-species datasets, it outperforms Euclidean baselines on 9 out of 10 tasks involving property prediction, with 10% improvement on average, and excels in out-of-distribution generalization to long and low-GC content sequences; for antibody region annotation, it surpasses hierarchy-aware Euclidean models by 3% in annotation accuracy. Our results highlight hyperbolic geometry as an effective inductive bias for hierarchical language modeling of mRNA sequences.
♻ ☆ MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents NeurIPS 2025
Recent advances in operating system (OS) agents have enabled vision-language models (VLMs) to directly control a user's computer. Unlike conventional VLMs that passively output text, OS agents autonomously perform computer-based tasks in response to a single user prompt. OS agents do so by capturing, parsing, and analysing screenshots and executing low-level actions via application programming interfaces (APIs), such as mouse clicks and keyboard inputs. This direct interaction with the OS significantly raises the stakes, as failures or manipulations can have immediate and tangible consequences. In this work, we uncover a novel attack vector against these OS agents: Malicious Image Patches (MIPs), adversarially perturbed screen regions that, when captured by an OS agent, induce it to perform harmful actions by exploiting specific APIs. For instance, a MIP can be embedded in a desktop wallpaper or shared on social media to cause an OS agent to exfiltrate sensitive user data. We show that MIPs generalise across user prompts and screen configurations, and that they can hijack multiple OS agents even during the execution of benign instructions. These findings expose critical security vulnerabilities in OS agents that have to be carefully addressed before their widespread deployment.
comment: NeurIPS 2025
♻ ☆ Tongyi DeepResearch Technical Report
We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
comment: https://tongyi-agent.github.io/blog
♻ ☆ Strategic Classification with Non-Linear Classifiers
In strategic classification, the standard supervised learning setting is extended to support the notion of strategic user behavior in the form of costly feature manipulations made in response to a classifier. While standard learning supports a broad range of model classes, the study of strategic classification has, so far, been dedicated mostly to linear classifiers. This work aims to expand the horizon by exploring how strategic behavior manifests under non-linear classifiers and what this implies for learning. We take a bottom-up approach showing how non-linearity affects decision boundary points, classifier expressivity, and model class complexity. Our results show how, unlike the linear case, strategic behavior may either increase or decrease effective class complexity, and that the complexity decrease may be arbitrarily large. Another key finding is that universal approximators (e.g., neural nets) are no longer universal once the environment is strategic. We demonstrate empirically how this can create performance gaps even on an unrestricted model class.
♻ ☆ Dense Backpropagation Improves Training for Sparse Mixture-of-Experts NeurIPS 2025
Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means that MoEs only receive a sparse backward update, leading to training instability and suboptimal performance. We present a lightweight approximation method that gives the MoE router a dense gradient update while continuing to sparsely activate its parameters. Our method, which we refer to as Default MoE, substitutes missing expert activations with default outputs consisting of an exponential moving average of expert outputs previously seen over the course of training. This allows the router to receive signals from every expert for each token, leading to significant improvements in training performance. Our Default MoE outperforms standard TopK routing in a variety of settings without requiring significant computational overhead. Code: https://github.com/vatsal0/default-moe.
comment: NeurIPS 2025
♻ ☆ Tracking solutions of time-varying variational inequalities
Tracking the solution of time-varying variational inequalities is an important problem with applications in game theory, optimization, and machine learning. Existing work considers time-varying games or time-varying optimization problems. For strongly convex optimization problems or strongly monotone games, these results provide tracking guarantees under the assumption that the variation of the time-varying problem is restrained, that is, problems with a sublinear solution path. In this work we extend existing results in two ways: In our first result, we provide tracking bounds for (1) variational inequalities with a sublinear solution path but not necessarily monotone functions, and (2) for periodic time-varying variational inequalities that do not necessarily have a sublinear solution path-length. Our second main contribution is an extensive study of the convergence behavior and trajectory of discrete dynamical systems of periodic time-varying VI. We show that these systems can exhibit provably chaotic behavior or can converge to the solution. Finally, we illustrate our theoretical results with experiments.
♻ ☆ Not All Clients Are Equal: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients
As AI becomes more personal, e.g., Agentic AI, there is an increasing need for personalizing models for various use cases. Personalized federated learning (PFL) enables each client to collaboratively leverage other clients' knowledge for better adaptation to the task of interest, without privacy risks. Despite its potential, existing PFL methods remain confined to rather simplified scenarios where data and models are the same across clients. To move towards realistic scenarios, we propose FedMosaic, a method that jointly addresses data and model heterogeneity with a task-relevance-aware model aggregation strategy to reduce parameter interference, and a dimension-invariant module that enables knowledge sharing across heterogeneous architectures without huge computational cost. To mimic the real-world task diversity, we propose a multi-modal PFL benchmark spanning 40 distinct tasks with distribution shifts over time. The empirical study shows that FedMosaic outperforms the state-of-the-art PFL methods, excelling in both personalization and generalization capabilities under challenging, realistic scenarios.
♻ ☆ Link Prediction with Untrained Message Passing Layers
Message passing neural networks (MPNNs) operate on graphs by exchanging information between neigbouring nodes. MPNNs have been successfully applied to various node-, edge-, and graph-level tasks in areas like molecular science, computer vision, natural language processing, and combinatorial optimization. However, most MPNNs require training on large amounts of labeled data, which can be costly and time-consuming. In this work, we explore the use of various untrained message passing layers in graph neural networks, i.e. variants of popular message passing architecture where we remove all trainable parameters that are used to transform node features in the message passing step. Focusing on link prediction, we find that untrained message passing layers can lead to competitive and even superior performance compared to fully trained MPNNs, especially in the presence of high-dimensional features. We provide a theoretical analysis of untrained message passing by relating the inner products of features implicitly produced by untrained message passing layers to path-based topological node similarity measures. As such, untrained message passing architectures can be viewed as a highly efficient and interpretable approach to link prediction.
♻ ☆ NMCSE: Noise-Robust Multi-Modal Coupling Signal Estimation Method via Optimal Transport for Cardiovascular Disease Detection
The coupling signal refers to a latent physiological signal that characterizes the transformation from cardiac electrical excitation, captured by the electrocardiogram (ECG), to mechanical contraction, recorded by the phonocardiogram (PCG). By encoding the temporal and functional interplay between electrophysiological and hemodynamic events, it serves as an intrinsic link between modalities and offers a unified representation of cardiac function, with strong potential to enhance multi-modal cardiovascular disease (CVD) detection. However, existing coupling signal estimation methods remain highly vulnerable to noise, particularly in real-world clinical and physiological settings, which undermines their robustness and limits practical value. In this study, we propose Noise-Robust Multi-Modal Coupling Signal Estimation (NMCSE), which reformulates coupling signal estimation as a distribution matching problem solved via optimal transport. By jointly aligning amplitude and timing, NMCSE avoids noise amplification and enables stable signal estimation. When integrated into a Temporal-Spatial Feature Extraction (TSFE) network, the estimated coupling signal effectively enhances multi-modal fusion for more accurate CVD detection. To evaluate robustness under real-world conditions, we design two complementary experiments targeting distinct sources of noise. The first uses the PhysioNet 2016 dataset with simulated hospital noise to assess the resilience of NMCSE to clinical interference. The second leverages the EPHNOGRAM dataset with motion-induced physiological noise to evaluate intra-state estimation stability across activity levels. Experimental results show that NMCSE consistently outperforms existing methods under both clinical and physiological noise, highlighting it as a noise-robust estimation approach that enables reliable multi-modal cardiac detection in real-world conditions.
♻ ☆ When Is Diversity Rewarded in Cooperative Multi-Agent Learning?
The success of teams in robotics, nature, and society often depends on the division of labor among diverse specialists; however, a principled explanation for when such diversity surpasses a homogeneous team is still missing. Focusing on multi-agent task allocation problems, we study this question from the perspective of reward design: what kinds of objectives are best suited for heterogeneous teams? We first consider an instantaneous, non-spatial setting where the global reward is built by two generalized aggregation operators: an inner operator that maps the $N$ agents' effort allocations on individual tasks to a task score, and an outer operator that merges the $M$ task scores into the global team reward. We prove that the curvature of these operators determines whether heterogeneity can increase reward, and that for broad reward families this collapses to a simple convexity test. Next, we ask what incentivizes heterogeneity to emerge when embodied, time-extended agents must learn an effort allocation policy. To study heterogeneity in such settings, we use multi-agent reinforcement learning (MARL) as our computational paradigm, and introduce Heterogeneity Gain Parameter Search (HetGPS), a gradient-based algorithm that optimizes the parameter space of underspecified MARL environments to find scenarios where heterogeneity is advantageous. Across different environments, we show that HetGPS rediscovers the reward regimes predicted by our theory to maximize the advantage of heterogeneity, both validating HetGPS and connecting our theoretical insights to reward design in MARL. Together, these results help us understand when behavioral diversity delivers a measurable benefit.
♻ ☆ Exploring Federated Learning for Thermal Urban Feature Segmentation -- A Comparison of Centralized and Decentralized Approaches CCS
Federated Learning (FL) is an approach for training a shared Machine Learning (ML) model with distributed training data and multiple participants. FL allows bypassing limitations of the traditional Centralized Machine Learning CL if data cannot be shared or stored centrally due to privacy or technical restrictions -- the participants train the model locally with their training data and do not need to share it among the other participants. This paper investigates the practical implementation and effectiveness of FL in a real-world scenario, specifically focusing on unmanned aerial vehicle (UAV)-based thermal images for common thermal feature detection in urban environments. The distributed nature of the data arises naturally and makes it suitable for FL applications, as images captured in two German cities are available. This application presents unique challenges due to non-identical distribution and feature characteristics of data captured at both locations. The study makes several key contributions by evaluating FL algorithms in real deployment scenarios rather than simulation. We compare several FL approaches with a centralized learning baseline across key performance metrics such as model accuracy, training time, communication overhead, and energy usage. This paper also explores various FL workflows, comparing client-controlled workflows and server-controlled workflows. The findings of this work serve as a valuable reference for understanding the practical application and limitations of the FL methods in segmentation tasks in UAV-based imaging.
comment: The Version of Record of this contribution is published in Computational Science and Its Applications (ICCSA) 2025, and is available online at https://doi.org/10.1007/978-3-031-97000-9
♻ ☆ Investigating the Robustness of Knowledge Tracing Models in the Presence of Student Concept Drift
Knowledge Tracing (KT) has been an established problem in the educational data mining field for decades, and it is commonly assumed that the underlying learning process being modeled remains static. Given the ever-changing landscape of online learning platforms (OLPs), we investigate how concept drift and changing student populations can impact student behavior within an OLP through testing model performance both within a single academic year and across multiple academic years. Four well-studied KT models were applied to five academic years of data to assess how susceptible KT models are to concept drift. Through our analysis, we find that all four families of KT models can exhibit degraded performance, Bayesian Knowledge Tracing (BKT) remains the most stable KT model when applied to newer data, while more complex, attention based models lose predictive power significantly faster.
comment: 10 pages, 6 figures
♻ ☆ From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit NeurIPS 2025
Motivated by the hypothesis that neural network representations encode abstract, interpretable features as linearly accessible, approximately orthogonal directions, sparse autoencoders (SAEs) have become a popular tool in interpretability. However, recent work has demonstrated phenomenology of model representations that lies outside the scope of this hypothesis, showing signatures of hierarchical, nonlinear, and multi-dimensional features. This raises the question: do SAEs represent features that possess structure at odds with their motivating hypothesis? If not, does avoiding this mismatch help identify said features and gain further insights into neural network representations? To answer these questions, we take a construction-based approach and re-contextualize the popular matching pursuits (MP) algorithm from sparse coding to design MP-SAE -- an SAE that unrolls its encoder into a sequence of residual-guided steps, allowing it to capture hierarchical and nonlinearly accessible features. Comparing this architecture with existing SAEs on a mixture of synthetic and natural data settings, we show: (i) hierarchical concepts induce conditionally orthogonal features, which existing SAEs are unable to faithfully capture, and (ii) the nonlinear encoding step of MP-SAE recovers highly meaningful features, helping us unravel shared structure in the seemingly dichotomous representation spaces of different modalities in a vision-language model, hence demonstrating the assumption that useful features are solely linearly accessible is insufficient. We also show that the sequential encoder principle of MP-SAE affords an additional benefit of adaptive sparsity at inference time, which may be of independent interest. Overall, we argue our results provide credence to the idea that interpretability should begin with the phenomenology of representations, with methods emerging from assumptions that fit it.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Scalable and Cost-Efficient de Novo Template-Based Molecular Generation
Template-based molecular generation offers a promising avenue for drug design by ensuring generated compounds are synthetically accessible through predefined reaction templates and building blocks. In this work, we tackle three core challenges in template-based GFlowNets: (1) minimizing synthesis cost, (2) scaling to large building block libraries, and (3) effectively utilizing small fragment sets. We propose Recursive Cost Guidance, a backward policy framework that employs auxiliary machine learning models to approximate synthesis cost and viability. This guidance steers generation toward low-cost synthesis pathways, significantly enhancing cost-efficiency, molecular diversity, and quality, especially when paired with an Exploitation Penalty that balances the trade-off between exploration and exploitation. To enhance performance in smaller building block libraries, we develop a Dynamic Library mechanism that reuses intermediate high-reward states to construct full synthesis trees. Our approach establishes state-of-the-art results in template-based molecular generation.
♻ ☆ MTL-KD: Multi-Task Learning Via Knowledge Distillation for Generalizable Neural Vehicle Routing Solver NeurIPS 2025
Multi-Task Learning (MTL) in Neural Combinatorial Optimization (NCO) is a promising approach to train a unified model capable of solving multiple Vehicle Routing Problem (VRP) variants. However, existing Reinforcement Learning (RL)-based multi-task methods can only train light decoder models on small-scale problems, exhibiting limited generalization ability when solving large-scale problems. To overcome this limitation, this work introduces a novel multi-task learning method driven by knowledge distillation (MTL-KD), which enables the efficient training of heavy decoder models with strong generalization ability. The proposed MTL-KD method transfers policy knowledge from multiple distinct RL-based single-task models to a single heavy decoder model, facilitating label-free training and effectively improving the model's generalization ability across diverse tasks. In addition, we introduce a flexible inference strategy termed Random Reordering Re-Construction (R3C), which is specifically adapted for diverse VRP tasks and further boosts the performance of the multi-task model. Experimental results on 6 seen and 10 unseen VRP variants with up to 1000 nodes indicate that our proposed method consistently achieves superior performance on both uniform and real-world benchmarks, demonstrating robust generalization abilities.
comment: Accepted at NeurIPS 2025
♻ ☆ Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit
Sparse autoencoders (SAEs) have recently become central tools for interpretability, leveraging dictionary learning principles to extract sparse, interpretable features from neural representations whose underlying structure is typically unknown. This paper evaluates SAEs in a controlled setting using MNIST, which reveals that current shallow architectures implicitly rely on a quasi-orthogonality assumption that limits the ability to extract correlated features. To move beyond this, we compare them with an iterative SAE that unrolls Matching Pursuit (MP-SAE), enabling the residual-guided extraction of correlated features that arise in hierarchical settings such as handwritten digit generation while guaranteeing monotonic improvement of the reconstruction as more atoms are selected.
comment: Complementary work to arXiv:2506.03093
♻ ☆ Cash Flow Underwriting with Bank Transaction Data: Advancing MSME Financial Inclusion in Malaysia
Despite accounting for 96.1% of all businesses in Malaysia, access to financing remains one of the most persistent challenges faced by Micro, Small, and Medium Enterprises (MSMEs). Newly established or young businesses are often excluded from formal credit markets as traditional underwriting approaches rely heavily on credit bureau data. This study investigates the potential of bank statement data as an alternative data source for credit assessment to promote financial inclusion in emerging markets. Firstly, we propose a cash flow-based underwriting pipeline where we utilise bank statement data for end-to-end data extraction and machine learning credit scoring. Secondly, we introduce a novel dataset of 611 loan applicants from a Malaysian lending institution. Thirdly, we develop and evaluate credit scoring models based on application information and bank transaction-derived features. Empirical results show that the use of such data boosts the performance of all models on our dataset, which can improve credit scoring for new-to-lending MSMEs. Lastly, we intend to release the anonymised bank transaction dataset to facilitate further research on MSMEs financial inclusion within Malaysia's emerging economy.
comment: Accepted for oral presentation at the AI for Financial Inclusion, Risk Modeling and Resilience in Emerging Markets (FinRem) Workshop at ACM ICAIF 2025, Singapore
♻ ☆ FlowRL: Matching Reward Distributions for LLM Reasoning
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (\eg, PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of $10.0\%$ over GRPO and $5.1\%$ over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
♻ ☆ Modeling Hierarchical Spaces: A Review and Unified Framework for Surrogate-Based Architecture Design
Simulation-based problems involving mixed-variable inputs frequently feature domains that are hierarchical, conditional, heterogeneous, or tree-structured. These characteristics pose challenges for data representation, modeling, and optimization. This paper reviews extensive literature on these structured input spaces and proposes a unified framework that generalizes existing approaches. In this framework, input variables may be continuous, integer, or categorical. A variable is described as meta if its value governs the presence of other decreed variables, enabling the modeling of conditional and hierarchical structures. We further introduce the concept of partially-decreed variables, whose activation depends on contextual conditions. To capture these inter-variable hierarchical relationships, we introduce design space graphs, combining principles from feature modeling and graph theory. This allows the definition of general hierarchical domains suitable for describing complex system architectures. Our framework defines hierarchical distances and kernels to enable surrogate modeling and optimization on hierarchical domains. We demonstrate its effectiveness on complex system design problems, including a neural network and a green-aircraft case study. Our methods are available in the open-source Surrogate Modeling Toolbox (SMT 2.0).
comment: version number two
♻ ☆ ReNF: Rethinking the Design Space of Neural Long-Term Time Series Forecasters
Neural Forecasters (NFs) are a cornerstone of Long-term Time Series Forecasting (LTSF). However, progress has been hampered by an overemphasis on architectural complexity at the expense of fundamental forecasting principles. In this work, we return to first principles to redesign the LTSF paradigm. We begin by introducing a Multiple Neural Forecasting Theorem that provides a theoretical basis for our approach. We propose Boosted Direct Output (BDO), a novel forecasting strategy that synergistically combines the advantages of both Auto-Regressive (AR) and Direct Output (DO). In addition, we stabilize the learning process by smoothly tracking the model's parameters. Extensive experiments show that these principled improvements enable a simple MLP to achieve state-of-the-art performance, outperforming recent, complex models in nearly all cases, without any specific considerations in the area. Finally, we empirically verify our theorem, establishing a dynamic performance bound and identifying promising directions for future research. The code for review is available at: .
♻ ☆ QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation NeurIPS 2025
The remarkable progress of Large Language Models (LLMs) presents promising opportunities for Verilog code generation which is significantly important for automated circuit design. The lacking of meaningful functional rewards hinders the preference optimization based on Reinforcement Learning (RL) for producing functionally correct Verilog code. In this paper, we propose Signal-Aware Learning for Verilog code generation (QiMeng-SALV) by leveraging code segments of functionally correct output signal to optimize RL training. Considering Verilog code specifies the structural interconnection of hardware gates and wires so that different output signals are independent, the key insight of QiMeng-SALV is to extract verified signal-aware implementations in partially incorrect modules, so as to enhance the extraction of meaningful functional rewards. Roughly, we verify the functional correctness of signals in generated module by comparing with that of reference module in the training data. Then abstract syntax tree (AST) is employed to identify signal-aware code segments which can provide meaningful functional rewards from erroneous modules. Finally, we introduce signal-aware DPO which is optimized on the correct signal-level code segments, thereby preventing noise and interference from incorrect signals. The proposed QiMeng-SALV underscores the paradigm shift from conventional module-level to fine-grained signal-level optimization in Verilog code generation, addressing the issue of insufficient functional rewards. Experiments demonstrate that our method achieves state-of-the-art performance on VerilogEval and RTLLM, with a 7B parameter model matching the performance of the DeepSeek v3 671B model and significantly outperforming the leading open-source model CodeV trained on the same dataset. Our code is available at https://github.com/zy1xxx/SALV.
comment: Accepted to NeurIPS 2025
♻ ☆ Explainable Graph Neural Architecture Search via Monte-Carlo Tree Search (Full version)
The number of graph neural network (GNN) architectures has increased rapidly due to the growing adoption of graph analysis. Although we use GNNs in wide application scenarios, it is a laborious task to design/select optimal GNN architectures in diverse graphs. To reduce human efforts, graph neural architecture search (Graph NAS) has been used to search for a sub-optimal GNN architecture that combines existing components. However, existing Graph NAS methods lack explainability to understand the reasons why the model architecture is selected because they use complex search space and neural models to select architecture. Therefore, we propose an explainable Graph NAS method, called ExGNAS, which consists of (i) a simple search space that can adapt to various graphs and (ii) a search algorithm with Monte-Carlo tree that makes the decision process explainable. The combination of our search space and algorithm achieves finding accurate GNN models and the important functions within the search space. We comprehensively evaluate ExGNAS compared with four state-of-the-art Graph NAS methods in twelve graphs. Our experimental results show that ExGNAS achieves high average accuracy and efficiency; improving accuracy up to 26.1% and reducing run time up to 88%. Furthermore, we show the effectiveness of explainability by questionnaire-based user study and architecture analysis.
♻ ☆ Differentiable Fast Top-K Selection for Large-Scale Recommendation
Cascade ranking is a widely adopted paradigm in large-scale information retrieval systems for Top-K item selection. However, the Top-K operator is non-differentiable, hindering end-to-end training. Existing methods include Learning-to-Rank approaches (e.g., LambdaLoss), which optimize ranking metrics like NDCG and suffer from objective misalignment, and differentiable sorting-based methods (e.g., ARF, LCRON), which relax permutation matrices for direct Top-K optimization but introduce gradient conflicts through matrix aggregation. A promising alternative is to directly construct a differentiable approximation of the Top-K selection operator, bypassing the use of soft permutation matrices. However, even state-of-the-art differentiable Top-K operator (e.g., LapSum) require $O(n \log n)$ complexity due to their dependence on sorting for solving the threshold. Thus, we propose DFTopK, a novel differentiable Top-K operator achieving optimal $O(n)$ time complexity. By relaxing normalization constraints, DFTopK admits a closed-form solution and avoids sorting. DFTopK also avoids the gradient conflicts inherent in differentiable sorting-based methods. We evaluate DFTopK on both the public benchmark RecFLow and an industrial system. Experimental results show that DFTopK significantly improves training efficiency while achieving superior performance, which enables us to scale up training samples more efficiently. In the online A/B test, DFTopK yielded a +1.77% revenue lift with the same computational budget compared to the baseline. To the best of our knowledge, this work is the first to introduce differentiable Top-K operators into recommendation systems and the first to achieve theoretically optimal linear-time complexity for Top-K selection. We have open-sourced our implementation to facilitate future research in both academia and industry.
comment: 12 pages, 5 figures. Minor corrections in v2
♻ ☆ WaveStitch: Flexible and Fast Conditional Time Series Generation with Diffusion Models SIGMOD 2026
Generating temporal data under conditions is crucial for forecasting, imputation, and generative tasks. Such data often has metadata and partially observed signals that jointly influence the generated values. However, existing methods face three key limitations: (1) they condition on either the metadata or observed values, but rarely both together; (2) they adopt either training-time approaches that fail to generalize to unseen scenarios, or inference-time approaches that ignore metadata; and (3) they suffer from trade-offs between generation speed and temporal coherence across time windows--choosing either slow but coherent autoregressive methods or fast but incoherent parallel ones. We propose WaveStitch, a novel diffusion-based method to overcome these hurdles through: (1) dual-sourced conditioning on both metadata and partially observed signals; (2) a hybrid training-inference architecture, incorporating metadata during training and observations at inference via gradient-based guidance; and (3) a novel pipeline-style paradigm that generates time windows in parallel while preserving coherence through an inference-time conditional loss and a stitching mechanism. Across diverse datasets, WaveStitch demonstrates adaptability to arbitrary patterns of observed signals, achieving 1.81x lower mean-squared-error compared to the state-of-the-art, and generates data up to 166.48x faster than autoregressive methods while maintaining coherence. Our code is available at: https://github.com/adis98/WaveStitch
comment: Accepted at ACM SIGMOD 2026
♻ ☆ Are Foundational Atomistic Models Reliable for Finite-Temperature Molecular Dynamics?
Machine learning force fields have emerged as promising tools for molecular dynamics (MD) simulations, potentially offering quantum-mechanical accuracy with the efficiency of classical MD. Inspired by foundational large language models, recent years have seen considerable progress in developing foundational atomistic models, sometimes referred to as universal force fields, designed to cover most elements in the periodic table. This Perspective adopts a practitioner's viewpoint to ask a critical question: Are these foundational atomistic models reliable for one of their most compelling applications, in particular simulating finite-temperature dynamics? Instead of a broad benchmark, we use the canonical ferroelectric-paraelectric phase transition in PbTiO$_3$ as a focused case study to evaluate prominent foundational atomistic models. Our findings suggest a potential disconnect between static accuracy and dynamic reliability. While 0 K properties are often well-reproduced, we observed that the models can struggle to consistently capture the correct phase transition, sometimes exhibiting simulation instabilities. We believe these challenges may stem from inherent biases in training data and a limited description of anharmonicity. These observed shortcomings, though demonstrated on a single system, appear to point to broader, systemic challenges that can be addressed with targeted fine-tuning. This Perspective serves not to rank models, but to initiate a crucial discussion on the practical readiness of foundational atomistic models and to explore future directions for their improvement.
comment: 27 pages, 5 figures
♻ ☆ Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance EMNLP 2025
Although synthetic data has changed various aspects of information retrieval (IR) pipelines, the main training paradigm remains: contrastive learning with binary relevance labels, where one positive document is compared against several negatives using the InfoNCE loss. This objective treats all documents that are not explicitly annotated as relevant on an equally negative footing, regardless of their actual degree of relevance, thus missing subtle nuances useful for ranking. To overcome this limitation, in this work, we forgo real documents and annotations and use large language models to directly generate synthetic documents that answer the MS MARCO queries according to several different levels of relevance. We also propose using Wasserstein distance as a more effective loss function for training transformer-based retrievers with graduated relevance labels. Our experiments on MS MARCO and BEIR benchmark show that our proposed approach outperforms conventional training with InfoNCE by a large margin. Without using any real documents, our method significantly improves self-supervised retrievers and is more robust to distribution shift compared to contrastive learning using real data. Our method also successfully integrates existing real data into the synthetic ranking context, further boosting the performance. Overall, we show that generating multi-level ranking contexts is a better approach to synthetic data generation for IR than just generating the standard positive and negative documents.
comment: Findings of the EMNLP 2025
♻ ☆ An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning
Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-$K$ discards structural information and performs poorly in practice, while Top-$K$ preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-$K$, an {All-Reduce}-Compatible Top-$K$ compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-$K$ is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions. Empirically, ARC-Top-$K$ matches the accuracy of Top-$K$ while reducing wall-clock training time by up to 60.7\%, offering an efficient and scalable solution that combines the robustness of Rand-$K$ with the strong performance of Top-$K$.
comment: 8 pages, 2 figures
♻ ☆ Deep Generative Models for Enhanced Vitreous OCT Imaging
Purpose: To evaluate deep learning (DL) models for enhancing vitreous optical coherence tomography (OCT) image quality and reducing acquisition time. Methods: Conditional Denoising Diffusion Probabilistic Models (cDDPMs), Brownian Bridge Diffusion Models (BBDMs), U-Net, Pix2Pix, and Vector-Quantised Generative Adversarial Network (VQ-GAN) were used to generate high-quality spectral-domain (SD) vitreous OCT images. Inputs were SD ART10 images, and outputs were compared to pseudoART100 images obtained by averaging ten ART10 images per eye location. Model performance was assessed using image quality metrics and Visual Turing Tests, where ophthalmologists ranked generated images and evaluated anatomical fidelity. The best model's performance was further tested within the manually segmented vitreous on newly acquired data. Results: U-Net achieved the highest Peak Signal-to-Noise Ratio (PSNR: 30.230) and Structural Similarity Index Measure (SSIM: 0.820), followed by cDDPM. For Learned Perceptual Image Patch Similarity (LPIPS), Pix2Pix (0.697) and cDDPM (0.753) performed best. In the first Visual Turing Test, cDDPM ranked highest (3.07); in the second (best model only), cDDPM achieved a 32.9% fool rate and 85.7% anatomical preservation. On newly acquired data, cDDPM generated vitreous regions more similar in PSNR to the ART100 reference than true ART1 or ART10 B-scans and achieved higher PSNR on whole images when conditioned on ART1 than ART10. Conclusions: Results reveal discrepancies between quantitative metrics and clinical evaluation, highlighting the need for combined assessment. cDDPM showed strong potential for generating clinically meaningful vitreous OCT images while reducing acquisition time fourfold. Translational Relevance: cDDPMs show promise for clinical integration, supporting faster, higher-quality vitreous imaging. Dataset and code will be made publicly available.
♻ ☆ LEASE: Offline Preference-based Reinforcement Learning with High Sample Efficiency
Offline preference-based reinforcement learning (PbRL) provides an effective way to overcome the challenges of designing reward and the high costs of online interaction. However, since labeling preference needs real-time human feedback, acquiring sufficient preference labels is challenging. To solve this, this paper proposes a offLine prEference-bAsed RL with high Sample Efficiency (LEASE) algorithm, where a learned transition model is leveraged to generate unlabeled preference data. Considering the pretrained reward model may generate incorrect labels for unlabeled data, we design an uncertainty-aware mechanism to ensure the performance of reward model, where only high confidence and low variance data are selected. Moreover, we provide the generalization bound of reward model to analyze the factors influencing reward accuracy, and demonstrate that the policy learned by LEASE has theoretical improvement guarantee. The developed theory is based on state-action pair, which can be easily combined with other offline algorithms. The experimental results show that LEASE can achieve comparable performance to baseline under fewer preference data without online interaction.
comment: 17 pages, 5 figures
♻ ☆ Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models NeurIPS 2025
Robust coordination is critical for effective decision-making in multi-agent systems, especially under partial observability. A central question in Multi-Agent Reinforcement Learning (MARL) is whether to engineer communication protocols or learn them end-to-end. We investigate this dichotomy using embodied world models. We propose and compare two communication strategies for a cooperative task-allocation problem. The first, Learned Direct Communication (LDC), learns a protocol end-to-end. The second, Intention Communication, uses an engineered inductive bias: a compact, learned world model, the Imagined Trajectory Generation Module (ITGM), which uses the agent's own policy to simulate future states. A Message Generation Network (MGN) then compresses this plan into a message. We evaluate these approaches on goal-directed interaction in a grid world, a canonical abstraction for embodied AI problems, while scaling environmental complexity. Our experiments reveal that while emergent communication is viable in simple settings, the engineered, world model-based approach shows superior performance, sample efficiency, and scalability as complexity increases. These findings advocate for integrating structured, predictive models into MARL agents to enable active, goal-driven coordination.
comment: Published in the Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Scaling Environments for Agents (SEA). Additionally accepted for presentation in the NeurIPS 2025 Workshop: Embodied World Models for Decision Making (EWM) and the NeurIPS 2025 Workshop: Optimization for Machine Learning (OPT)
♻ ☆ Co-Evolving Complexity: An Adversarial Framework for Automatic MARL Curricula NeurIPS 2025
The advancement of general-purpose intelligent agents is intrinsically linked to the environments in which they are trained. While scaling models and datasets has yielded remarkable capabilities, scaling the complexity, diversity, and interactivity of environments remains a crucial bottleneck. Hand-crafted environments are finite and often contain implicit biases, limiting the potential for agents to develop truly generalizable and robust skills. In this work, we propose a paradigm for generating a boundless and adaptive curriculum of challenges by framing the environment generation process as an adversarial game. We introduce a system where a team of cooperative multi-agent defenders learns to survive against a procedurally generative attacker. The attacker agent learns to produce increasingly challenging configurations of enemy units, dynamically creating novel worlds tailored to exploit the defenders' current weaknesses. Concurrently, the defender team learns cooperative strategies to overcome these generated threats. This co-evolutionary dynamic creates a self-scaling environment where complexity arises organically from the adversarial interaction, providing an effectively infinite stream of novel and relevant training data. We demonstrate that with minimal training, this approach leads to the emergence of complex, intelligent behaviors, such as flanking and shielding by the attacker, and focus-fire and spreading by the defenders. Our findings suggest that adversarial co-evolution is a powerful mechanism for automatically scaling environmental complexity, driving agents towards greater robustness and strategic depth.
comment: Published in the proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Scaling Environments for Agents (SEA)
♻ ☆ A Unified Framework for Variable Selection in Model-Based Clustering with Missing Not at Random NeurIPS 2025
Model-based clustering integrated with variable selection is a powerful tool for uncovering latent structures within complex data. However, its effectiveness is often hindered by challenges such as identifying relevant variables that define heterogeneous subgroups and handling data that are missing not at random, a prevalent issue in fields like transcriptomics. While several notable methods have been proposed to address these problems, they typically tackle each issue in isolation, thereby limiting their flexibility and adaptability. This paper introduces a unified framework designed to address these challenges simultaneously. Our approach incorporates a data-driven penalty matrix into penalized clustering to enable more flexible variable selection, along with a mechanism that explicitly models the relationship between missingness and latent class membership. We demonstrate that, under certain regularity conditions, the proposed framework achieves both asymptotic consistency and selection consistency, even in the presence of missing data. This unified strategy significantly enhances the capability and efficiency of model-based clustering, advancing methodologies for identifying informative variables that define homogeneous subgroups in the presence of complex missing data patterns. The performance of the framework, including its computational efficiency, is evaluated through simulations and demonstrated using both synthetic and real-world transcriptomic datasets.
comment: Binh H. Ho, Long Nguyen Chi, and TrungTin Nguyen are co-first authors. Correct final typos for the NeurIPS 2025 camera-ready version
♻ ☆ The Physical Basis of Prediction: World Model Formation in Neural Organoids via an LLM-Generated Curriculum NeurIPS 2025
The capacity of an embodied agent to understand, predict, and interact with its environment is fundamentally contingent on an internal world model. This paper introduces a novel framework for investigating the formation and adaptation of such world models within a biological substrate: human neural organoids. We present a curriculum of three scalable, closed-loop virtual environments designed to train these biological agents and probe the underlying synaptic mechanisms of learning, such as long-term potentiation (LTP) and long-term depression (LTD). We detail the design of three distinct task environments that demand progressively more sophisticated world models for successful decision-making: (1) a conditional avoidance task for learning static state-action contingencies, (2) a one-dimensional predator-prey scenario for goal-directed interaction, and (3) a replication of the classic Pong game for modeling dynamic, continuous-time systems. For each environment, we formalize the state and action spaces, the sensory encoding and motor decoding mechanisms, and the feedback protocols based on predictable (reward) and unpredictable (punishment) stimulation, which serve to drive model refinement. In a significant methodological advance, we propose a meta-learning approach where a Large Language Model automates the generative design and optimization of experimental protocols, thereby scaling the process of environment and curriculum design. Finally, we outline a multi-modal evaluation strategy that moves beyond task performance to directly measure the physical correlates of the learned world model by quantifying synaptic plasticity at electrophysiological, cellular, and molecular levels. This work bridges the gap between model-based reinforcement learning and computational neuroscience, offering a unique platform for studying embodiment, decision-making, and the physical basis of intelligence.
comment: Published in the proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Scaling Environments for Agents (SEA). Additionally accepted for presentation in NeurIPS 2025 Workshop: Embodied World Models for Decision Making
♻ ☆ UFGraphFR: Graph Federation Recommendation System based on User Text description features
Federated learning offers a privacy-preserving framework for recommendation systems by enabling local data processing; however, data localization introduces substantial obstacles. Traditional federated recommendation approaches treat each user as an isolated entity, failing to construct global user relationship graphs that capture collaborative signals, which limits the accuracy of recommendations. To address this limitation, we derive insight from the insight that semantic similarity reflects preference. similarity, which can be used to improve the construction of user relationship graphs. This paper proposes UFGraphFR, a novel framework with three key components: 1) On the client side, private structured data is first transformed into text descriptions. These descriptions are then encoded into semantic vectors using pre-trained models; 2) On the server side, user relationship graphs are securely reconstructed using aggregated model weights without accessing raw data, followed by information propagation through lightweight graph neural networks; 3) On the client side, user behavior sequences are personalized using Transformer architectures. Extensive experiments conducted on four benchmark datasets demonstrate that UFGraphFR significantly outperforms state-of-the-art baselines in both recommendation accuracy and personalization. The framework also maintains robustness across different pre-trained models, as evidenced by the consistent performance metrics obtained. This work provides a practical method for efficient federated recommendations with strict privacy by using semantic vectors, secure user relationship graphs, and personalized behavior sequences. The code is available at: https://github.com/trueWangSyutung/UFGraphFR
♻ ☆ Generative World Models of Tasks: LLM-Driven Hierarchical Scaffolding for Embodied Agents NeurIPS 2025
Recent advances in agent development have focused on scaling model size and raw interaction data, mirroring successes in large language models. However, for complex, long-horizon multi-agent tasks such as robotic soccer, this end-to-end approach often fails due to intractable exploration spaces and sparse rewards. We propose that an effective world model for decision-making must model the world's physics and also its task semantics. A systematic review of 2024 research in low-resource multi-agent soccer reveals a clear trend towards integrating symbolic and hierarchical methods, such as Hierarchical Task Networks (HTNs) and Bayesian Strategy Networks (BSNs), with multi-agent reinforcement learning (MARL). These methods decompose complex goals into manageable subgoals, creating an intrinsic curriculum that shapes agent learning. We formalize this trend into a framework for Hierarchical Task Environments (HTEs), which are essential for bridging the gap between simple, reactive behaviors and sophisticated, strategic team play. Our framework incorporates the use of Large Language Models (LLMs) as generative world models of tasks, capable of dynamically generating this scaffolding. We argue that HTEs provide a mechanism to guide exploration, generate meaningful learning signals, and train agents to internalize hierarchical structure, enabling the development of more capable and general-purpose agents with greater sample efficiency than purely end-to-end approaches.
comment: In the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Embodied World Models for Decision Making (EWM)
♻ ☆ Multi-refined Feature Enhanced Sentiment Analysis Using Contextual Instruction
Sentiment analysis using deep learning and pre-trained language models (PLMs) has gained significant traction due to their ability to capture rich contextual representations. However, existing approaches often underperform in scenarios involving nuanced emotional cues, domain shifts, and imbalanced sentiment distributions. We argue that these limitations stem from inadequate semantic grounding, poor generalization to diverse linguistic patterns, and biases toward dominant sentiment classes. To overcome these challenges, we propose CISEA-MRFE, a novel PLM-based framework integrating Contextual Instruction (CI), Semantic Enhancement Augmentation (SEA), and Multi-Refined Feature Extraction (MRFE). CI injects domain-aware directives to guide sentiment disambiguation; SEA improves robustness through sentiment-consistent paraphrastic augmentation; and MRFE combines a Scale-Adaptive Depthwise Encoder (SADE) for multi-scale feature specialization with an Emotion Evaluator Context Encoder (EECE) for affect-aware sequence modeling. Experimental results on four benchmark datasets demonstrate that CISEA-MRFE consistently outperforms strong baselines, achieving relative improvements in accuracy of up to 4.6% on IMDb, 6.5% on Yelp, 30.3% on Twitter, and 4.1% on Amazon. These results validate the effectiveness and generalization ability of our approach for sentiment classification across varied domains.
♻ ☆ Reliably Detecting Model Failures in Deployment Without Labels NeurIPS 2025
The distribution of data changes over time; models operating in dynamic environments need retraining. But knowing when to retrain, without access to labels, is an open challenge since some, but not all shifts degrade model performance. This paper formalizes and addresses the problem of post-deployment deterioration (PDD) monitoring. We propose D3M, a practical and efficient monitoring algorithm based on the disagreement of predictive models, achieving low false positive rates under non-deteriorating shifts and provides sample complexity bounds for high true positive rates under deteriorating shifts. Empirical results on both standard benchmark and a real-world large-scale internal medicine dataset demonstrate the effectiveness of the framework and highlight its viability as an alert mechanism for high-stakes machine learning pipelines.
comment: 44 pages, 9 figures, 12 tables. Accepted at NeurIPS 2025. Code available at https://github.com/teivng/d3m
♻ ☆ Towards efficient quantum algorithms for diffusion probabilistic models
A diffusion probabilistic model (DPM) is a generative model renowned for its ability to produce high-quality outputs in tasks such as image and audio generation. However, training DPMs on large, high-dimensional datasets such as high-resolution images or audio incurs significant computational, energy, and hardware costs. In this work, we introduce efficient quantum algorithms for implementing DPMs through various quantum ODE solvers. These algorithms highlight the potential of quantum Carleman linearization for diverse mathematical structures, leveraging state-of-the-art quantum linear system solvers (QLSS) or linear combination of Hamiltonian simulations (LCHS). Specifically, we focus on two approaches: DPM-solver-$k$ which employs exact $k$-th order derivatives to compute a polynomial approximation of $\epsilon_\theta(x_\lambda,\lambda)$; and UniPC which uses finite difference of $\epsilon_\theta(x_\lambda,\lambda)$ at different points $(x_{s_m}, \lambda_{s_m})$ to approximate higher-order derivatives. As such, this work represents one of the most direct and pragmatic applications of quantum algorithms to large-scale machine learning models, presumably taking substantial steps towards demonstrating the practical utility of quantum computing.
comment: 7+18 pages, small changes
♻ ☆ Personalized Interpolation: Achieving Efficient Conversion Estimation with Flexible Optimization Windows
Optimizing conversions is crucial in modern online advertising systems, enabling advertisers to deliver relevant products to users and drive business outcomes. However, accurately predicting conversion events remains challenging due to variable time delays between user interactions (e.g., impressions or clicks) and the actual conversions. These delays vary substantially across advertisers and products, necessitating flexible optimization windows tailored to specific conversion behaviors. To address this, we propose a novel \textit{Personalized Interpolation} method that extends existing models based on fixed conversion windows to support flexible advertiser-specific optimization windows. Our method enables accurate conversion estimation across diverse delay distributions without increasing system complexity. We evaluate the effectiveness of the proposed approach through extensive experiments using a real-world ads conversion model. Our results show that this method achieves both high prediction accuracy and improved efficiency compared to existing solutions. This study demonstrates the potential of our Personalized Interpolation method to improve conversion optimization and support a wider range of advertising strategies in large-scale online advertising systems.
♻ ☆ Network Anomaly Traffic Detection via Multi-view Feature Fusion
Traditional anomalous traffic detection methods are based on single-view analysis, which has obvious limitations in dealing with complex attacks and encrypted communications. In this regard, we propose a Multi-view Feature Fusion (MuFF) method for network anomaly traffic detection. MuFF models the temporal and interactive relationships of packets in network traffic based on the temporal and interactive viewpoints respectively. It learns temporal and interactive features. These features are then fused from different perspectives for anomaly traffic detection. Extensive experiments on six real traffic datasets show that MuFF has excellent performance in network anomalous traffic detection, which makes up for the shortcomings of detection under a single perspective.
comment: in Chinese language, Accepted by Journal of Command and Control. https://www.jc2.org.cn/CN/10.20278/j.jc2.2096-0204.2024.0196
♻ ☆ Surrogate modeling of Cellular-Potts Agent-Based Models as a segmentation task using the U-Net neural network architecture
The Cellular-Potts model is a powerful and ubiquitous framework for developing computational models for simulating complex multicellular biological systems. Cellular-Potts models (CPMs) are often computationally expensive due to the explicit modeling of interactions among large numbers of individual model agents and diffusive fields described by partial differential equations (PDEs). In this work, we develop a convolutional neural network (CNN) surrogate model using a U-Net architecture that accounts for periodic boundary conditions. We use this model to accelerate the evaluation of a mechanistic CPM previously used to investigate in vitro vasculogenesis. The surrogate model was trained to predict 100 computational steps ahead (Monte-Carlo steps, MCS), accelerating simulation evaluations by a factor of 590 times compared to CPM code execution. Over multiple recursive evaluations, our model effectively captures the emergent behaviors demonstrated by the original Cellular-Potts model of such as vessel sprouting, extension and anastomosis, and contraction of vascular lacunae. This approach demonstrates the potential for deep learning to serve as efficient surrogate models for CPM simulations, enabling faster evaluation of computationally expensive CPM of biological processes at greater spatial and temporal scales.
♻ ☆ Continuous-time Riemannian SGD and SVRG Flows on Wasserstein Probabilistic Space
Recently, optimization on the Riemannian manifold have provided valuable insights to the optimization community. In this regard, extending these methods to to the Wasserstein space is of particular interest, since optimization on Wasserstein space is closely connected to practical sampling processes. Generally, the standard (continuous) optimization method on Wasserstein space is Riemannian gradient flow (i.e., Langevin dynamics when minimizing KL divergence). In this paper, we aim to enrich the family of continuous optimization methods in the Wasserstein space, by extending the gradient flow on it into the stochastic gradient descent (SGD) flow and stochastic variance reduction gradient (SVRG) flow. By leveraging the property of Wasserstein space, we construct stochastic differential equations (SDEs) to approximate the corresponding discrete Euclidean dynamics of the desired Riemannian stochastic methods. Then, we obtain the flows in Wasserstein space by Fokker-Planck equation. Finally, we establish convergence rates of the proposed stochastic flows, which align with those known in the Euclidean setting.
♻ ☆ RASPNet: A Benchmark Dataset for Radar Adaptive Signal Processing Applications
We present a large-scale dataset called RASPNet for radar adaptive signal processing (RASP) applications to support the development of data-driven models within the adaptive radar community. RASPNet exceeds 16 TB in size and comprises 100 realistic scenarios compiled over a variety of topographies and land types across the contiguous United States. For each scenario, RASPNet comprises 10,000 clutter realizations from an airborne radar setting, which can be used to benchmark radar and complex-valued learning algorithms. RASPNet intends to fill a prominent gap in the availability of a large-scale, realistic dataset that standardizes the evaluation of RASP techniques and complex-valued neural networks. We outline its construction, organization, and several applications, including a transfer learning example to demonstrate how RASPNet can be used for real-world adaptive radar scenarios.
♻ ☆ Testing with Non-identically Distributed Samples
We examine the extent to which sublinear-sample property testing and estimation apply to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $p_1, p_2,\ldots,p_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $p_{avg}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities -- either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $p_{avg}$ to within error $\varepsilon$ in $\ell_1$ distance. To test uniformity or identity -- distinguishing the case that $p_{avg}$ is equal to some reference distribution, versus has $\ell_1$ distance at least $\varepsilon$ from the reference distribution, we show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear sample testing guarantees of the i.i.d.\ setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ total samples are sufficient, matching the optimal sample complexity in the i.i.d.\ case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $\rho > 0$ such that even in the linear regime with $\rho k$ samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same $p_i$) can perform uniformity testing. We also extend our techniques to the problem of testing "closeness" of two distributions.
♻ ☆ Reset & Distill: A Recipe for Overcoming Negative Transfer in Continual Reinforcement Learning
We argue that the negative transfer problem occurring when the new task to learn arrives is an important problem that needs not be overlooked when developing effective Continual Reinforcement Learning (CRL) algorithms. Through comprehensive experimental validation, we demonstrate that such issue frequently exists in CRL and cannot be effectively addressed by several recent work on either mitigating plasticity loss of RL agents or enhancing the positive transfer in CRL scenario. To that end, we develop Reset & Distill (R&D), a simple yet highly effective baseline method, to overcome the negative transfer problem in CRL. R&D combines a strategy of resetting the agent's online actor and critic networks to learn a new task and an offline learning step for distilling the knowledge from the online actor and previous expert's action probabilities. We carried out extensive experiments on long sequence of Meta World tasks and show that our simple baseline method consistently outperforms recent approaches, achieving significantly higher success rates across a range of tasks. Our findings highlight the importance of considering negative transfer in CRL and emphasize the need for robust strategies like R&D to mitigate its detrimental effects.
♻ ☆ Detection and Geographic Localization of Natural Objects in the Wild: A Case Study on Palms
Palms are ecologically and economically indicators of tropical forest health, biodiversity, and human impact that support local economies and global forest product supply chains. While palm detection in plantations is well-studied, efforts to map naturally occurring palms in dense forests remain limited by overlapping crowns, uneven shading, and heterogeneous landscapes. We develop PRISM (Processing, Inference, Segmentation, and Mapping), a flexible pipeline for detecting and localizing palms in dense tropical forests using large orthomosaic images. Orthomosaics are created from thousands of aerial images and spanning several to hundreds of gigabytes. Our contributions are threefold. First, we construct a large UAV-derived orthomosaic dataset collected across 21 ecologically diverse sites in western Ecuador, annotated with 8,830 bounding boxes and 5,026 palm center points. Second, we evaluate multiple state-of-the-art object detectors based on efficiency and performance, integrating zero-shot SAM 2 as the segmentation backbone, and refining the results for precise geographic mapping. Third, we apply calibration methods to align confidence scores with IoU and explore saliency maps for feature explainability. Though optimized for palms, PRISM is adaptable for identifying other natural objects, such as eastern white pines. Future work will explore transfer learning for lower-resolution datasets (0.5 to 1m).
comment: 15 pages, 8 figures, 4 tables
♻ ☆ ARC-GEN: A Mimetic Procedural Benchmark Generator for the Abstraction and Reasoning Corpus
The Abstraction and Reasoning Corpus remains one of the most compelling and challenging benchmarks for tracking progress toward achieving Artificial General Intelligence. In contrast to other evaluation datasets designed to assess an agent's task-specific skills or accumulated knowledge, the ARC-AGI suite is specifically targeted at measuring skill acquisition efficiency, a trait that has (so far) been lacking in even the most sophisticated machine learning systems. For algorithms that require extensive intra-task exemplars, a significant constraint imposed by ARC-AGI is the modest cardinality of its demonstration set, comprising a small number of $\langle$ input, output $\rangle$ grids per task specifying the corresponding transformation. To embellish the space of viable sample pairs, this paper introduces ARC-GEN, an open-source procedural generator aimed at extending the original ARC-AGI training dataset as faithfully as possible. Unlike prior efforts, our generator is both exhaustive (covering all four-hundred tasks) and mimetic (more closely honoring the distributional properties and characteristics embodied in the initial ARC-AGI-1 release). We also discuss the use of this generator in establishing a static benchmark suite to verify the correctness of programs submitted to the 2025 Google Code Golf Championship.
♻ ☆ Exploring Human-AI Conceptual Alignment through the Prism of Chess
Do AI systems truly understand human concepts or merely mimic surface patterns? We investigate this through chess, where human creativity meets precise strategic concepts. Analyzing a 270M-parameter transformer that achieves grandmaster-level play, we uncover a striking paradox: while early layers encode human concepts like center control and knight outposts with up to 85\% accuracy, deeper layers, despite driving superior performance, drift toward alien representations, dropping to 50-65\% accuracy. To test conceptual robustness beyond memorization, we introduce the first Chess960 dataset: 240 expert-annotated positions across 6 strategic concepts. When opening theory is eliminated through randomized starting positions, concept recognition drops 10-20\% across all methods, revealing the model's reliance on memorized patterns rather than abstract understanding. Our layer-wise analysis exposes a fundamental tension in current architectures: the representations that win games diverge from those that align with human thinking. These findings suggest that as AI systems optimize for performance, they develop increasingly alien intelligence, a critical challenge for creative AI applications requiring genuine human-AI collaboration. Dataset and code are available at: https://github.com/slomasov/ChessConceptsLLM.
♻ ☆ Unsupervised Evolutionary Cell Type Matching via Entropy-Minimized Optimal Transport
Identifying evolutionary correspondences between cell types across species is a fundamental challenge in comparative genomics and evolutionary biology. Existing approaches often rely on either reference-based matching, which imposes asymmetry by designating one species as the reference, or projection-based matching, which may increase computational complexity and obscure biological interpretability at the cell-type level. Here, we present OT-MESH, an unsupervised computational framework leveraging entropy-regularized optimal transport (OT) to systematically determine cross-species cell type homologies. Our method uniquely integrates the Minimize Entropy of Sinkhorn (MESH) technique to refine the OT plan, transforming diffuse transport matrices into sparse, interpretable correspondences. Through systematic evaluation on synthetic datasets, we demonstrate that OT-MESH achieves near-optimal matching accuracy with computational efficiency, while maintaining remarkable robustness to noise. Compared to other OT-based methods like RefCM, OT-MESH provides speedup while achieving comparable accuracy. Applied to retinal bipolar cells (BCs) and retinal ganglion cells (RGCs) from mouse and macaque, OT-MESH accurately recovers known evolutionary relationships and uncovers novel correspondences, one of which was independently validated experimentally. Thus, our framework offers a principled, scalable, and interpretable solution for evolutionary cell type mapping, facilitating deeper insights into cellular specialization and conservation across species.
♻ ☆ Dynamic Routing Between Experts: A Data-Efficient Approach to Continual Learning in Vision-Language Models
Vision-Language Models (VLMs) suffer from catastrophic forgetting when sequentially fine-tuned on new tasks, degrading performance on previously learned foundational and task-specific capabilities. While multi-task learning can mitigate forgetting, it requires simultaneous access to all datasets and imposes computational overhead that scales linearly with the number of tasks. In this work, we introduce a routing-based approach that enables the integration of new tasks while preserving the foundational knowledge acquired during pretraining. We evaluate our method using InternVL-2 models (2B and 8B parameters) and demonstrate that routing preserves the model's foundational capabilities by maintaining performance on general-purpose benchmarks such as ChartQA, MMBench, and DocVQA, while simultaneously improving accuracy on specialized tasks. Importantly, our approach achieves this without requiring concurrent access to data from all tasks, avoiding the significant computational and data overhead associated with traditional multi-task learning. We further conduct extensive ablation studies to evaluate the scalability and robustness of routing-based learning, showing that the approach is resilient to a growing number of tasks and performs particularly well when new tasks are semantically related. Finally, we show that the routing mechanism enables superior cross-modal transfer between language and vision capabilities, allowing knowledge learned in one modality to enhance performance in another capability not achieved by existing continual learning methods.
♻ ☆ Inference-Time Chain-of-Thought Pruning with Latent Informativeness Signals NeurIPS 2025
Large language models (LLMs) improve reasoning accuracy when generating multiple candidate solutions at test time, but standard methods like Best-of-N (BoN) incur high computational cost by fully generating all branches. Self-Truncation Best-of-N (ST-BoN) mitigates this by truncating unpromising paths early, but its reliance on consistency-based heuristics is a limitation as it does not directly evaluate branch quality. We present KL-Adjusted Pruned Path Algorithm (KAPPA), an inference-time method that combines Kullback-Leibler divergence, confidence, and entropy into a principled scoring function to guide progressive pruning. By promoting diversity during exploration and selectively eliminating low-scoring branches, KAPPA maintains accuracy while substantially reducing memory and token usage. Experiments on GSM8K and MATH500 with DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-7B-Instruct demonstrate that KAPPA stabilizes performance in smaller models and achieves up to ~60% reduction in peak memory and ~90% reduction in total token generation relative to BoN, with minimal impact on accuracy.
comment: Accepted by NeurIPS 2025 Workshop on Efficient Reasoning
♻ ☆ How Effective Are Time-Series Models for Precipitation Nowcasting? A Comprehensive Benchmark for GNSS-based Precipitation Nowcasting
Precipitation Nowcasting, which aims to predict precipitation within the next 0 to 6 hours, is critical for disaster mitigation and real-time response planning. However, most time series forecasting benchmarks in meteorology are evaluated on variables with strong periodicity, such as temperature and humidity, which fail to reflect model capabilities in more complex and practically meteorology scenarios like precipitation nowcasting. To address this gap, we propose RainfallBench, a benchmark designed for precipitation nowcasting, a highly challenging and practically relevant task characterized by zero inflation, temporal decay, and non-stationarity, focusing on predicting precipitation within the next 0 to 6 hours. The dataset is derived from five years of meteorological observations, recorded at hourly intervals across six essential variables, and collected from more than 140 Global Navigation Satellite System (GNSS) stations globally. In particular, it incorporates precipitable water vapor (PWV), a crucial indicator of rainfall that is absent in other datasets. We further design specialized evaluation protocols to assess model performance on key meteorological challenges, including multi-scale prediction, multi-resolution forecasting, and extreme rainfall events, benchmarking 17 state-of-the-art models across six major architectures on RainfallBench. Additionally, to address the zero-inflation and temporal decay issues overlooked by existing models, we introduce Bi-Focus Precipitation Forecaster (BFPF), a plug-and-play module that incorporates domain-specific priors to enhance rainfall time series forecasting. Statistical analysis and ablation studies validate the comprehensiveness of our dataset as well as the superiority of our methodology.
comment: 13 pages,11 figures
♻ ☆ Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework
Bimanual robotic manipulation is an emerging and critical topic in the robotics community. Previous works primarily rely on integrated control models that take the perceptions and states of both arms as inputs to directly predict their actions. However, we think bimanual manipulation involves not only coordinated tasks but also various uncoordinated tasks that do not require explicit cooperation during execution, such as grasping objects with the closest hand, which integrated control frameworks ignore to consider due to their enforced cooperation in the early inputs. In this paper, we propose a novel decoupled interaction framework that considers the characteristics of different tasks in bimanual manipulation. The key insight of our framework is to assign an independent model to each arm to enhance the learning of uncoordinated tasks, while introducing a selective interaction module that adaptively learns weights from its own arm to improve the learning of coordinated tasks. Extensive experiments on seven tasks in the RoboTwin dataset demonstrate that: (1) Our framework achieves outstanding performance, with a 23.5% boost over the SOTA method. (2) Our framework is flexible and can be seamlessly integrated into existing methods. (3) Our framework can be effectively extended to multi-agent manipulation tasks, achieving a 28% boost over the integrated control SOTA. (4) The performance boost stems from the decoupled design itself, surpassing the SOTA by 16.5% in success rate with only 1/6 of the model size.
comment: 15 pages, 8 figures
♻ ☆ Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding
The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method achieves up to 28.0% and an average of 23.7% reduction in end-to-end latency without compromising inference accuracy, enabling scalable and low-latency LLM services in MEC systems.
♻ ☆ Emergence and scaling laws in SGD learning of shallow neural networks NeurIPS 2025
We study the complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $P$ neurons on isotropic Gaussian data: $f_*(\boldsymbol{x}) = \sum_{p=1}^P a_p\cdot \sigma(\langle\boldsymbol{x},\boldsymbol{v}_p^*\rangle)$, $\boldsymbol{x} \sim \mathcal{N}(0,\boldsymbol{I}_d)$, where the activation $\sigma:\mathbb{R}\to\mathbb{R}$ is an even function with information exponent $k_*>2$ (defined as the lowest degree in the Hermite expansion), $\{\boldsymbol{v}^*_p\}_{p\in[P]}\subset \mathbb{R}^d$ are orthonormal signal directions, and the non-negative second-layer coefficients satisfy $\sum_{p} a_p^2=1$. We focus on the challenging ``extensive-width'' regime $P\gg 1$ and permit diverging condition number in the second-layer, covering as a special case the power-law scaling $a_p\asymp p^{-\beta}$ where $\beta\in\mathbb{R}_{\ge 0}$. We provide a precise analysis of SGD dynamics for the training of a student two-layer network to minimize the mean squared error (MSE) objective, and explicitly identify sharp transition times to recover each signal direction. In the power-law setting, we characterize scaling law exponents for the MSE loss with respect to the number of training samples and SGD steps, as well as the number of parameters in the student neural network. Our analysis entails that while the learning of individual teacher neurons exhibits abrupt transitions, the juxtaposition of $P\gg 1$ emergent learning curves at different timescales leads to a smooth scaling law in the cumulative objective.
comment: NeurIPS 2025
♻ ☆ Energy-Based Model for Accurate Estimation of Shapley Values in Feature Attribution
Shapley value is a widely used tool in explainable artificial intelligence (XAI), as it provides a principled way to attribute contributions of input features to model outputs. However, estimation of Shapley value requires capturing conditional dependencies among all feature combinations, which poses significant challenges in complex data environments. In this article, EmSHAP (Energy-based model for Shapley value estimation), an accurate Shapley value estimation method, is proposed to estimate the expectation of Shapley contribution function under the arbitrary subset of features given the rest. By utilizing the ability of energy-based model (EBM) to model complex distributions, EmSHAP provides an effective solution for estimating the required conditional probabilities. To further improve estimation accuracy, a GRU (Gated Recurrent Unit)-coupled partition function estimation method is introduced. The GRU network captures long-term dependencies with a lightweight parameterization and maps input features into a latent space to mitigate the influence of feature ordering. Additionally, a dynamic masking mechanism is incorporated to further enhance the robustness and accuracy by progressively increasing the masking rate. Theoretical analysis on the error bound as well as application to four case studies verified the higher accuracy and better scalability of EmSHAP in contrast to competitive methods.
♻ ☆ Learning with Category-Equivariant Architectures for Human Activity Recognition
We propose CatEquiv, a category-equivariant neural network for Human Activity Recognition (HAR) from inertial sensors that systematically encodes temporal, amplitude, and structural symmetries. We introduce a symmetry category that jointly represents cyclic time shifts, positive gain scalings, and the sensor-hierarchy poset, capturing the categorical symmetry structure of the data. CatEquiv achieves equivariance with respect to the categorical symmetry product. On UCI-HAR under out-of-distribution perturbations, CatEquiv attains markedly higher robustness compared with circularly padded CNNs and plain CNNs. These results demonstrate that enforcing categorical symmetries yields strong invariance and generalization without additional model capacity.
♻ ☆ End-to-End Probabilistic Framework for Learning with Hard Constraints
We present ProbHardE2E, a probabilistic forecasting framework that incorporates hard operational/physical constraints, and provides uncertainty quantification. Our methodology uses a novel differentiable probabilistic projection layer (DPPL) that can be combined with a wide range of neural network architectures. DPPL allows the model to learn the system in an end-to-end manner, compared to other approaches where constraints are satisfied either through a post-processing step or at inference. ProbHardE2E optimizes a strictly proper scoring rule, without making any distributional assumptions on the target, which enables it to obtain robust distributional estimates (in contrast to existing approaches that generally optimize likelihood-based objectives, which are heavily biased by their distributional assumptions and model choices); and it can incorporate a range of non-linear constraints (increasing the power of modeling and flexibility). We apply ProbHardE2E in learning partial differential equations with uncertainty estimates and to probabilistic time-series forecasting, showcasing it as a broadly applicable general framework that connects these seemingly disparate domains.
comment: 45 pages, 5 figures, 10 tables
♻ ☆ Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning
Offline goal-conditioned reinforcement learning (GCRL) offers a practical learning paradigm in which goal-reaching policies are trained from abundant state-action trajectory datasets without additional environment interaction. However, offline GCRL still struggles with long-horizon tasks, even with recent advances that employ hierarchical policy structures, such as HIQL. Identifying the root cause of this challenge, we observe the following insight. Firstly, performance bottlenecks mainly stem from the high-level policy's inability to generate appropriate subgoals. Secondly, when learning the high-level policy in the long-horizon regime, the sign of the advantage estimate frequently becomes incorrect. Thus, we argue that improving the value function to produce a clear advantage estimate for learning the high-level policy is essential. In this paper, we propose a simple yet effective solution: Option-aware Temporally Abstracted value learning, dubbed OTA, which incorporates temporal abstraction into the temporal-difference learning process. By modifying the value update to be option-aware, our approach contracts the effective horizon length, enabling better advantage estimates even in long-horizon regimes. We experimentally show that the high-level policy learned using the OTA value function achieves strong performance on complex tasks from OGBench, a recently proposed offline GCRL benchmark, including maze navigation and visual robotic manipulation environments.
♻ ☆ Perturbing the Derivative: Wild Refitting for Model-Free Evaluation of Machine Learning Models under Bregman Losses
We study the excess risk evaluation of classical penalized empirical risk minimization (ERM) with Bregman losses. We show that by leveraging the idea of wild refitting, one can efficiently upper bound the excess risk through the so-called "wild optimism," without relying on the global structure of the underlying function class. This property makes our approach inherently model-free. Unlike conventional analysis, our framework operates with just one dataset and black-box access to the training procedure. The method involves randomized Rademacher symmetrization and constructing artificially modified outputs by perturbation in the derivative space with appropriate scaling, upon which we retrain a second predictor for excess risk estimation. We establish high-probability performance guarantees both under the fixed design setting and the random design setting, demonstrating that wild refitting under Bregman losses, with an appropriately chosen wild noise scale, yields a valid upper bound on the excess risk. Thus, our work is promising for theoretically evaluating modern opaque ML models, such as deep neural networks and generative models, where the function class is too complex for classical learning theory and empirical process techniques.
♻ ☆ Remasking Discrete Diffusion Models with Inference-Time Scaling NeurIPS 2025
Part of the success of diffusion models stems from their ability to perform iterative refinement, i.e., repeatedly correcting outputs during generation. However, modern masked discrete diffusion lacks this capability: when a token is generated, it cannot be updated again, even when it introduces an error. Here, we address this limitation by introducing the remasking diffusion model (ReMDM) sampler, a method that can be applied to pretrained masked diffusion models in a principled way and that is derived from a discrete diffusion model with a custom remasking backward process. Most interestingly, ReMDM endows discrete diffusion with a form of inference-time compute scaling. By increasing the number of sampling steps, ReMDM generates natural language outputs that approach the quality of autoregressive models, whereas when the computation budget is limited, ReMDM better maintains quality. ReMDM also improves sample quality of masked diffusion models for discretized images, and in scientific domains such as molecule design, ReMDM facilitates diffusion guidance and pushes the Pareto frontier of controllability relative to classical masking and uniform noise diffusion. We provide the code along with a blog post on the project page: https://remdm.github.io
comment: NeurIPS 2025. Project page: https://remdm.github.io
♻ ☆ DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning NeurIPS 2025
Reasoning has substantially improved the performance of large language models (LLMs) on complicated tasks. Central to the current reasoning studies, Process Reward Models (PRMs) offer a fine-grained evaluation of intermediate reasoning steps and guide the reasoning process. However, extending PRMs to multimodal large language models (MLLMs) introduces challenges. Since multimodal reasoning covers a wider range of tasks compared to text-only scenarios, the resulting distribution shift from the training to testing sets is more severe, leading to greater generalization difficulty. Training a reliable multimodal PRM, therefore, demands large and diverse datasets to ensure sufficient coverage. However, current multimodal reasoning datasets suffer from a marked quality imbalance, which degrades PRM performance and highlights the need for an effective data selection strategy. To address the issues, we introduce DreamPRM, a domain-reweighted training framework for multimodal PRMs which employs bi-level optimization. In the lower-level optimization, DreamPRM performs fine-tuning on multiple datasets with domain weights, allowing the PRM to prioritize high-quality reasoning signals and alleviating the impact of dataset quality imbalance. In the upper-level optimization, the PRM is evaluated on a separate meta-learning dataset; this feedback updates the domain weights through an aggregation loss function, thereby improving the generalization capability of trained PRM. Extensive experiments on multiple multimodal reasoning benchmarks covering both mathematical and general reasoning show that test-time scaling with DreamPRM consistently improves the performance of state-of-the-art MLLMs. Further comparisons reveal that DreamPRM's domain-reweighting strategy surpasses other data selection methods and yields higher accuracy gains than existing test-time scaling approaches.
comment: 28 pages, 10 figures, to appear in NeurIPS 2025 (Conference on Neural Information Processing Systems)
♻ ☆ Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
BERT and its variants are extensively explored for automated scoring. However, a limit of 512 tokens for these encoder-based models showed the deficiency in automated scoring of long essays. Thus, this research explores generative language models for automated scoring of long essays via summarization and prompting. The results revealed great improvement of scoring accuracy with QWK increased from 0.822 to 0.8878 for the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
comment: 19 pages, 5 Tables 7 Figures, Presentation at Artificial Intelligence in Measurement and Education Conference (AIME-Con)
♻ ☆ Generalization Error Analysis for Selective State-Space Models Through the Lens of Attention NeurIPS 2025
State-space models (SSMs) have recently emerged as a compelling alternative to Transformers for sequence modeling tasks. This paper presents a theoretical generalization analysis of selective SSMs, the core architectural component behind the Mamba model. We derive a novel covering number-based generalization bound for selective SSMs, building upon recent theoretical advances in the analysis of Transformer models. Using this result, we analyze how the spectral abscissa of the continuous-time state matrix influences the model's stability during training and its ability to generalize across sequence lengths. We empirically validate our findings on a synthetic majority task, the IMDb sentiment classification benchmark, and the ListOps task, demonstrating how our theoretical insights translate into practical model behavior.
comment: Accepted at NeurIPS 2025
♻ ☆ Weakly Supervised Object Segmentation by Background Conditional Divergence
As a computer vision task, automatic object segmentation remains challenging in specialized image domains without massive labeled data, such as synthetic aperture sonar images, remote sensing, biomedical imaging, etc. In any domain, obtaining pixel-wise segmentation masks is expensive. In this work, we propose a method for training a masking network to perform binary object segmentation using weak supervision in the form of image-wise presence or absence of an object of interest, which provides less information but may be obtained more quickly from manual or automatic labeling. A key step in our method is that the segmented objects can be placed into background-only images to create realistic images of the objects with counterfactual backgrounds. To create a contrast between the original and counterfactual background images, we propose to first cluster the background-only images and then, during learning, create counterfactual images that blend objects segmented from their original source backgrounds to backgrounds chosen from a targeted cluster. One term in the training loss is the divergence between these counterfactual images and the real object images with backgrounds of the target cluster. The other term is a supervised loss for background-only images. While an adversarial critic could provide the divergence, we use sample-based divergences. We conduct experiments on side-scan and synthetic aperture sonar in which our approach succeeds compared to previous unsupervised segmentation baselines that were only tested on natural images. Furthermore, to show generality we extend our experiments to natural images, obtaining reasonable performance with our method that avoids pretrained networks, generative networks, and adversarial critics. The code for this work can be found at \href{GitHub}{https://github.com/bakerhassan/WSOS}.
comment: Published in TMLR: https://openreview.net/forum?id=2JJZhfGvMW
♻ ☆ Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification NeurIPS 2025
Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59-without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.
comment: Published in NeurIPS 2025
♻ ☆ AGNES: Adaptive Graph Neural Network and Dynamic Programming Hybrid Framework for Real-Time Nanopore Seed Chaining
Nanopore sequencing enables real-time long-read DNA sequencing with reads exceeding 10 kilobases, but inherent error rates of 12-15 percent present significant computational challenges for read alignment. The critical seed chaining step must connect exact k-mer matches between reads and reference genomes while filtering spurious matches, yet state-of-the-art methods rely on fixed gap penalty functions unable to adapt to varying genomic contexts including tandem repeats and structural variants. This paper presents RawHash3, a hybrid framework combining graph neural networks with classical dynamic programming for adaptive seed chaining that maintains real-time performance while providing statistical guarantees. We formalize seed chaining as graph learning where seeds constitute nodes with 12-dimensional feature vectors and edges encode 8-dimensional spatial relationships including gap consistency. Our architecture employs three-layer EdgeConv GNN with confidence-based method selection that dynamically switches between learned guidance and algorithmic fallback. Comprehensive evaluation on 1,000 synthetic nanopore reads with 5,200 test seeds demonstrates RawHash3 achieves 99.94 percent precision and 40.07 percent recall, representing statistically significant 25.0 percent relative improvement over baseline with p less than 0.001. The system maintains median inference latency of 1.59ms meeting real-time constraints, while demonstrating superior robustness with 100 percent success rate under 20 percent label corruption versus baseline degradation to 30.3 percent. Cross-validation confirms stability establishing graph neural networks as viable approach for production genomics pipelines.
comment: 31 pages, 12 figures, 6 tables. Submitted to ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB). Includes comprehensive evaluation with statistical validation, ablation studies, and open-source implementation
Graphics 4
☆ TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.
comment: 13 pages, 8 figures, 3 tables. The first two authors contributed equally. Project Page: https://iyatomilab.github.io/TAUE
☆ OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis and relighting have shown promising results, most techniques still heavily rely on the synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. We believe that OLATverse represents a pivotal step toward integrating the next generation of inverse rendering and relighting methods with real-world data. The full dataset, along with all post-processing workflows, will be publicly released at https://vcai.mpi-inf.mpg.de/projects/OLATverse/.
♻ ☆ GS-Verse: Mesh-based Gaussian Splatting for Physics-aware Interaction in Virtual Reality
As the demand for immersive 3D content grows, the need for intuitive and efficient interaction methods becomes paramount. Current techniques for physically manipulating 3D content within Virtual Reality (VR) often face significant limitations, including reliance on engineering-intensive processes and simplified geometric representations, such as tetrahedral cages, which can compromise visual fidelity and physical accuracy. In this paper, we introduce GS-Verse (Gaussian Splatting for Virtual Environment Rendering and Scene Editing), a novel method designed to overcome these challenges by directly integrating an object's mesh with a Gaussian Splatting (GS) representation. Our approach enables more precise surface approximation, leading to highly realistic deformations and interactions. By leveraging existing 3D mesh assets, GS-Verse facilitates seamless content reuse and simplifies the development workflow. Moreover, our system is designed to be physics-engine-agnostic, granting developers robust deployment flexibility. This versatile architecture delivers a highly realistic, adaptable, and intuitive approach to interactive 3D manipulation. We rigorously validate our method against the current state-of-the-art technique that couples VR with GS in a comparative user study involving 18 participants. Specifically, we demonstrate that our approach is statistically significantly better for physics-aware stretching manipulation and is also more consistent in other physics-based manipulations like twisting and shaking. Further evaluation across various interactions and scenes confirms that our method consistently delivers high and reliable performance, showing its potential as a plausible alternative to existing methods.
♻ ☆ Adv-BMT: Bidirectional Motion Transformer for Safety-Critical Traffic Scenario Generation
Scenario-based testing is essential for validating the performance of autonomous driving (AD) systems. However, such testing is limited by the scarcity of long-tailed, safety-critical scenarios in existing datasets collected in the real world. To tackle the data issue, we propose the Adv-BMT framework, which augments real-world scenarios with diverse and realistic adversarial traffic interactions. The core component of Adv-BMT is a bidirectional motion transformer (BMT) model to perform inverse traffic motion predictions, which takes agent information in the last time step of the scenario as input, and reconstructs the traffic in the inverse of chronological order until the initial time step. The Adv-BMT framework is a two-staged pipeline: it first conducts adversarial initializations and then inverse motion predictions. Different from previous work, we do not need any collision data for pretraining, and are able to generate realistic and diverse collision interactions. Our experimental results validate the quality of generated collision scenarios by Adv-BMT: training in our augmented dataset would reduce episode collision rates by 20%. Demo and code are available at: https://metadriverse.github.io/adv-bmt/.
Robotics 66
☆ TACO: Trajectory-Aware Controller Optimization for Quadrotors ICRA 2026
Controller performance in quadrotor trajectory tracking depends heavily on parameter tuning, yet standard approaches often rely on fixed, manually tuned parameters that sacrifice task-specific performance. We present Trajectory-Aware Controller Optimization (TACO), a framework that adapts controller parameters online based on the upcoming reference trajectory and current quadrotor state. TACO employs a learned predictive model and a lightweight optimization scheme to optimize controller gains in real time with respect to a broad class of trajectories, and can also be used to adapt trajectories to improve dynamic feasibility while respecting smoothness constraints. To enable large-scale training, we also introduce a parallelized quadrotor simulator supporting fast data collection on diverse trajectories. Experiments on a variety of trajectory types show that TACO outperforms conventional, static parameter tuning while operating orders of magnitude faster than black-box optimization baselines, enabling practical real-time deployment on a physical quadrotor. Furthermore, we show that adapting trajectories using TACO significantly reduces the tracking error obtained by the quadrotor.
comment: 8 pages, 6 figures. In submission to ICRA 2026
☆ TurboMap: GPU-Accelerated Local Mapping for Visual SLAM ICRA 2026
This paper presents TurboMap, a GPU-accelerated and CPU-optimized local mapping module for visual SLAM systems. We identify key performance bottlenecks in the local mapping process for visual SLAM and address them through targeted GPU and CPU optimizations. Specifically, we offload map point triangulation and fusion to the GPU, accelerate redundant keyframe culling on the CPU, and integrate a GPU-accelerated solver to speed up local bundle adjustment. Our implementation is built on top of ORB-SLAM3 and leverages CUDA for GPU programming. The experimental results show that TurboMap achieves an average speedup of 1.3x in the EuRoC dataset and 1.6x in the TUM-VI dataset in the local mapping module, on both desktop and embedded platforms, while maintaining the accuracy of the original system.
comment: Submitted to ICRA 2026
☆ Path-Coordinated Continual Learning with Neural Tangent Kernel-Justified Plasticity: A Theoretical Framework with Near State-of-the-Art Performance
Catastrophic forgetting is one of the fundamental issues of continual learning because neural networks forget the tasks learned previously when trained on new tasks. The proposed framework is a new path-coordinated framework of continual learning that unites the Neural Tangent Kernel (NTK) theory of principled plasticity bounds, statistical validation by Wilson confidence intervals, and evaluation of path quality by the use of multiple metrics. Experimental evaluation shows an average accuracy of 66.7% at the cost of 23.4% catastrophic forgetting on Split-CIFAR10, a huge improvement over the baseline and competitive performance achieved, which is very close to state-of-the-art results. Further, it is found out that NTK condition numbers are predictive indicators of learning capacity limits, showing the existence of a critical threshold at condition number $>10^{11}$. It is interesting to note that the proposed strategy shows a tendency of lowering forgetting as the sequence of tasks progresses (27% to 18%), which is a system stabilization. The framework validates 80% of discovered paths with a rigorous statistical guarantee and maintains 90-97% retention on intermediate tasks. The core capacity limits of the continual learning environment are determined in the analysis, and actionable insights to enhance the adaptive regularization are offered.
comment: Under review, IEEE Letters
☆ Stein-based Optimization of Sampling Distributions in Model Predictive Path Integral Control
This paper presents a novel method for Model Predictive Path Integral (MPPI) control that optimizes sample generation towards an optimal trajectory through Stein Variational Gradient Descent (SVGD). MPPI is traditionally reliant on randomly sampled trajectories, often by a Gaussian distribution. The result can lead to sample deprivation, under-representing the space of possible trajectories, and yield suboptimal results. Through introducing SVGD updates in between MPPI environment steps, we present Stein-Optimized Path-Integral Inference (SOPPI), an MPPI/SVGD algorithm that can dynamically update noise distributions at runtime to shape a more optimal representation without an excessive increase in computational requirements. We demonstrate the efficacy of our method systems ranging from a Cart-Pole to a two-dimensional bipedal walking task, indicating improved performance above standard MPPI across a range of hyper-parameters and demonstrate feasibility at lower particle counts. We discuss the applicability of this MPPI/SVGD method to higher degree-of-freedom systems, as well as its potential to new developments in state-of-the-art differentiable simulators.
comment: 8 pages, 6 figures
☆ TRACE: Textual Reasoning for Affordance Coordinate Extraction ICCV 2025
Vision-Language Models (VLMs) struggle to translate high-level instructions into the precise spatial affordances required for robotic manipulation. While visual Chain-of-Thought (CoT) methods exist, they are often computationally intensive. In this work, we introduce TRACE (Textual Reasoning for Affordance Coordinate Extraction), a novel methodology that integrates a textual Chain of Reasoning (CoR) into the affordance prediction process. We use this methodology to create the TRACE dataset, a large-scale collection created via an autonomous pipeline that pairs instructions with explicit textual rationales. By fine-tuning a VLM on this data, our model learns to externalize its spatial reasoning before acting. Our experiments show that our TRACE-tuned model achieves state-of-the-art performance, reaching 48.1% accuracy on the primary Where2Place (W2P) benchmark (a 9.6% relative improvement) and 55.0% on the more challenging W2P(h) subset. Crucially, an ablation study demonstrates that performance scales directly with the amount of reasoning data used, confirming the CoR's effectiveness. Furthermore, analysis of the model's attention maps reveals an interpretable reasoning process where focus shifts dynamically across reasoning steps. This work shows that training VLMs to generate a textual CoR is an effective and robust strategy for enhancing the precision, reliability, and interpretability of VLM-based robot control. Our dataset and code are available at https://github.com/jink-ucla/TRACE
comment: ICCV 2025. *Equal contribution. {\dag}Corresponding author
☆ Hybrid Neural Network-Based Indoor Localisation System for Mobile Robots Using CSI Data in a Robotics Simulator
We present a hybrid neural network model for inferring the position of mobile robots using Channel State Information (CSI) data from a Massive MIMO system. By leveraging an existing CSI dataset, our approach integrates a Convolutional Neural Network (CNN) with a Multilayer Perceptron (MLP) to form a Hybrid Neural Network (HyNN) that estimates 2D robot positions. CSI readings are converted into synthetic images using the TINTO tool. The localisation solution is integrated with a robotics simulator, and the Robot Operating System (ROS), which facilitates its evaluation through heterogeneous test cases, and the adoption of state estimators like Kalman filters. Our contributions illustrate the potential of our HyNN model in achieving precise indoor localisation and navigation for mobile robots in complex environments. The study follows, and proposes, a generalisable procedure applicable beyond the specific use case studied, making it adaptable to different scenarios and datasets.
comment: 13 pages, 7 figures. Conference paper (ROBOVIS 2025)
☆ Fractional Diffusion Bridge Models NeurIPS 2025
We present Fractional Diffusion Bridge Models (FDBM), a novel generative diffusion bridge framework driven by an approximation of the rich and non-Markovian fractional Brownian motion (fBM). Real stochastic processes exhibit a degree of memory effects (correlations in time), long-range dependencies, roughness and anomalous diffusion phenomena that are not captured in standard diffusion or bridge modeling due to the use of Brownian motion (BM). As a remedy, leveraging a recent Markovian approximation of fBM (MA-fBM), we construct FDBM that enable tractable inference while preserving the non-Markovian nature of fBM. We prove the existence of a coupling-preserving generative diffusion bridge and leverage it for future state prediction from paired training data. We then extend our formulation to the Schr\"{o}dinger bridge problem and derive a principled loss function to learn the unpaired data translation. We evaluate FDBM on both tasks: predicting future protein conformations from aligned data, and unpaired image translation. In both settings, FDBM achieves superior performance compared to the Brownian baselines, yielding lower root mean squared deviation (RMSD) of C$_\alpha$ atomic positions in protein structure prediction and lower Fr\'echet Inception Distance (FID) in unpaired image translation.
comment: To appear in NeurIPS 2025 proceedings. This version includes post-camera-ready revisions
☆ GenDexHand: Generative Simulation for Dexterous Hands
Data scarcity remains a fundamental bottleneck for embodied intelligence. Existing approaches use large language models (LLMs) to automate gripper-based simulation generation, but they transfer poorly to dexterous manipulation, which demands more specialized environment design. Meanwhile, dexterous manipulation tasks are inherently more difficult due to their higher degrees of freedom. Massively generating feasible and trainable dexterous hand tasks remains an open challenge. To this end, we present GenDexHand, a generative simulation pipeline that autonomously produces diverse robotic tasks and environments for dexterous manipulation. GenDexHand introduces a closed-loop refinement process that adjusts object placements and scales based on vision-language model (VLM) feedback, substantially improving the average quality of generated environments. Each task is further decomposed into sub-tasks to enable sequential reinforcement learning, reducing training time and increasing success rates. Our work provides a viable path toward scalable training of diverse dexterous hand behaviors in embodied intelligence by offering a simulation-based solution to synthetic data generation. Our website: https://winniechen2002.github.io/GenDexHand/.
☆ MOBIUS: A Multi-Modal Bipedal Robot that can Walk, Crawl, Climb, and Roll
This article presents a Multi-Modal Bipedal Intelligent Urban Scout robot (MOBIUS) capable of walking, crawling, climbing, and rolling. MOBIUS features four limbs--two 6-DoF arms with two-finger grippers for manipulation and climbing, and two 4-DoF legs for locomotion--enabling smooth transitions across diverse terrains without reconfiguration. A hybrid control architecture combines reinforcement learning-based locomotion with model-based predictive and admittance control enhanced for safety by a Reference Governor toward compliant contact interactions. A high-level MIQCP planner autonomously selects locomotion modes to balance stability and energy efficiency. Hardware experiments demonstrate robust gait transitions, dynamic climbing, and full-body load support via pinch grasp. Overall, MOBIUS demonstrates the importance of tight integration between morphology, high-level planning, and control to enable mobile loco-manipulation and grasping, substantially expanding its interaction capabilities, workspace, and traversability.
comment: 23 pages, 20 figures. Collaborative work between the Robotics and Mechanisms Laboratory (RoMeLa) and Mitsubishi Electric Research Laboratories (MERL)
☆ Lightweight Learning from Actuation-Space Demonstrations via Flow Matching for Whole-Body Soft Robotic Grasping
Robotic grasping under uncertainty remains a fundamental challenge due to its uncertain and contact-rich nature. Traditional rigid robotic hands, with limited degrees of freedom and compliance, rely on complex model-based and heavy feedback controllers to manage such interactions. Soft robots, by contrast, exhibit embodied mechanical intelligence: their underactuated structures and passive flexibility of their whole body, naturally accommodate uncertain contacts and enable adaptive behaviors. To harness this capability, we propose a lightweight actuation-space learning framework that infers distributional control representations for whole-body soft robotic grasping, directly from deterministic demonstrations using a flow matching model (Rectified Flow),without requiring dense sensing or heavy control loops. Using only 30 demonstrations (less than 8% of the reachable workspace), the learned policy achieves a 97.5% grasp success rate across the whole workspace, generalizes to grasped-object size variations of +-33%, and maintains stable performance when the robot's dynamic response is directly adjusted by scaling the execution time from 20% to 200%. These results demonstrate that actuation-space learning, by leveraging its passive redundant DOFs and flexibility, converts the body's mechanics into functional control intelligence and substantially reduces the burden on central controllers for this uncertain-rich task.
☆ 3EED: Ground Everything Everywhere in 3D NeurIPS 2025
Visual grounding in 3D is the key for embodied agents to localize language-referred objects in open-world environments. However, existing benchmarks are limited to indoor focus, single-platform constraints, and small scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We provide over 128,000 objects and 22,000 validated referring expressions across diverse outdoor scenes -- 10x larger than existing datasets. We develop a scalable annotation pipeline combining vision-language model prompting with human verification to ensure high-quality spatial grounding. To support cross-platform learning, we propose platform-aware normalization and cross-modal alignment techniques, and establish benchmark protocols for in-domain and cross-platform evaluations. Our findings reveal significant performance gaps, highlighting the challenges and opportunities of generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released to advance future research in language-driven 3D embodied perception.
comment: NeurIPS 2025 DB Track; 29 pages, 17 figures, 10 tables; Project Page at https://project-3eed.github.io/
☆ Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process
Vision-language-action (VLA) models aim to understand natural language instructions and visual observations and to execute corresponding actions as an embodied agent. Recent work integrates future images into the understanding-acting loop, yielding unified VLAs that jointly understand, generate, and act -- reading text and images and producing future images and actions. However, these models either rely on external experts for modality unification or treat image generation and action prediction as separate processes, limiting the benefits of direct synergy between these tasks. Our core philosophy is to optimize generation and action jointly through a synchronous denoising process, where the iterative refinement enables actions to evolve from initialization, under constant and sufficient visual guidance. We ground this philosophy in our proposed Unified Diffusion VLA and Joint Discrete Denoising Diffusion Process (JD3P), which is a joint diffusion process that integrates multiple modalities into a single denoising trajectory to serve as the key mechanism enabling understanding, generation, and acting to be intrinsically synergistic. Our model and theory are built on a unified tokenized space of all modalities and a hybrid attention mechanism. We further propose a two-stage training pipeline and several inference-time techniques that optimize performance and efficiency. Our approach achieves state-of-the-art performance on benchmarks such as CALVIN, LIBERO, and SimplerEnv with 4$\times$ faster inference than autoregressive methods, and we demonstrate its effectiveness through in-depth analysis and real-world evaluations. Our project page is available at https://irpn-eai.github.io/UD-VLA.github.io/.
☆ MARS: Multi-Agent Robotic System with Multimodal Large Language Models for Assistive Intelligence
Multimodal large language models (MLLMs) have shown remarkable capabilities in cross-modal understanding and reasoning, offering new opportunities for intelligent assistive systems, yet existing systems still struggle with risk-aware planning, user personalization, and grounding language plans into executable skills in cluttered homes. We introduce MARS - a Multi-Agent Robotic System powered by MLLMs for assistive intelligence and designed for smart home robots supporting people with disabilities. The system integrates four agents: a visual perception agent for extracting semantic and spatial features from environment images, a risk assessment agent for identifying and prioritizing hazards, a planning agent for generating executable action sequences, and an evaluation agent for iterative optimization. By combining multimodal perception with hierarchical multi-agent decision-making, the framework enables adaptive, risk-aware, and personalized assistance in dynamic indoor environments. Experiments on multiple datasets demonstrate the superior overall performance of the proposed system in risk-aware planning and coordinated multi-agent execution compared with state-of-the-art multimodal models. The proposed approach also highlights the potential of collaborative AI for practical assistive scenarios and provides a generalizable methodology for deploying MLLM-enabled multi-agent systems in real-world environments.
comment: 3 figures, 1 table; under review at Multimedia Systems (Springer)
☆ PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model
Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. However, current VLAs are mostly trained on large-scale image-text-action data and remain limited in two key ways: (i) they struggle with pixel-level scene understanding, and (ii) they rely heavily on textual prompts, which reduces their flexibility in real-world settings. To address these challenges, we introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompting encoder. To train PixelVLA effectively, we further propose a two-stage automated annotation pipeline that generates Pixel-160K, a large-scale dataset with pixel-level annotations derived from existing robot data. Experiments on three standard VLA benchmarks and two VLA model variants show that PixelVLA improves manipulation success rates by 10.1%-17.8% over OpenVLA, while requiring only 1.5% of its pretraining cost. These results demonstrate that PixelVLA can be integrated into existing VLAs to enable more accurate, efficient, and versatile robot control in complex environments. The dataset and code will be released as open source.
comment: 17pages,7 figures, 5 tabels
☆ Phy-Tac: Toward Human-Like Grasping via Physics-Conditioned Tactile Goals
Humans naturally grasp objects with minimal level required force for stability, whereas robots often rely on rigid, over-squeezing control. To narrow this gap, we propose a human-inspired physics-conditioned tactile method (Phy-Tac) for force-optimal stable grasping (FOSG) that unifies pose selection, tactile prediction, and force regulation. A physics-based pose selector first identifies feasible contact regions with optimal force distribution based on surface geometry. Then, a physics-conditioned latent diffusion model (Phy-LDM) predicts the tactile imprint under FOSG target. Last, a latent-space LQR controller drives the gripper toward this tactile imprint with minimal actuation, preventing unnecessary compression. Trained on a physics-conditioned tactile dataset covering diverse objects and contact conditions, the proposed Phy-LDM achieves superior tactile prediction accuracy, while the Phy-Tac outperforms fixed-force and GraspNet-based baselines in grasp stability and force efficiency. Experiments on classical robotic platforms demonstrate force-efficient and adaptive manipulation that bridges the gap between robotic and human grasping.
comment: 9 papges, 10 figures, 3 tables
☆ Discriminately Treating Motion Components Evolves Joint Depth and Ego-Motion Learning
Unsupervised learning of depth and ego-motion, two fundamental 3D perception tasks, has made significant strides in recent years. However, most methods treat ego-motion as an auxiliary task, either mixing all motion types or excluding depth-independent rotational motions in supervision. Such designs limit the incorporation of strong geometric constraints, reducing reliability and robustness under diverse conditions. This study introduces a discriminative treatment of motion components, leveraging the geometric regularities of their respective rigid flows to benefit both depth and ego-motion estimation. Given consecutive video frames, network outputs first align the optical axes and imaging planes of the source and target cameras. Optical flows between frames are transformed through these alignments, and deviations are quantified to impose geometric constraints individually on each ego-motion component, enabling more targeted refinement. These alignments further reformulate the joint learning process into coaxial and coplanar forms, where depth and each translation component can be mutually derived through closed-form geometric relationships, introducing complementary constraints that improve depth robustness. DiMoDE, a general depth and ego-motion joint learning framework incorporating these designs, achieves state-of-the-art performance on multiple public datasets and a newly collected diverse real-world dataset, particularly under challenging conditions. Our source code will be publicly available at mias.group/DiMoDE upon publication.
comment: 18 pages, 14 figures
☆ SE(3)-PoseFlow: Estimating 6D Pose Distributions for Uncertainty-Aware Robotic Manipulation
Object pose estimation is a fundamental problem in robotics and computer vision, yet it remains challenging due to partial observability, occlusions, and object symmetries, which inevitably lead to pose ambiguity and multiple hypotheses consistent with the same observation. While deterministic deep networks achieve impressive performance under well-constrained conditions, they are often overconfident and fail to capture the multi-modality of the underlying pose distribution. To address these challenges, we propose a novel probabilistic framework that leverages flow matching on the SE(3) manifold for estimating 6D object pose distributions. Unlike existing methods that regress a single deterministic output, our approach models the full pose distribution with a sample-based estimate and enables reasoning about uncertainty in ambiguous cases such as symmetric objects or severe occlusions. We achieve state-of-the-art results on Real275, YCB-V, and LM-O, and demonstrate how our sample-based pose estimates can be leveraged in downstream robotic manipulation tasks such as active perception for disambiguating uncertain viewpoints or guiding grasp synthesis in an uncertainty-aware manner.
☆ Floor Plan-Guided Visual Navigation Incorporating Depth and Directional Cues
Guiding an agent to a specific target in indoor environments based solely on RGB inputs and a floor plan is a promising yet challenging problem. Although existing methods have made significant progress, two challenges remain unresolved. First, the modality gap between egocentric RGB observations and the floor plan hinders the integration of visual and spatial information for both local obstacle avoidance and global planning. Second, accurate localization is critical for navigation performance, but remains challenging at deployment in unseen environments due to the lack of explicit geometric alignment between RGB inputs and floor plans. We propose a novel diffusion-based policy, denoted as GlocDiff, which integrates global path planning from the floor plan with local depth-aware features derived from RGB observations. The floor plan offers explicit global guidance, while the depth features provide implicit geometric cues, collectively enabling precise prediction of optimal navigation directions and robust obstacle avoidance. Moreover, GlocDiff introduces noise perturbation during training to enhance robustness against pose estimation errors, and we find that combining this with a relatively stable VO module during inference results in significantly improved navigation performance. Extensive experiments on the FloNa benchmark demonstrate GlocDiff's efficiency and effectiveness in achieving superior navigation performance, and the success of real-world deployments also highlights its potential for widespread practical applications.
☆ MO-SeGMan: Rearrangement Planning Framework for Multi Objective Sequential and Guided Manipulation in Constrained Environments
In this work, we introduce MO-SeGMan, a Multi-Objective Sequential and Guided Manipulation planner for highly constrained rearrangement problems. MO-SeGMan generates object placement sequences that minimize both replanning per object and robot travel distance while preserving critical dependency structures with a lazy evaluation method. To address highly cluttered, non-monotone scenarios, we propose a Selective Guided Forward Search (SGFS) that efficiently relocates only critical obstacles and to feasible relocation points. Furthermore, we adopt a refinement method for adaptive subgoal selection to eliminate unnecessary pick-and-place actions, thereby improving overall solution quality. Extensive evaluations on nine benchmark rearrangement tasks demonstrate that MO-SeGMan generates feasible motion plans in all cases, consistently achieving faster solution times and superior solution quality compared to the baselines. These results highlight the robustness and scalability of the proposed framework for complex rearrangement planning problems.
comment: 8 pages, 8 figures, website:https://sites.google.com/view/mo-segman/
☆ AERMANI-VLM: Structured Prompting and Reasoning for Aerial Manipulation with Vision Language Models
The rapid progress of vision--language models (VLMs) has sparked growing interest in robotic control, where natural language can express the operation goals while visual feedback links perception to action. However, directly deploying VLM-driven policies on aerial manipulators remains unsafe and unreliable since the generated actions are often inconsistent, hallucination-prone, and dynamically infeasible for flight. In this work, we present AERMANI-VLM, the first framework to adapt pretrained VLMs for aerial manipulation by separating high-level reasoning from low-level control, without any task-specific fine-tuning. Our framework encodes natural language instructions, task context, and safety constraints into a structured prompt that guides the model to generate a step-by-step reasoning trace in natural language. This reasoning output is used to select from a predefined library of discrete, flight-safe skills, ensuring interpretable and temporally consistent execution. By decoupling symbolic reasoning from physical action, AERMANI-VLM mitigates hallucinated commands and prevents unsafe behavior, enabling robust task completion. We validate the framework in both simulation and hardware on diverse multi-step pick-and-place tasks, demonstrating strong generalization to previously unseen commands, objects, and environments.
☆ Designing for Distributed Heterogeneous Modularity: On Software Architecture and Deployment of MoonBots SP
This paper presents the software architecture and deployment strategy behind the MoonBot platform: a modular space robotic system composed of heterogeneous components distributed across multiple computers, networks and ultimately celestial bodies. We introduce a principled approach to distributed, heterogeneous modularity, extending modular robotics beyond physical reconfiguration to software, communication and orchestration. We detail the architecture of our system that integrates component-based design, a data-oriented communication model using ROS2 and Zenoh, and a deployment orchestrator capable of managing complex multi-module assemblies. These abstractions enable dynamic reconfiguration, decentralized control, and seamless collaboration between numerous operators and modules. At the heart of this system lies our open-source Motion Stack software, validated by months of field deployment with self-assembling robots, inter-robot cooperation, and remote operation. Our architecture tackles the significant hurdles of modular robotics by significantly reducing integration and maintenance overhead, while remaining scalable and robust. Although tested with space in mind, we propose generalizable patterns for designing robotic systems that must scale across time, hardware, teams and operational environments.
comment: 6 pages, 8 figures. Accepted at ISPARO 2025
☆ FoldPath: End-to-End Object-Centric Motion Generation via Modulated Implicit Paths IROS 2025
Object-Centric Motion Generation (OCMG) is instrumental in advancing automated manufacturing processes, particularly in domains requiring high-precision expert robotic motions, such as spray painting and welding. To realize effective automation, robust algorithms are essential for generating extended, object-aware trajectories across intricate 3D geometries. However, contemporary OCMG techniques are either based on ad-hoc heuristics or employ learning-based pipelines that are still reliant on sensitive post-processing steps to generate executable paths. We introduce FoldPath, a novel, end-to-end, neural field based method for OCMG. Unlike prior deep learning approaches that predict discrete sequences of end-effector waypoints, FoldPath learns the robot motion as a continuous function, thus implicitly encoding smooth output paths. This paradigm shift eliminates the need for brittle post-processing steps that concatenate and order the predicted discrete waypoints. Particularly, our approach demonstrates superior predictive performance compared to recently proposed learning-based methods, and attains generalization capabilities even in real industrial settings, where only a limited amount of 70 expert samples are provided. We validate FoldPath through comprehensive experiments in a realistic simulation environment and introduce new, rigorous metrics designed to comprehensively evaluate long-horizon robotic paths, thus advancing the OCMG task towards practical maturity.
comment: Accepted at 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)
☆ CaRLi-V: Camera-RADAR-LiDAR Point-Wise 3D Velocity Estimation
Accurate point-wise velocity estimation in 3D is crucial for robot interaction with non-rigid, dynamic agents, such as humans, enabling robust performance in path planning, collision avoidance, and object manipulation in dynamic environments. To this end, this paper proposes a novel RADAR, LiDAR, and camera fusion pipeline for point-wise 3D velocity estimation named CaRLi-V. This pipeline leverages raw RADAR measurements to create a novel RADAR representation, the velocity cube, which densely represents radial velocities within the RADAR's field-of-view. By combining the velocity cube for radial velocity extraction, optical flow for tangential velocity estimation, and LiDAR for point-wise range measurements through a closed-form solution, our approach can produce 3D velocity estimates for a dense array of points. Developed as an open-source ROS2 package, CaRLi-V has been field-tested against a custom dataset and proven to produce low velocity error metrics relative to ground truth, enabling point-wise velocity estimation for robotic applications.
☆ EREBUS: End-to-end Robust Event Based Underwater Simulation ICRA
The underwater domain presents a vast array of challenges for roboticists and computer vision researchers alike, such as poor lighting conditions and high dynamic range scenes. In these adverse conditions, traditional vision techniques struggle to adapt and lead to suboptimal performance. Event-based cameras present an attractive solution to this problem, mitigating the issues of traditional cameras by tracking changes in the footage on a frame-by-frame basis. In this paper, we introduce a pipeline which can be used to generate realistic synthetic data of an event-based camera mounted to an AUV (Autonomous Underwater Vehicle) in an underwater environment for training vision models. We demonstrate the effectiveness of our pipeline using the task of rock detection with poor visibility and suspended particulate matter, but the approach can be generalized to other underwater tasks.
comment: Accepted to ICRA AQUA2SIM Workshop 2025, 6 pages, 3 figures, conference paper
☆ CM-LIUW-Odometry: Robust and High-Precision LiDAR-Inertial-UWB-Wheel Odometry for Extreme Degradation Coal Mine Tunnels IROS 2025
Simultaneous Localization and Mapping (SLAM) in large-scale, complex, and GPS-denied underground coal mine environments presents significant challenges. Sensors must contend with abnormal operating conditions: GPS unavailability impedes scene reconstruction and absolute geographic referencing, uneven or slippery terrain degrades wheel odometer accuracy, and long, feature-poor tunnels reduce LiDAR effectiveness. To address these issues, we propose CoalMine-LiDAR-IMU-UWB-Wheel-Odometry (CM-LIUW-Odometry), a multimodal SLAM framework based on the Iterated Error-State Kalman Filter (IESKF). First, LiDAR-inertial odometry is tightly fused with UWB absolute positioning constraints to align the SLAM system with a global coordinate. Next, wheel odometer is integrated through tight coupling, enhanced by nonholonomic constraints (NHC) and vehicle lever arm compensation, to address performance degradation in areas beyond UWB measurement range. Finally, an adaptive motion mode switching mechanism dynamically adjusts the robot's motion mode based on UWB measurement range and environmental degradation levels. Experimental results validate that our method achieves superior accuracy and robustness in real-world underground coal mine scenarios, outperforming state-of-the-art approaches. We open source our code of this work on Github to benefit the robotics community.
comment: Accepted by IROS 2025
☆ Lateral Velocity Model for Vehicle Parking Applications
Automated parking requires accurate localization for quick and precise maneuvering in tight spaces. While the longitudinal velocity can be measured using wheel encoders, the estimation of the lateral velocity remains a key challenge due to the absence of dedicated sensors in consumer-grade vehicles. Existing approaches often rely on simplified vehicle models, such as the zero-slip model, which assumes no lateral velocity at the rear axle. It is well established that this assumption does not hold during low-speed driving and researchers thus introduce additional heuristics to account for differences. In this work, we analyze real-world data from parking scenarios and identify a systematic deviation from the zero-slip assumption. We provide explanations for the observed effects and then propose a lateral velocity model that better captures the lateral dynamics of the vehicle during parking. The model improves estimation accuracy, while relying on only two parameters, making it well-suited for integration into consumer-grade applications.
comment: This manuscript has been submitted to Vehicle System Dynamics for possible publication
☆ Model to Model: Understanding the Venus Flytrap Snapping Mechanism and Transferring it to a 3D-printed Bistable Soft Robotic Demonstrator
The Venus flytrap (Dionaea muscipula) does not only serve as the textbook model for a carnivorous plant, but also has long intrigued both botanists and engineers with its rapidly closing leaf trap. The trap closure is triggered by two consecutive touches of a potential prey, after which the lobes rapidly switch from their concave open-state to their convex close-state and catch the prey within 100-500 ms after being triggered. This transformation from concave to convex is initiated by changes in turgor pressure and the release of stored elastic energy from prestresses in the concave state, which accelerate this movement, leading to inversion of the lobes bi-axial curvature. Possessing two low-energy states, the leaves can be characterized as bistable systems. With our research, we seek to deepen the understanding of Venus flytrap motion mechanics and apply its principles to the design of an artificial bistable lobe actuator. We identified geometrical characteristics, such as dimensional ratios and the thickness gradient in the lobe, and transferred these to two 3D-printed bistable actuator models. One actuator parallels the simulated geometry of a Venus flytrap leaf, the other is a lobe model designed with CAD. Both models display concave-convex bi-stability and snap close. These demonstrators are the first step in the development of an artificial Venus flytrap that mimics the mechanical behavior of the biological model and can be used as a soft fast gripper.
comment: Conference Proceedings Paper Living machines 2025
☆ Design and development of an electronics-free earthworm robot
Soft robotic systems have gained widespread attention due to their inherent flexibility, adaptability, and safety, making them well-suited for varied applications. Among bioinspired designs, earthworm locomotion has been extensively studied for its efficient peristaltic motion, enabling movement in confined and unstructured environments. Existing earthworm-inspired robots primarily utilize pneumatic actuation due to its high force-to-weight ratio and ease of implementation. However, these systems often rely on bulky, power-intensive electronic control units, limiting their practicality. In this work, we present an electronics-free, earthworm-inspired pneumatic robot utilizing a modified Pneumatic Logic Gate (PLG) design. By integrating preconfigured PLG units with bellow actuators, we achieved a plug-and-play style modular system capable of peristaltic locomotion without external electronic components. The proposed design reduces system complexity while maintaining efficient actuation. We characterize the bellow actuators under different operating conditions and evaluate the robots locomotion performance. Our findings demonstrate that the modified PLG-based control system effectively generates peristaltic wave propagation, achieving autonomous motion with minimal deviation. This study serves as a proof of concept for the development of electronics-free, peristaltic soft robots. The proposed system has potential for applications in hazardous environments, where untethered, adaptable locomotion is critical. Future work will focus on further optimizing the robot design and exploring untethered operation using onboard compressed air sources.
comment: Conference Proceedings Paper Living Machines 2025
☆ Thermo-responsive closing and reopening artificial Venus Flytrap utilizing shape memory elastomers
Despite their often perceived static and slow nature, some plants can move faster than the blink of an eye. The rapid snap closure motion of the Venus flytrap (Dionaea muscipula) has long captivated the interest of researchers and engineers alike, serving as a model for plant-inspired soft machines and robots. The translation of the fast snapping closure has inspired the development of various artificial Venus flytrap (AVF) systems. However, translating both the closing and reopening motion of D. muscipula into an autonomous plant inspired soft machine has yet to be achieved. In this study, we present an AVF that autonomously closes and reopens, utilizing novel thermo-responsive UV-curable shape memory materials for soft robotic systems. The life-sized thermo-responsive AVF exhibits closing and reopening motions triggered in a naturally occurring temperature range. The doubly curved trap lobes, built from shape memory polymers, close at 38{\deg}C, while reopening initiates around 45{\deg}C, employing shape memory elastomer strips as antagonistic actuators to facilitate lobe reopening. This work represents the first demonstration of thermo-responsive closing and reopening in an AVF with programmed sequential motion in response to increasing temperature. This approach marks the next step toward autonomously bidirectional moving soft machines/robots.
comment: Conference Proceedings Paper Living Machines 2025
☆ Embodied Cognition Augmented End2End Autonomous Driving
In recent years, vision-based end-to-end autonomous driving has emerged as a new paradigm. However, popular end-to-end approaches typically rely on visual feature extraction networks trained under label supervision. This limited supervision framework restricts the generality and applicability of driving models. In this paper, we propose a novel paradigm termed $E^{3}AD$, which advocates for comparative learning between visual feature extraction networks and the general EEG large model, in order to learn latent human driving cognition for enhancing end-to-end planning. In this work, we collected a cognitive dataset for the mentioned contrastive learning process. Subsequently, we investigated the methods and potential mechanisms for enhancing end-to-end planning with human driving cognition, using popular driving models as baselines on publicly available autonomous driving datasets. Both open-loop and closed-loop tests are conducted for a comprehensive evaluation of planning performance. Experimental results demonstrate that the $E^{3}AD$ paradigm significantly enhances the end-to-end planning performance of baseline models. Ablation studies further validate the contribution of driving cognition and the effectiveness of comparative learning process. To the best of our knowledge, this is the first work to integrate human driving cognition for improving end-to-end autonomous driving planning. It represents an initial attempt to incorporate embodied cognitive data into end-to-end autonomous driving, providing valuable insights for future brain-inspired autonomous driving systems. Our code will be made available at Github
comment: 24 pages,4 pages
☆ RobustVLA: Robustness-Aware Reinforcement Post-Training for Vision-Language-Action Models
Vision-Language-Action (VLA) models have recently emerged as powerful general-purpose policies for robotic manipulation, benefiting from large-scale multi-modal pre-training. However, they often fail to generalize reliably in out-of-distribution deployments, where unavoidable disturbances such as observation noise, sensor errors, or actuation perturbations become prevalent. While recent Reinforcement Learning (RL)-based post-training provides a practical means to adapt pre-trained VLA models, existing methods mainly emphasize reward maximization and overlook robustness to environmental uncertainty. In this work, we introduce RobustVLA, a lightweight online RL post-training method designed to explicitly enhance the resilience of VLA models. Through a systematic robustness analysis, we identify two key regularizations: Jacobian regularization, which mitigates sensitivity to observation noise, and smoothness regularization, which stabilizes policies under action perturbations. Extensive experiments across diverse robotic environments demonstrate that RobustVLA significantly outperforms prior state-of-the-art methods in robustness and reliability. Our results highlight the importance of principled robustness-aware RL post-training as a key step toward improving the reliability and robustness of VLA models.
☆ A High-Speed Capable Spherical Robot
This paper designs a new spherical robot structure capable of supporting high-speed motion at up to 10 m/s. Building upon a single-pendulum-driven spherical robot, the design incorporates a momentum wheel with an axis aligned with the secondary pendulum, creating a novel spherical robot structure. Practical experiments with the physical prototype have demonstrated that this new spherical robot can achieve stable high-speed motion through simple decoupled control, which was unattainable with the original structure. The spherical robot designed for high-speed motion not only increases speed but also significantly enhances obstacle-crossing performance and terrain robustness.
comment: 5 pages
☆ Lyapunov Stability Learning with Nonlinear Control via Inductive Biases
Finding a control Lyapunov function (CLF) in a dynamical system with a controller is an effective way to guarantee stability, which is a crucial issue in safety-concerned applications. Recently, deep learning models representing CLFs have been applied into a learner-verifier framework to identify satisfiable candidates. However, the learner treats Lyapunov conditions as complex constraints for optimisation, which is hard to achieve global convergence. It is also too complicated to implement these Lyapunov conditions for verification. To improve this framework, we treat Lyapunov conditions as inductive biases and design a neural CLF and a CLF-based controller guided by this knowledge. This design enables a stable optimisation process with limited constraints, and allows end-to-end learning of both the CLF and the controller. Our approach achieves a higher convergence rate and larger region of attraction (ROA) in learning the CLF compared to existing methods among abundant experiment cases. We also thoroughly reveal why the success rate decreases with previous methods during learning.
comment: Accepted by IEEE Robio 2025
♻ ☆ Robotic Monitoring of Colorimetric Leaf Sensors for Precision Agriculture ICRA
Common remote sensing modalities (RGB, multispectral, hyperspectral imaging or LiDAR) are often used to indirectly measure crop health and do not directly capture plant stress indicators. Commercially available direct leaf sensors are bulky, powered electronics that are expensive and interfere with crop growth. In contrast, low-cost, passive and bio-degradable leaf sensors offer an opportunity to advance real-time monitoring as they directly interface with the crop surface while not interfering with crop growth. To this end, we co-design a sensor-detector system, where the sensor is a passive colorimetric leaf sensor that directly measures crop health in a precision agriculture setting, and the detector autonomously obtains optical signals from these leaf sensors. The detector comprises a low size weight and power (SWaP) mobile ground robot with an onboard monocular RGB camera and object detector to localize each leaf sensor, as well as a hyperspectral camera with a motorized mirror and halogen light to acquire hyperspectral images. The sensor's crop health-dependent optical signals can be extracted from the hyperspectral images. The proof-of-concept system is demonstrated in row-crop environments both indoors and outdoors where it is able to autonomously navigate, locate and obtain a hyperspectral image of all leaf sensors present, and acquire interpretable spectral resonance with 80 $\%$ accuracy within a required retrieval distance from the sensor.
comment: Revised version. Initial version was accepted to the Novel Approaches for Precision Agriculture and Forestry with Autonomous Robots IEEE ICRA Workshop - 2025
♻ ☆ Kinematically Controllable Cable Robots with Reconfigurable End-effectors
To enlarge the translational workspace of cable-driven robots, one common approach is to increase the number of cables. However, this introduces two challenges: (1) cable interference significantly reduces the rotational workspace, and (2) the solution of tensions in cables becomes non-unique, resulting in difficulties for kinematic control of the robot. In this work, we design structurally simple reconfigurable end-effectors for cable robots. By incorporating a spring, a helical-grooved shaft, and a matching nut, relative linear motions between end-effector components are converted into relative rotations, thereby expanding the rotational workspace of the mechanism. Meanwhile, a bearing is introduced to provide an additional rotational degree of freedom, making the mechanism non-redundant. As a result, the robot's motion can be controlled purely through kinematics without additional tension sensing and control.
comment: 8 pages, 7 figures, Technical Report
♻ ☆ Mixed-Density Diffuser: Efficient Planning with Non-uniform Temporal Resolution
Recent studies demonstrate that diffusion planners benefit from sparse-step planning over single-step planning. Training models to skip steps in their trajectories helps capture long-term dependencies without additional or memory computational cost. However, predicting excessively sparse plans degrades performance. We hypothesize this temporal density threshold is non-uniform across a temporal horizon and that certain parts of a planned trajectory should be more densely planned. We propose Mixed Density Diffuser (MDD), a diffusion planner where the densities throughout the horizon are tunable hyperparameters. MDD achieves a new SOTA across the Maze2D, Franka Kitchen, and Antmaze D4RL task domains.
comment: European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESSAN) (under review)
♻ ☆ MarsLGPR: Mars Rover Localization with Ground Penetrating Radar
In this work, we propose the use of Ground Penetrating Radar (GPR) for rover localization on Mars. Precise pose estimation is an important task for mobile robots exploring planetary surfaces, as they operate in GPS-denied environments. Although visual odometry provides accurate localization, it is computationally expensive and can fail in dim or high-contrast lighting. Wheel encoders can also provide odometry estimation, but are prone to slipping on the sandy terrain encountered on Mars. Although traditionally a scientific surveying sensor, GPR has been used on Earth for terrain classification and localization through subsurface feature matching. The Perseverance rover and the upcoming ExoMars rover have GPR sensors already equipped to aid in the search of water and mineral resources. We propose to leverage GPR to aid in Mars rover localization. Specifically, we develop a novel GPR-based deep learning model that predicts 1D relative pose translation. We fuse our GPR pose prediction method with inertial and wheel encoder data in a filtering framework to output rover localization. We perform experiments in a Mars analog environment and demonstrate that our GPR-based displacement predictions both outperform wheel encoders and improve multi-modal filtering estimates in high-slip environments. Lastly, we present the first dataset aimed at GPR-based localization in Mars analog environments, which will be made publicly available at https://umfieldrobotics.github.io/marslgpr.
♻ ☆ Cosmos-Surg-dVRK: World Foundation Model-based Automated Online Evaluation of Surgical Robot Policy Learning
The rise of surgical robots and vision-language-action models has accelerated the development of autonomous surgical policies and efficient assessment strategies. However, evaluating these policies directly on physical robotic platforms such as the da Vinci Research Kit (dVRK) remains hindered by high costs, time demands, reproducibility challenges, and variability in execution. World foundation models (WFM) for physical AI offer a transformative approach to simulate complex real-world surgical tasks, such as soft tissue deformation, with high fidelity. This work introduces Cosmos-Surg-dVRK, a surgical finetune of the Cosmos WFM, which, together with a trained video classifier, enables fully automated online evaluation and benchmarking of surgical policies. We evaluate Cosmos-Surg-dVRK using two distinct surgical datasets. On tabletop suture pad tasks, the automated pipeline achieves strong correlation between online rollouts in Cosmos-Surg-dVRK and policy outcomes on the real dVRK Si platform, as well as good agreement between human labelers and the V-JEPA 2-derived video classifier. Additionally, preliminary experiments with ex-vivo porcine cholecystectomy tasks in Cosmos-Surg-dVRK demonstrate promising alignment with real-world evaluations, highlighting the platform's potential for more complex surgical procedures.
comment: minor metadata and notation fixes; +3 citations
♻ ☆ FlexEvent: Towards Flexible Event-Frame Object Detection at Varying Operational Frequencies NeurIPS 2025
Event cameras offer unparalleled advantages for real-time perception in dynamic environments, thanks to the microsecond-level temporal resolution and asynchronous operation. Existing event detectors, however, are limited by fixed-frequency paradigms and fail to fully exploit the high-temporal resolution and adaptability of event data. To address these limitations, we propose FlexEvent, a novel framework that enables detection at varying frequencies. Our approach consists of two key components: FlexFuse, an adaptive event-frame fusion module that integrates high-frequency event data with rich semantic information from RGB frames, and FlexTune, a frequency-adaptive fine-tuning mechanism that generates frequency-adjusted labels to enhance model generalization across varying operational frequencies. This combination allows our method to detect objects with high accuracy in both fast-moving and static scenarios, while adapting to dynamic environments. Extensive experiments on large-scale event camera datasets demonstrate that our approach surpasses state-of-the-art methods, achieving significant improvements in both standard and high-frequency settings. Notably, our method maintains robust performance when scaling from 20 Hz to 90 Hz and delivers accurate detection up to 180 Hz, proving its effectiveness in extreme conditions. Our framework sets a new benchmark for event-based object detection and paves the way for more adaptable, real-time vision systems.
comment: NeurIPS 2025; 28 pages, 14 figures, 10 tables; Code at https://flexevent.github.io/
♻ ☆ If They Disagree, Will You Conform? Exploring the Role of Robots' Value Awareness in a Decision-Making Task
This study investigates whether the opinions of robotic agents can influence human decision-making when robots display value awareness (i.e., the capability of understanding human preferences and prioritizing them in decision-making). We designed an experiment in which participants interacted with two Furhat robots - one programmed to be Value-Aware and the other Non-Value-Aware - during a labeling task for images representing human values. Results indicate that participants distinguished the Value-Aware robot from the Non-Value-Aware one. Although their explicit choices did not indicate a clear preference for one robot over the other, participants directed their gaze more toward the Value-Aware robot. Additionally, the Value-Aware robot was perceived as more loyal, suggesting that value awareness in a social robot may enhance its perceived commitment to the group. Finally, when both robots disagreed with the participant, conformity occurred in about one out of four trials, and participants took longer to confirm their responses, suggesting that two robots expressing dissent may introduce hesitation in decision-making. On one hand, this highlights the potential risk that robots, if misused, could manipulate users for unethical purposes. On the other hand, it reinforces the idea that social robots could encourage reflection in ambiguous situations and help users avoid scams.
comment: Pre-print version
♻ ☆ DW-A-PRM: A Dynamic Weighted Planner
Robot path planning plays a pivotal role in enabling autonomous systems to navigate safely and efficiently in complex and uncertain environments. Despite extensive research on classical graph-based methods and sampling-based planners, achieving an optimal balance between global optimality, computational efficiency, and adaptability to dynamic environments remains an open challenge. To address this issue, this paper proposes a hybrid path planning framework, which integrates heuristic-driven search with probabilistic roadmap construction under a dynamic weighting scheme. By coupling the global guidance of A* with the stochastic exploration of PRM, the method achieves a synergistic balance between search optimality and computational tractability. Comprehensive experiments in diverse simulated environments demonstrate that the proposed method consistently yields smoother and shorter paths while significantly reducing computational overhead compared with conventional approach and other hybrid planners. These results highlight the potential of the proposed framework as an effective and generalizable solution for real-time robotic navigation in complex environments.
♻ ☆ RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning
Real-world robotic manipulation in homes and factories demands reliability, efficiency, and robustness that approach or surpass skilled human operators. We present RL-100, a real-world reinforcement learning training framework built on diffusion visuomotor policies trained by supervised learning. RL-100 introduces a three-stage pipeline. First, imitation learning leverages human priors. Second, iterative offline reinforcement learning uses an Offline Policy Evaluation procedure, abbreviated OPE, to gate PPO-style updates that are applied in the denoising process for conservative and reliable improvement. Third, online reinforcement learning eliminates residual failure modes. An additional lightweight consistency distillation head compresses the multi-step sampling process in diffusion into a single-step policy, enabling high-frequency control with an order-of-magnitude reduction in latency while preserving task performance. The framework is task-, embodiment-, and representation-agnostic and supports both 3D point clouds and 2D RGB inputs, a variety of robot platforms, and both single-step and action-chunk policies. We evaluate RL-100 on seven real-robot tasks spanning dynamic rigid-body control, such as Push-T and Agile Bowling, fluids and granular pouring, deformable cloth folding, precise dexterous unscrewing, and multi-stage orange juicing. RL-100 attains 100\% success across evaluated trials for a total of 900 out of 900 episodes, including up to 250 out of 250 consecutive trials on one task. The method achieves near-human teleoperation or better time efficiency and demonstrates multi-hour robustness with uninterrupted operation lasting up to two hours.
comment: https://lei-kun.github.io/RL-100/
♻ ☆ Neuro-Symbolic Imitation Learning: Discovering Symbolic Abstractions for Skill Learning ICRA
Imitation learning is a popular method for teaching robots new behaviors. However, most existing methods focus on teaching short, isolated skills rather than long, multi-step tasks. To bridge this gap, imitation learning algorithms must not only learn individual skills but also an abstract understanding of how to sequence these skills to perform extended tasks effectively. This paper addresses this challenge by proposing a neuro-symbolic imitation learning framework. Using task demonstrations, the system first learns a symbolic representation that abstracts the low-level state-action space. The learned representation decomposes a task into easier subtasks and allows the system to leverage symbolic planning to generate abstract plans. Subsequently, the system utilizes this task decomposition to learn a set of neural skills capable of refining abstract plans into actionable robot commands. Experimental results in three simulated robotic environments demonstrate that, compared to baselines, our neuro-symbolic approach increases data efficiency, improves generalization capabilities, and facilitates interpretability.
comment: IEEE International Conference on Robotics and Automation (ICRA) 2025
♻ ☆ Infinite-Horizon Value Function Approximation for Model Predictive Control
Model Predictive Control has emerged as a popular tool for robots to generate complex motions. However, the real-time requirement has limited the use of hard constraints and large preview horizons, which are necessary to ensure safety and stability. In practice, practitioners have to carefully design cost functions that can imitate an infinite horizon formulation, which is tedious and often results in local minima. In this work, we study how to approximate the infinite horizon value function of constrained optimal control problems with neural networks using value iteration and trajectory optimization. Furthermore, we experimentally demonstrate how using this value function approximation as a terminal cost provides global stability to the model predictive controller. The approach is validated on two toy problems and a real-world scenario with online obstacle avoidance on an industrial manipulator where the value function is conditioned to the goal and obstacle.
♻ ☆ UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of pretraining compute and 1/10 of downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.
comment: Accepted to RSS 2025. Code is available at https://github.com/OpenDriveLab/UniVLA
♻ ☆ A Helping (Human) Hand in Kinematic Structure Estimation ICRA25
Visual uncertainties such as occlusions, lack of texture, and noise present significant challenges in obtaining accurate kinematic models for safe robotic manipulation. We introduce a probabilistic real-time approach that leverages the human hand as a prior to mitigate these uncertainties. By tracking the constrained motion of the human hand during manipulation and explicitly modeling uncertainties in visual observations, our method reliably estimates an object's kinematic model online. We validate our approach on a novel dataset featuring challenging objects that are occluded during manipulation and offer limited articulations for perception. The results demonstrate that by incorporating an appropriate prior and explicitly accounting for uncertainties, our method produces accurate estimates, outperforming two recent baselines by 195% and 140%, respectively. Furthermore, we demonstrate that our approach's estimates are precise enough to allow a robot to manipulate even small objects safely.
comment: Accepted at ICRA25; 8 pages + 7 figures; For supplementary material, see https://www.tu.berlin/robotics/papers/helpinghands
♻ ☆ VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation
In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point clouds feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6% on par with DP3 64.0% and far higher than DP 34.8%, while in real-world tasks, it reaches 87.9%, outperforming both DP3 67.5% and DP 11.2% by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.
♻ ☆ Bellman Diffusion Models
Diffusion models have seen tremendous success as generative architectures. Recently, they have been shown to be effective at modelling policies for offline reinforcement learning and imitation learning. We explore using diffusion as a model class for the successor state measure (SSM) of a policy. We find that enforcing the Bellman flow constraints leads to a simple Bellman update on the diffusion step distribution.
♻ ☆ MARFT: Multi-Agent Reinforcement Fine-Tuning
LLM-based Multi-Agent Systems have demonstrated remarkable capabilities in addressing complex, agentic tasks, from generating high-quality presentation slides to even conducting sophisticated scientific research. Meanwhile, RL has been widely recognized for its effectiveness in enhancing agent intelligence, but limited research has investigated the fine-tuning of LaMAS using foundational RL techniques. Moreover, the direct application of MARL methods to LaMAS introduces significant challenges, stemming from the unique characteristics and mechanisms inherent to LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes a novel paradigm termed Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce a brand-new MG called Flex-MG, which aligns with the LaMAS optimization in real-world applications and a universal algorithmic framework tailored specifically for LaMAS, outlining the conceptual foundations, key distinctions, and practical implementation strategies. We review the evolution from RL to RFT, setting the stage for a parallel analysis in the multi-agent domain. In the context of LaMAS, we elucidate critical differences between MARL and MARFT. These differences motivate a transition toward a LaMAS-oriented formulation of RFT. Central to this work is a robust and scalable MARFT framework. We detail the core algorithm and provide a complete, open-source implementation to facilitate adoption and further research. The latter sections of the paper explore real-world application perspectives and opening challenges in MARFT. By bridging theoretical underpinnings with practical methodologies, this work serves as a roadmap for researchers seeking to advance MARFT toward resilient and adaptive solutions in agentic systems. Our implementation of the proposed framework is publicly available at: https://github.com/jwliao-ai/MARFT.
comment: 42 pages
♻ ☆ A Time-dependent Risk-aware distributed Multi-Agent Path Finder based on A* IROS 2025
Multi-Agent Path-Finding (MAPF) focuses on the collaborative planning of paths for multiple agents within shared spaces, aiming for collision-free navigation. Conventional planning methods often overlook the presence of other agents, which can result in conflicts. In response, this article introduces the A$^*_+$T algorithm, a distributed approach that improves coordination among agents by anticipating their positions based on their movement speeds. The algorithm also considers dynamic obstacles, assessing potential collisions with respect to observed speeds and trajectories, thereby facilitating collision-free path planning in environments populated by other agents and moving objects. It incorporates a risk layer surrounding both dynamic and static entities, enhancing its utility in real-world applications. Each agent functions autonomously while being mindful of the paths chosen by others, effectively addressing the complexities inherent in multi-agent situations. The performance of A$^*_+$T has been rigorously tested in the Gazebo simulation environment and benchmarked against established approaches such as CBS, ECBS, and SIPP. Furthermore, the algorithm has shown competence in single-agent experiments, with results demonstrating its effectiveness in managing dynamic obstacles and affirming its practical relevance across various scenarios.
comment: 8 pages, 10 figures, 2 tabels, submited to IROS 2025
♻ ☆ DTAA: A Detect, Track and Avoid Architecture for navigation in spaces with Multiple Velocity Objects
Proactive collision avoidance measures are imperative in environments where humans and robots coexist. Moreover, the introduction of high quality legged robots into workplaces highlighted the crucial role of a robust, fully autonomous safety solution for robots to be viable in shared spaces or in co-existence with humans. This article establishes for the first time ever an innovative Detect-Track-and-Avoid Architecture (DTAA) to enhance safety and overall mission performance. The proposed novel architectyre has the merit ot integrating object detection using YOLOv8, utilizing Ultralytics embedded object tracking, and state estimation of tracked objects through Kalman filters. Moreover, a novel heuristic clustering is employed to facilitate active avoidance of multiple closely positioned objects with similar velocities, creating sets of unsafe spaces for the Nonlinear Model Predictive Controller (NMPC) to navigate around. The NMPC identifies the most hazardous unsafe space, considering not only their current positions but also their predicted future locations. In the sequel, the NMPC calculates maneuvers to guide the robot along a path planned by D$^{*}_{+}$ towards its intended destination, while maintaining a safe distance to all identified obstacles. The efficacy of the novelly suggested DTAA framework is being validated by Real-life experiments featuring a Boston Dynamics Spot robot that demonstrates the robot's capability to consistently maintain a safe distance from humans in dynamic subterranean, urban indoor, and outdoor environments.
♻ ☆ Spatiotemporal Calibration for Laser Vision Sensor in Hand-eye System Based on Straight-line Constraint
Laser vision sensors (LVS) are critical perception modules for industrial robots, facilitating real-time acquisition of workpiece geometric data in welding applications. However, the camera communication delay will lead to a temporal desynchronization between captured images and the robot motions. Additionally, hand-eye extrinsic parameters may vary during prolonged measurement. To address these issues, we introduce a measurement model of LVS considering the effect of the camera's time-offset and propose a teaching-free spatiotemporal calibration method utilizing line constraints. This method involves a robot equipped with an LVS repeatedly scanning straight-line fillet welds using S-shaped trajectories. Regardless of the robot's orientation changes, all measured welding positions are constrained to a straight-line, represented by Plucker coordinates. Moreover, a nonlinear optimization model based on straight-line constraints is established. Subsequently, the Levenberg-Marquardt algorithm (LMA) is employed to optimize parameters, including time-offset, hand-eye extrinsic parameters, and straight-line parameters. The feasibility and accuracy of the proposed approach are quantitatively validated through experiments on curved weld scanning. We open-sourced the code, dataset, and simulation report at https://anonymous.4open.science/r/LVS_ST_CALIB-015F/README.md.
comment: Submitted to IEEE RAL
♻ ☆ Dual-Regularized Riccati Recursions for Interior-Point Optimal Control
We derive closed-form extensions of Riccati's recursions (both sequential and parallel) for solving dual-regularized LQR problems. We show how these methods can be used to solve general constrained, non-convex, discrete-time optimal control problems via a regularized interior point method, while guaranteeing that each primal step is a descent direction of an Augmented Barrier-Lagrangian merit function. We provide MIT-licensed implementations of our methods in C++ and JAX.
♻ ☆ From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems EMNLP 2025
Foundation models (FMs) are increasingly used to bridge language and action in embodied agents, yet the operational characteristics of different FM integration strategies remain under-explored -- particularly for complex instruction following and versatile action generation in changing environments. This paper examines three paradigms for building robotic systems: end-to-end vision-language-action (VLA) models that implicitly integrate perception and planning, and modular pipelines incorporating either vision-language models (VLMs) or multimodal large language models (LLMs). We evaluate these paradigms through two focused case studies: a complex instruction grounding task assessing fine-grained instruction understanding and cross-modal disambiguation, and an object manipulation task targeting skill transfer via VLA finetuning. Our experiments in zero-shot and few-shot settings reveal trade-offs in generalization and data efficiency. By exploring performance limits, we distill design implications for developing language-driven physical agents and outline emerging challenges and opportunities for FM-powered robotics in real-world conditions.
comment: EMNLP 2025 camera ready
♻ ☆ Simultaneous System Identification and Model Predictive Control with No Dynamic Regret
We provide an algorithm for the simultaneous system identification and model predictive control of nonlinear systems. The algorithm has finite-time near-optimality guarantees and asymptotically converges to the optimal (non-causal) controller. Particularly, the algorithm enjoys sublinear dynamic regret, defined herein as the suboptimality against an optimal clairvoyant controller that knows how the unknown disturbances and system dynamics will adapt to its actions. The algorithm is self-supervised and applies to control-affine systems with unknown dynamics and disturbances that can be expressed in reproducing kernel Hilbert spaces. Such spaces can model external disturbances and modeling errors that can even be adaptive to the system's state and control input. For example, they can model wind and wave disturbances to aerial and marine vehicles, or inaccurate model parameters such as inertia of mechanical systems. The algorithm first generates random Fourier features that are used to approximate the unknown dynamics or disturbances. Then, it employs model predictive control based on the current learned model of the unknown dynamics (or disturbances). The model of the unknown dynamics is updated online using least squares based on the data collected while controlling the system. We validate our algorithm in both hardware experiments and physics-based simulations. The simulations include (i) a cart-pole aiming to maintain the pole upright despite inaccurate model parameters, and (ii) a quadrotor aiming to track reference trajectories despite unmodeled aerodynamic drag effects. The hardware experiments include a quadrotor aiming to track a circular trajectory despite unmodeled aerodynamic drag effects, ground effects, and wind disturbances.
comment: IEEE Transactions on Robotics (T-RO). v6 update on stability analysis in Appendix J under relaxed Assumption 1
♻ ☆ iKap: Kinematics-aware Planning with Imperative Learning
Trajectory planning in robotics aims to generate collision-free pose sequences that can be reliably executed. Recently, vision-to-planning systems have gained increasing attention for their efficiency and ability to interpret and adapt to surrounding environments. However, traditional modular systems suffer from increased latency and error propagation, while purely data-driven approaches often overlook the robot's kinematic constraints. This oversight leads to discrepancies between planned trajectories and those that are executable. To address these challenges, we propose iKap, a novel vision-to-planning system that integrates the robot's kinematic model directly into the learning pipeline. iKap employs a self-supervised learning approach and incorporates the state transition model within a differentiable bi-level optimization framework. This integration ensures the network learns collision-free waypoints while satisfying kinematic constraints, enabling gradient back-propagation for end-to-end training. Our experimental results demonstrate that iKap achieves higher success rates and reduced latency compared to the state-of-the-art methods. Besides the complete system, iKap offers a visual-to-planning network that seamlessly works with various controllers, providing a robust solution for robots navigating complex environments.
comment: 6 pages, 6 figures
♻ ☆ Dexterous Contact-Rich Manipulation via the Contact Trust Region
What is a good local description of contact dynamics for contact-rich manipulation, and where can we trust this local description? While many approaches often rely on the Taylor approximation of dynamics with an ellipsoidal trust region, we argue that such approaches are fundamentally inconsistent with the unilateral nature of contact. As a remedy, we present the Contact Trust Region (CTR), which captures the unilateral nature of contact while remaining efficient for computation. With CTR, we first develop a Model-Predictive Control (MPC) algorithm capable of synthesizing local contact-rich plans. Then, we extend this capability to plan globally by stitching together local MPC plans, enabling efficient and dexterous contact-rich manipulation. To verify the performance of our method, we perform comprehensive evaluations, both in high-fidelity simulation and on hardware, on two contact-rich systems: a planar IiwaBimanual system and a 3D AllegroHand system. On both systems, our method offers a significantly lower-compute alternative to existing RL-based approaches to contact-rich manipulation. In particular, our Allegro in-hand manipulation policy, in the form of a roadmap, takes fewer than 10 minutes to build offline on a standard laptop using just its CPU, with online inference taking just a few seconds. Experiment data, video and code are available at ctr.theaiinstitute.com.
♻ ☆ Adaptive Multirobot Virtual Structure Control using Dual Quaternions
This paper presents a control strategy based on dual quaternions for the coordinated formation flying of small UAV groups. A virtual structure is employed to define the desired formation, enabling unified control of its position, orientation, and shape. This abstraction makes formation management easier by allowing a low-level controller to compute individual UAV commands efficiently. The proposed controller integrates a pose control module with a geometry-based adaptive strategy, ensuring precise and robust task execution. The effectiveness of the approach is demonstrated through both simulation and experimental results.
♻ ☆ Integrated Shape-Force Estimation for Continuum Robots: A Virtual-Work and Polynomial-Curvature Framework
Cable-driven continuum robots (CDCRs) are widely used in surgical and inspection tasks that require dexterous manipulation in confined spaces. Existing model-based estimation methods either assume constant curvature or rely on geometry-space interpolants, both of which struggle with accuracy under large deformations and sparse sensing. This letter introduces an integrated shape-force estimation framework that combines cable-tension measurements with tip-pose data to reconstruct backbone shape and estimate external tip force simultaneously. The framework employs polynomial curvature kinematics (PCK) and a virtual-work-based static formulation expressed directly in curvature space, where polynomial modal coefficients serve as generalized coordinates. The proposed method is validated through Cosserat-rod-based simulations and hardware experiments on a torque-cell-enabled CDCR prototype. Results show that the second-order PCK model achieves superior shape and force accuracy, combining a lightweight shape optimization with a closed-form, iteration-free force estimation, offering a compact and robust alternative to prior constant-curvature and geometry-space approaches.
♻ ☆ Interactive Identification of Granular Materials using Force Measurements
Despite the potential the ability to identify granular materials creates for applications such as robotic cooking or earthmoving, granular material identification remains a challenging area, existing methods mostly relying on shaking the materials in closed containers. This work presents an interactive material identification framework that enables robots to identify a wide range of granular materials using only force-torque measurements. Unlike prior works, the proposed approach uses direct interaction with the materials. The approach is evaluated through experiments with a real-world dataset comprising 11 granular materials, which we also make publicly available. Results show that our method can identify a wide range of granular materials with near-perfect accuracy while relying solely on force measurements obtained from direct interaction. Further, our comprehensive data analysis and experiments show that a high-performancefeature space must combine features related to the force signal's time-domain dynamics and frequency spectrum. We account for this by proposing a combination of the raw signal and its high-frequency magnitude histogram as the suggested feature space representation. We show that the proposed feature space outperforms baselines by a significant margin. The code and data set are available at: https://irobotics.aalto.fi/identify_granular/.
comment: Accepted to 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC)
♻ ☆ Closing the Intent-to-Behavior Gap via Fulfillment Priority Logic
Practitioners designing reinforcement learning policies face a fundamental challenge: translating intended behavioral objectives into representative reward functions. This challenge stems from behavioral intent requiring simultaneous achievement of multiple competing objectives, typically addressed through labor-intensive linear reward composition that yields brittle results. Consider the ubiquitous robotics scenario where performance maximization directly conflicts with energy conservation. Such competitive dynamics are resistant to simple linear reward combinations. In this paper, we present the concept of objective fulfillment upon which we build Fulfillment Priority Logic (FPL). FPL allows practitioners to define logical formula representing their intentions and priorities within multi-objective reinforcement learning. Our novel Balanced Policy Gradient algorithm leverages FPL specifications to achieve up to 500\% better sample efficiency compared to Soft Actor Critic. Notably, this work constitutes the first implementation of non-linear utility scalarization design, specifically for continuous control problems.
♻ ☆ The Difference between the Left and Right Invariant Extended Kalman Filter
The extended Kalman filter (EKF) has been the industry standard for state estimation problems over the past sixty years. The Invariant Extended Kalman Filter (IEKF) is a recent development of the EKF for the class of group-affine systems on Lie groups that has shown superior performance for inertial navigation problems. The IEKF comes in two versions, left- and right- handed respectively, and there is a perception in the robotics community that these filters are different and one should choose the handedness of the IEKF to match handedness of the measurement model for a given filtering problem. In this paper, we revisit these algorithms and demonstrate that the left- and right- IEKF algorithms (with reset step) are identical, that is, the choice of the handedness does not affect the IEKF's performance when the reset step is properly implemented. The reset step was not originally proposed as part of the IEKF, however, we provide simulations to show that the reset step improves asymptotic performance of all versions of the the filter, and should be included in all high performance algorithms. The GNSS-aided inertial navigation system (INS) is used as a motivating example to demonstrate the equivalence of the two filters.
comment: 20 pages, 4 figures, submitted to Control Engineering Practice
♻ ☆ Learning Terrain-Specialized Policies for Adaptive Locomotion in Challenging Environments
Legged robots must exhibit robust and agile locomotion across diverse, unstructured terrains, a challenge exacerbated under blind locomotion settings where terrain information is unavailable. This work introduces a hierarchical reinforcement learning framework that leverages terrain-specialized policies and curriculum learning to enhance agility and tracking performance in complex environments. We validated our method on simulation, where our approach outperforms a generalist policy by up to 16% in success rate and achieves lower tracking errors as the velocity target increases, particularly on low-friction and discontinuous terrains, demonstrating superior adaptability and robustness across mixed-terrain scenarios.
comment: Accepted to the 22nd International Conference on Advanced Robotics (ICAR 2025). 7 pages
♻ ☆ End-to-End Crop Row Navigation via LiDAR-Based Deep Reinforcement Learning
Reliable navigation in under-canopy agricultural environments remains a challenge due to GNSS unreliability, cluttered rows, and variable lighting. To address these limitations, we present an end-to-end learning-based navigation system that maps raw 3D LiDAR data directly to control commands using a deep reinforcement learning policy trained entirely in simulation. Our method includes a voxel-based downsampling strategy that reduces LiDAR input size by 95.83%, enabling efficient policy learning without relying on labeled datasets or manually designed control interfaces. The policy was validated in simulation, achieving a 100% success rate in straight-row plantations and showing a gradual decline in performance as row curvature increased, tested across varying sinusoidal frequencies and amplitudes.
comment: Accepted to the 22nd International Conference on Advanced Robotics (ICAR 2025). 7 pages
♻ ☆ Multi-Objective Planning with Contextual Lexicographic Reward Preferences AAMAS
Autonomous agents are often required to plan under multiple objectives whose preference ordering varies based on context. The agent may encounter multiple contexts during its course of operation, each imposing a distinct lexicographic ordering over the objectives, with potentially different reward functions associated with each context. Existing approaches to multi-objective planning typically consider a single preference ordering over the objectives, across the state space, and do not support planning under multiple objective orderings within an environment. We present Contextual Lexicographic Markov Decision Process (CLMDP), a framework that enables planning under varying lexicographic objective orderings, depending on the context. In a CLMDP, both the objective ordering at a state and the associated reward functions are determined by the context. We employ a Bayesian approach to infer a state-context mapping from expert trajectories. Our algorithm to solve a CLMDP first computes a policy for each objective ordering and then combines them into a single context-aware policy that is valid and cycle-free. The effectiveness of the proposed approach is evaluated in simulation and using a mobile robot.
comment: 9 pages, 5 figures, 2 tables, To appear in Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS) 2025
♻ ☆ MarsLGPR: Mars Rover Localization with Ground Penetrating Radar
In this work, we propose the use of Ground Penetrating Radar (GPR) for rover localization on Mars. Precise pose estimation is an important task for mobile robots exploring planetary surfaces, as they operate in GPS-denied environments. Although visual odometry provides accurate localization, it is computationally expensive and can fail in dim or high-contrast lighting. Wheel encoders can also provide odometry estimation, but are prone to slipping on the sandy terrain encountered on Mars. Although traditionally a scientific surveying sensor, GPR has been used on Earth for terrain classification and localization through subsurface feature matching. The Perseverance rover and the upcoming ExoMars rover have GPR sensors already equipped to aid in the search of water and mineral resources. We propose to leverage GPR to aid in Mars rover localization. Specifically, we develop a novel GPR-based deep learning model that predicts 1D relative pose translation. We fuse our GPR pose prediction method with inertial and wheel encoder data in a filtering framework to output rover localization. We perform experiments in a Mars analog environment and demonstrate that our GPR-based displacement predictions both outperform wheel encoders and improve multi-modal filtering estimates in high-slip environments. Lastly, we present the first dataset aimed at GPR-based localization in Mars analog environments, which will be made publicly available at https://umfieldrobotics.github.io/marslgpr.
comment: IEEE Transactions on Field Robotics (2025)
Computer Vision and Pattern Recognition 153
☆ Markerless Augmented Reality Registration for Surgical Guidance: A Multi-Anatomy Clinical Accuracy Study
Purpose: In this paper, we develop and clinically evaluate a depth-only, markerless augmented reality (AR) registration pipeline on a head-mounted display, and assess accuracy across small or low-curvature anatomies in real-life operative settings. Methods: On HoloLens 2, we align Articulated HAnd Tracking (AHAT) depth to Computed Tomography (CT)-derived skin meshes via (i) depth-bias correction, (ii) brief human-in-the-loop initialization, (iii) global and local registration. We validated the surface-tracing error metric by comparing "skin-to-bone" relative distances to CT ground truth on leg and foot models, using an AR-tracked tool. We then performed seven intraoperative target trials (feet x2, ear x3, leg x2) during the initial stage of fibula free-flap harvest and mandibular reconstruction surgery, and collected 500+ data per trial. Results: Preclinical validation showed tight agreement between AR-traced and CT distances (leg: median |Delta d| 0.78 mm, RMSE 0.97 mm; feet: 0.80 mm, 1.20 mm). Clinically, per-point error had a median of 3.9 mm. Median errors by anatomy were 3.2 mm (feet), 4.3 mm (ear), and 5.3 mm (lower leg), with 5 mm coverage 92-95%, 84-90%, and 72-86%, respectively. Feet vs. lower leg differed significantly (Delta median ~1.1 mm; p < 0.001). Conclusion: A depth-only, markerless AR pipeline on HMDs achieved ~3-4 mm median error across feet, ear, and lower leg in live surgical settings without fiducials, approaching typical clinical error thresholds for moderate-risk tasks. Human-guided initialization plus global-to-local registration enabled accurate alignment on small or low-curvature targets, improving the clinical readiness of markerless AR guidance.
☆ Opto-Electronic Convolutional Neural Network Design Via Direct Kernel Optimization
Opto-electronic neural networks integrate optical front-ends with electronic back-ends to enable fast and energy-efficient vision. However, conventional end-to-end optimization of both the optical and electronic modules is limited by costly simulations and large parameter spaces. We introduce a two-stage strategy for designing opto-electronic convolutional neural networks (CNNs): first, train a standard electronic CNN, then realize the optical front-end implemented as a metasurface array through direct kernel optimization of its first convolutional layer. This approach reduces computational and memory demands by hundreds of times and improves training stability compared to end-to-end optimization. On monocular depth estimation, the proposed two-stage design achieves twice the accuracy of end-to-end training under the same training time and resource constraints.
☆ Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis
Creation of large-scale databases for Visual Question Answering tasks pertaining to the text data in a scene (text-VQA) involves skilful human annotation, which is tedious and challenging. With the advent of foundation models that handle vision and language modalities, and with the maturity of OCR systems, it is the need of the hour to establish an end-to-end pipeline that can synthesize Question-Answer (QA) pairs based on scene-text from a given image. We propose a pipeline for automated synthesis for text-VQA dataset that can produce faithful QA pairs, and which scales up with the availability of scene text data. Our proposed method harnesses the capabilities of multiple models and algorithms involving OCR detection and recognition (text spotting), region of interest (ROI) detection, caption generation, and question generation. These components are streamlined into a cohesive pipeline to automate the synthesis and validation of QA pairs. To the best of our knowledge, this is the first pipeline proposed to automatically synthesize and validate a large-scale text-VQA dataset comprising around 72K QA pairs based on around 44K images.
comment: First two authors contributed equally
☆ Towards Selection of Large Multimodal Models as Engines for Burned-in Protected Health Information Detection in Medical Images
The detection of Protected Health Information (PHI) in medical imaging is critical for safeguarding patient privacy and ensuring compliance with regulatory frameworks. Traditional detection methodologies predominantly utilize Optical Character Recognition (OCR) models in conjunction with named entity recognition. However, recent advancements in Large Multimodal Model (LMM) present new opportunities for enhanced text extraction and semantic analysis. In this study, we systematically benchmark three prominent closed and open-sourced LMMs, namely GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B, utilizing two distinct pipeline configurations: one dedicated to text analysis alone and another integrating both OCR and semantic analysis. Our results indicate that LMM exhibits superior OCR efficacy (WER: 0.03-0.05, CER: 0.02-0.03) compared to conventional models like EasyOCR. However, this improvement in OCR performance does not consistently correlate with enhanced overall PHI detection accuracy. The strongest performance gains are observed on test cases with complex imprint patterns. In scenarios where text regions are well readable with sufficient contrast, and strong LMMs are employed for text analysis after OCR, different pipeline configurations yield similar results. Furthermore, we provide empirically grounded recommendations for LMM selection tailored to specific operational constraints and propose a deployment strategy that leverages scalable and modular infrastructure.
comment: Submitted to ISBI 2026
☆ Locally-Supervised Global Image Restoration
We address the problem of image reconstruction from incomplete measurements, encompassing both upsampling and inpainting, within a learning-based framework. Conventional supervised approaches require fully sampled ground truth data, while self-supervised methods allow incomplete ground truth but typically rely on random sampling that, in expectation, covers the entire image. In contrast, we consider fixed, deterministic sampling patterns with inherently incomplete coverage, even in expectation. To overcome this limitation, we exploit multiple invariances of the underlying image distribution, which theoretically allows us to achieve the same reconstruction performance as fully supervised approaches. We validate our method on optical-resolution image upsampling in photoacoustic microscopy (PAM), demonstrating competitive or superior results while requiring substantially less ground truth data.
☆ Assessing the value of Geo-Foundational Models for Flood Inundation Mapping: Benchmarking models for Sentinel-1, Sentinel-2, and Planetscope for end-users
Geo-Foundational Models (GFMs) enable fast and reliable extraction of spatiotemporal information from satellite imagery, improving flood inundation mapping by leveraging location and time embeddings. Despite their potential, it remains unclear whether GFMs outperform traditional models like U-Net. A systematic comparison across sensors and data availability scenarios is still lacking, which is an essential step to guide end-users in model selection. To address this, we evaluate three GFMs, Prithvi 2.0, Clay V1.5, DOFA, and UViT (a Prithvi variant), against TransNorm, U-Net, and Attention U-Net using PlanetScope, Sentinel-1, and Sentinel-2. We observe competitive performance among all GFMs, with only 2-5% variation between the best and worst models across sensors. Clay outperforms others on PlanetScope (0.79 mIoU) and Sentinel-2 (0.70), while Prithvi leads on Sentinel-1 (0.57). In leave-one-region-out cross-validation across five regions, Clay shows slightly better performance across all sensors (mIoU: 0.72(0.04), 0.66(0.07), 0.51(0.08)) compared to Prithvi (0.70(0.05), 0.64(0.09), 0.49(0.13)) and DOFA (0.67(0.07), 0.64(0.04), 0.49(0.09)) for PlanetScope, Sentinel-2, and Sentinel-1, respectively. Across all 19 sites, leave-one-region-out cross-validation reveals a 4% improvement by Clay compared to U-Net. Visual inspection highlights Clay's superior ability to retain fine details. Few-shot experiments show Clay achieves 0.64 mIoU on PlanetScope with just five training images, outperforming Prithvi (0.24) and DOFA (0.35). In terms of computational time, Clay is a better choice due to its smaller model size (26M parameters), making it ~3x faster than Prithvi (650M) and 2x faster than DOFA (410M). Contrary to previous findings, our results suggest GFMs offer small to moderate improvements in flood mapping accuracy at lower computational cost and labeling effort compared to traditional U-Net.
☆ TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
The frontier of visual reasoning is shifting toward models like OpenAI o3, which can intelligently create and operate tools to transform images for problem-solving, also known as thinking-\textit{with}-images in chain-of-thought. Yet existing benchmarks fail to fully capture this advanced capability. Even Visual Search, the most common benchmark for current thinking-\textit{with}-images methods, tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning. We introduce \textbf{TIR-Bench}, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation in chain-of-thought. We evaluate 22 multimodal large language models (MLLMs), from leading open-sourced and proprietary models to those with explicit tool-use augmentation. Results show that TIR-Bench is universally challenging, and strong performance requires genuine thinking-with-images capabilities. Finally, we present a pilot study comparing direct versus agentic fine-tuning.
comment: Preprint
☆ SciTextures: Collecting and Connecting Visual Patterns, Models, and Code Across Science and Art
The ability to connect visual patterns with the processes that form them represents one of the deepest forms of visual understanding. Textures of clouds and waves, the growth of cities and forests, or the formation of materials and landscapes are all examples of patterns emerging from underlying mechanisms. We present the Scitextures dataset, a large-scale collection of textures and visual patterns from all domains of science, tech, and art, along with the models and code that generate these images. Covering over 1,200 different models and 100,000 images of patterns and textures from physics, chemistry, biology, sociology, technology, mathematics, and art, this dataset offers a way to explore the connection between the visual patterns that shape our world and the mechanisms that produce them. Created by an agentic AI pipeline that autonomously collects and implements models in standardized form, we use SciTextures to evaluate the ability of leading AI models to link visual patterns to the models and code that generate them, and to identify different patterns that emerged from the same process. We also test AIs ability to infer and recreate the mechanisms behind visual patterns by providing a natural image of a real-world pattern and asking the AI to identify, model, and code the mechanism that formed the pattern, then run this code to generate a simulated image that is compared to the real image. These benchmarks show that vision-language models (VLMs) can understand and simulate the physical system beyond a visual pattern. The dataset and code are available at: https://zenodo.org/records/17485502
☆ PROPEX-RAG: Enhanced GraphRAG using Prompt-Driven Prompt Execution
Retrieval-Augmented Generation (RAG) has become a robust framework for enhancing Large Language Models (LLMs) with external knowledge. Recent advances in RAG have investigated graph based retrieval for intricate reasoning; however, the influence of prompt design on enhancing the retrieval and reasoning process is still considerably under-examined. In this paper, we present a prompt-driven GraphRAG framework that underscores the significance of prompt formulation in facilitating entity extraction, fact selection, and passage reranking for multi-hop question answering. Our approach creates a symbolic knowledge graph from text data by encoding entities and factual relationships as structured facts triples. We use LLMs selectively during online retrieval to perform semantic filtering and answer generation. We also use entity-guided graph traversal through Personalized PageRank (PPR) to support efficient, scalable retrieval based on the knowledge graph we built. Our system gets state-of-the-art performance on HotpotQA and 2WikiMultiHopQA, with F1 scores of 80.7% and 78.9%, and Recall@5 scores of 97.1% and 98.1%, respectively. These results show that prompt design is an important part of improving retrieval accuracy and response quality. This research lays the groundwork for more efficient and comprehensible multi-hop question-answering systems, highlighting the importance of prompt-aware graph reasoning.
comment: Accepted in PReMI 2025
☆ Fractional Diffusion Bridge Models NeurIPS 2025
We present Fractional Diffusion Bridge Models (FDBM), a novel generative diffusion bridge framework driven by an approximation of the rich and non-Markovian fractional Brownian motion (fBM). Real stochastic processes exhibit a degree of memory effects (correlations in time), long-range dependencies, roughness and anomalous diffusion phenomena that are not captured in standard diffusion or bridge modeling due to the use of Brownian motion (BM). As a remedy, leveraging a recent Markovian approximation of fBM (MA-fBM), we construct FDBM that enable tractable inference while preserving the non-Markovian nature of fBM. We prove the existence of a coupling-preserving generative diffusion bridge and leverage it for future state prediction from paired training data. We then extend our formulation to the Schr\"{o}dinger bridge problem and derive a principled loss function to learn the unpaired data translation. We evaluate FDBM on both tasks: predicting future protein conformations from aligned data, and unpaired image translation. In both settings, FDBM achieves superior performance compared to the Brownian baselines, yielding lower root mean squared deviation (RMSD) of C$_\alpha$ atomic positions in protein structure prediction and lower Fr\'echet Inception Distance (FID) in unpaired image translation.
comment: To appear in NeurIPS 2025 proceedings. This version includes post-camera-ready revisions
☆ How Far Are Surgeons from Surgical World Models? A Pilot Study on Zero-shot Surgical Video Generation with Expert Assessment
Foundation models in video generation are demonstrating remarkable capabilities as potential world models for simulating the physical world. However, their application in high-stakes domains like surgery, which demand deep, specialized causal knowledge rather than general physical rules, remains a critical unexplored gap. To systematically address this challenge, we present SurgVeo, the first expert-curated benchmark for video generation model evaluation in surgery, and the Surgical Plausibility Pyramid (SPP), a novel, four-tiered framework tailored to assess model outputs from basic appearance to complex surgical strategy. On the basis of the SurgVeo benchmark, we task the advanced Veo-3 model with a zero-shot prediction task on surgical clips from laparoscopic and neurosurgical procedures. A panel of four board-certified surgeons evaluates the generated videos according to the SPP. Our results reveal a distinct "plausibility gap": while Veo-3 achieves exceptional Visual Perceptual Plausibility, it fails critically at higher levels of the SPP, including Instrument Operation Plausibility, Environment Feedback Plausibility, and Surgical Intent Plausibility. This work provides the first quantitative evidence of the chasm between visually convincing mimicry and causal understanding in surgical AI. Our findings from SurgVeo and the SPP establish a crucial foundation and roadmap for developing future models capable of navigating the complexities of specialized, real-world healthcare domains.
☆ UniLION: Towards Unified Autonomous Driving Model with Linear Group RNNs
Although transformers have demonstrated remarkable capabilities across various domains, their quadratic attention mechanisms introduce significant computational overhead when processing long-sequence data. In this paper, we present a unified autonomous driving model, UniLION, which efficiently handles large-scale LiDAR point clouds, high-resolution multi-view images, and even temporal sequences based on the linear group RNN operator (i.e., performs linear RNN for grouped features). Remarkably, UniLION serves as a single versatile architecture that can seamlessly support multiple specialized variants (i.e., LiDAR-only, temporal LiDAR, multi-modal, and multi-modal temporal fusion configurations) without requiring explicit temporal or multi-modal fusion modules. Moreover, UniLION consistently delivers competitive and even state-of-the-art performance across a wide range of core tasks, including 3D perception (e.g., 3D object detection, 3D object tracking, 3D occupancy prediction, BEV map segmentation), prediction (e.g., motion prediction), and planning (e.g., end-to-end planning). This unified paradigm naturally simplifies the design of multi-modal and multi-task autonomous driving systems while maintaining superior performance. Ultimately, we hope UniLION offers a fresh perspective on the development of 3D foundation models in autonomous driving. Code is available at https://github.com/happinesslz/UniLION
☆ Wonder3D++: Cross-domain Diffusion for High-fidelity 3D Generation from a Single Image
In this work, we introduce \textbf{Wonder3D++}, a novel method for efficiently generating high-fidelity textured meshes from single-view images. Recent methods based on Score Distillation Sampling (SDS) have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inferences, but their results are often of low quality and lack geometric details. To holistically improve the quality, consistency, and efficiency of single-view reconstruction tasks, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure the consistency of generation, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a cascaded 3D mesh extraction algorithm that drives high-quality surfaces from the multi-view 2D representations in only about $3$ minute in a coarse-to-fine manner. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and good efficiency compared to prior works. Code available at https://github.com/xxlong0/Wonder3D/tree/Wonder3D_Plus.
comment: 21 pages, 19 figures, accepted by TPAMI
☆ HGFreNet: Hop-hybrid GraphFomer for 3D Human Pose Estimation with Trajectory Consistency in Frequency Domain
2D-to-3D human pose lifting is a fundamental challenge for 3D human pose estimation in monocular video, where graph convolutional networks (GCNs) and attention mechanisms have proven to be inherently suitable for encoding the spatial-temporal correlations of skeletal joints. However, depth ambiguity and errors in 2D pose estimation lead to incoherence in the 3D trajectory. Previous studies have attempted to restrict jitters in the time domain, for instance, by constraining the differences between adjacent frames while neglecting the global spatial-temporal correlations of skeletal joint motion. To tackle this problem, we design HGFreNet, a novel GraphFormer architecture with hop-hybrid feature aggregation and 3D trajectory consistency in the frequency domain. Specifically, we propose a hop-hybrid graph attention (HGA) module and a Transformer encoder to model global joint spatial-temporal correlations. The HGA module groups all $k$-hop neighbors of a skeletal joint into a hybrid group to enlarge the receptive field and applies the attention mechanism to discover the latent correlations of these groups globally. We then exploit global temporal correlations by constraining trajectory consistency in the frequency domain. To provide 3D information for depth inference across frames and maintain coherence over time, a preliminary network is applied to estimate the 3D pose. Extensive experiments were conducted on two standard benchmark datasets: Human3.6M and MPI-INF-3DHP. The results demonstrate that the proposed HGFreNet outperforms state-of-the-art (SOTA) methods in terms of positional accuracy and temporal consistency.
☆ 3EED: Ground Everything Everywhere in 3D NeurIPS 2025
Visual grounding in 3D is the key for embodied agents to localize language-referred objects in open-world environments. However, existing benchmarks are limited to indoor focus, single-platform constraints, and small scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We provide over 128,000 objects and 22,000 validated referring expressions across diverse outdoor scenes -- 10x larger than existing datasets. We develop a scalable annotation pipeline combining vision-language model prompting with human verification to ensure high-quality spatial grounding. To support cross-platform learning, we propose platform-aware normalization and cross-modal alignment techniques, and establish benchmark protocols for in-domain and cross-platform evaluations. Our findings reveal significant performance gaps, highlighting the challenges and opportunities of generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released to advance future research in language-driven 3D embodied perception.
comment: NeurIPS 2025 DB Track; 29 pages, 17 figures, 10 tables; Project Page at https://project-3eed.github.io/
☆ Toward Strategy Identification and Subtask Decomposition In Task Exploration
This research builds on work in anticipatory human-machine interaction, a subfield of human-machine interaction where machines can facilitate advantageous interactions by anticipating a user's future state. The aim of this research is to further a machine's understanding of user knowledge, skill, and behavior in pursuit of implicit coordination. A task explorer pipeline was developed that uses clustering techniques, paired with factor analysis and string edit distance, to automatically identify key global and local strategies that are used to complete tasks. Global strategies identify generalized sets of actions used to complete tasks, while local strategies identify sequences that used those sets of actions in a similar composition. Additionally, meaningful subtasks of various lengths are identified within the tasks. The task explorer pipeline was able to automatically identify key strategies used to complete tasks and encode user runs with hierarchical subtask structures. In addition, a Task Explorer application was developed to easily review pipeline results. The task explorer pipeline can be easily modified to any action-based time-series data and the identified strategies and subtasks help to inform humans and machines on user knowledge, skill, and behavior.
☆ Probabilistic Robustness for Free? Revisiting Training via a Benchmark
Deep learning models are notoriously vulnerable to imperceptible perturbations. Most existing research centers on adversarial robustness (AR), which evaluates models under worst-case scenarios by examining the existence of deterministic adversarial examples (AEs). In contrast, probabilistic robustness (PR) adopts a statistical perspective, measuring the probability that predictions remain correct under stochastic perturbations. While PR is widely regarded as a practical complement to AR, dedicated training methods for improving PR are still relatively underexplored, albeit with emerging progress. Among the few PR-targeted training methods, we identify three limitations: i non-comparable evaluation protocols; ii limited comparisons to strong AT baselines despite anecdotal PR gains from AT; and iii no unified framework to compare the generalization of these methods. Thus, we introduce PRBench, the first benchmark dedicated to evaluating improvements in PR achieved by different robustness training methods. PRBench empirically compares most common AT and PR-targeted training methods using a comprehensive set of metrics, including clean accuracy, PR and AR performance, training efficiency, and generalization error (GE). We also provide theoretical analysis on the GE of PR performance across different training methods. Main findings revealed by PRBench include: AT methods are more versatile than PR-targeted training methods in terms of improving both AR and PR performance across diverse hyperparameter settings, while PR-targeted training methods consistently yield lower GE and higher clean accuracy. A leaderboard comprising 222 trained models across 7 datasets and 10 model architectures is publicly available at https://tmpspace.github.io/PRBenchLeaderboard/.
☆ Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process
Vision-language-action (VLA) models aim to understand natural language instructions and visual observations and to execute corresponding actions as an embodied agent. Recent work integrates future images into the understanding-acting loop, yielding unified VLAs that jointly understand, generate, and act -- reading text and images and producing future images and actions. However, these models either rely on external experts for modality unification or treat image generation and action prediction as separate processes, limiting the benefits of direct synergy between these tasks. Our core philosophy is to optimize generation and action jointly through a synchronous denoising process, where the iterative refinement enables actions to evolve from initialization, under constant and sufficient visual guidance. We ground this philosophy in our proposed Unified Diffusion VLA and Joint Discrete Denoising Diffusion Process (JD3P), which is a joint diffusion process that integrates multiple modalities into a single denoising trajectory to serve as the key mechanism enabling understanding, generation, and acting to be intrinsically synergistic. Our model and theory are built on a unified tokenized space of all modalities and a hybrid attention mechanism. We further propose a two-stage training pipeline and several inference-time techniques that optimize performance and efficiency. Our approach achieves state-of-the-art performance on benchmarks such as CALVIN, LIBERO, and SimplerEnv with 4$\times$ faster inference than autoregressive methods, and we demonstrate its effectiveness through in-depth analysis and real-world evaluations. Our project page is available at https://irpn-eai.github.io/UD-VLA.github.io/.
☆ Learnable Fractional Reaction-Diffusion Dynamics for Under-Display ToF Imaging and Beyond
Under-display ToF imaging aims to achieve accurate depth sensing through a ToF camera placed beneath a screen panel. However, transparent OLED (TOLED) layers introduce severe degradations-such as signal attenuation, multi-path interference (MPI), and temporal noise-that significantly compromise depth quality. To alleviate this drawback, we propose Learnable Fractional Reaction-Diffusion Dynamics (LFRD2), a hybrid framework that combines the expressive power of neural networks with the interpretability of physical modeling. Specifically, we implement a time-fractional reaction-diffusion module that enables iterative depth refinement with dynamically generated differential orders, capturing long-term dependencies. In addition, we introduce an efficient continuous convolution operator via coefficient prediction and repeated differentiation to further improve restoration quality. Experiments on four benchmark datasets demonstrate the effectiveness of our approach. The code is publicly available at https://github.com/wudiqx106/LFRD2.
☆ Progressive Translation of H&E to IHC with Enhanced Structural Fidelity
Compared to hematoxylin-eosin (H&E) staining, immunohistochemistry (IHC) not only maintains the structural features of tissue samples, but also provides high-resolution protein localization, which is essential for aiding in pathology diagnosis. Despite its diagnostic value, IHC remains a costly and labor-intensive technique. Its limited scalability and constraints in multiplexing further hinder widespread adoption, especially in resource-limited settings. Consequently, researchers are increasingly exploring computational stain translation techniques to synthesize IHC-equivalent images from H&E-stained slides, aiming to extract protein-level information more efficiently and cost-effectively. However, most existing stain translation techniques rely on a linearly weighted summation of multiple loss terms within a single objective function, strategy that often overlooks the interdepedence among these components-resulting in suboptimal image quality and an inability to simultaneously preserve structural authenticity and color fidelity. To address this limitation, we propose a novel network architecture that follows a progressive structure, incorporating color and cell border generation logic, which enables each visual aspect to be optimized in a stage-wise and decoupled manner. To validate the effectiveness of our proposed network architecture, we build upon the Adaptive Supervised PatchNCE (ASP) framework as our baseline. We introduce additional loss functions based on 3,3'-diaminobenzidine (DAB) chromogen concentration and image gradient, enhancing color fidelity and cell boundary clarity in the generated IHC images. By reconstructing the generation pipeline using our structure-color-cell boundary progressive mechanism, experiments on HER2 and ER datasets demonstrated that the model significantly improved visual quality and achieved finer structural details.
☆ UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback NeurIPS 2025
Relighting is a crucial task with both practical demand and artistic value, and recent diffusion models have shown strong potential by enabling rich and controllable lighting effects. However, as they are typically optimized in semantic latent space, where proximity does not guarantee physical correctness in visual space, they often produce unrealistic results, such as overexposed highlights, misaligned shadows, and incorrect occlusions. We address this with UniLumos, a unified relighting framework for both images and videos that brings RGB-space geometry feedback into a flow matching backbone. By supervising the model with depth and normal maps extracted from its outputs, we explicitly align lighting effects with the scene structure, enhancing physical plausibility. Nevertheless, this feedback requires high-quality outputs for supervision in visual space, making standard multi-step denoising computationally expensive. To mitigate this, we employ path consistency learning, allowing supervision to remain effective even under few-step training regimes. To enable fine-grained relighting control and supervision, we design a structured six-dimensional annotation protocol capturing core illumination attributes. Building upon this, we propose LumosBench, a disentangled attribute-level benchmark that evaluates lighting controllability via large vision-language models, enabling automatic and interpretable assessment of relighting precision across individual dimensions. Extensive experiments demonstrate that UniLumos achieves state-of-the-art relighting quality with significantly improved physical consistency, while delivering a 20x speedup for both image and video relighting. Code is available at https://github.com/alibaba-damo-academy/Lumos-Custom.
comment: NeurIPS 2025
☆ Enhancing Diffusion-based Restoration Models via Difficulty-Adaptive Reinforcement Learning with IQA Reward
Reinforcement Learning (RL) has recently been incorporated into diffusion models, e.g., tasks such as text-to-image. However, directly applying existing RL methods to diffusion-based image restoration models is suboptimal, as the objective of restoration fundamentally differs from that of pure generation: it places greater emphasis on fidelity. In this paper, we investigate how to effectively integrate RL into diffusion-based restoration models. First, through extensive experiments with various reward functions, we find that an effective reward can be derived from an Image Quality Assessment (IQA) model, instead of intuitive ground-truth-based supervision, which has already been optimized during the Supervised Fine-Tuning (SFT) stage prior to RL. Moreover, our strategy focuses on using RL for challenging samples that are significantly distant from the ground truth, and our RL approach is innovatively implemented using MLLM-based IQA models to align distributions with high-quality images initially. As the samples approach the ground truth's distribution, RL is adaptively combined with SFT for more fine-grained alignment. This dynamic process is facilitated through an automatic weighting strategy that adjusts based on the relative difficulty of the training samples. Our strategy is plug-and-play that can be seamlessly applied to diffusion-based restoration models, boosting its performance across various restoration tasks. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our proposed RL framework.
☆ Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models
Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved 2D visual understanding, prompting interest in their application to complex 3D reasoning tasks. However, it remains unclear whether these models can effectively capture the detailed spatial information required for robust real-world performance, especially cross-view consistency, a key requirement for accurate 3D reasoning. Considering this issue, we introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy: first, foundational knowledge is injected to the baseline MLLM via Supervised Fine-Tuning (SFT) on Viewpoint-100K, resulting in significant improvements across multiple tasks; second, generalization is enhanced through Reinforcement Learning using the Group Relative Policy Optimization (GRPO) algorithm on a broader set of questions. Additionally, we introduce a hybrid cold-start initialization method designed to simultaneously learn viewpoint representations and maintain coherent reasoning thinking. Experimental results show that our approach significantly activates the spatial reasoning ability of MLLM, improving performance on both in-domain and out-of-domain reasoning tasks. Our findings highlight the value of developing foundational spatial skills in MLLMs, supporting future progress in robotics, autonomous systems, and 3D scene understanding.
☆ Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers
In the retrieval domain, candidates' fusion from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. While typical fusion techniques are training-free, they rely solely on rank or score signals, disregarding candidates' representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that re-thinks list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles to enable list-wise reasoning over video candidates. ViC is evaluated both as a single-list reranker, where it dramatically improves the precision of individual retrievers, and as an ensemble fuser, where it consistently outperforms strong baselines like CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the framework establishes new state-of-the-art zero-shot retrieval performance, demonstrating its effectiveness in handling complex visual and temporal signals alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1% (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, representing massive gains of up to +40 Recall@1 over previous state-of-the-art baselines. We present ViC as a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers. Code and resources are publicly available at: https://github.com/mohammad2012191/ViC
☆ Benchmark-Ready 3D Anatomical Shape Classification MICCAI 2025
Progress in anatomical 3D shape classification is limited by the complexity of mesh data and the lack of standardized benchmarks, highlighting the need for robust learning methods and reproducible evaluation. We introduce two key steps toward clinically and benchmark-ready anatomical shape classification via self-supervised graph autoencoding. We propose Precomputed Structural Pooling (PSPooling), a non-learnable mesh pooling operator designed for efficient and structure-preserving graph coarsening in 3D anatomical shape analysis. PSPooling precomputes node correspondence sets based on geometric proximity, enabling parallelizable and reversible pooling and unpooling operations with guaranteed support structure. This design avoids the sparsity and reconstruction issues of selection-based methods and the sequential overhead of edge contraction approaches, making it particularly suitable for high-resolution medical meshes. To demonstrate its effectiveness, we integrate PSPooling into a self-supervised graph autoencoder that learns anatomy-aware representations from unlabeled surface meshes. We evaluate the downstream benefits on MedShapeNet19, a new curated benchmark dataset we derive from MedShapeNet, consisting of 19 anatomical classes with standardized training, validation, and test splits. Experiments show that PSPooling significantly improves reconstruction fidelity and classification accuracy in low-label regimes, establishing a strong baseline for medical 3D shape learning. We hope that MedShapeNet19 will serve as a widely adopted benchmark for anatomical shape classification and further research in medical 3D shape analysis. Access the complete codebase, model weights, and dataset information here: https://github.com/TomasKrsicka/MedShapeNet19-PSPooling.
comment: Shape in Medical Imaging, ShapeMI 2025, Held in Conjunction with MICCAI 2025
☆ DINO-MX: A Modular & Flexible Framework for Self-Supervised Learning
Vision Foundation Models (VFMs) have advanced representation learning through self-supervised methods. However, existing training pipelines are often inflexible, domain-specific, or computationally expensive, which limits their usability across different domains and resource settings. DINO-MX is a modular and extensible training framework that combines the core principles of DINO, DINOv2 and DINOv3 within a unified configuration-driven system. It supports a variety of transformer-based architectures and is fully compatible with the Hugging Face ecosystem. The framework includes multiple training strategies such as low-rank adaptation (LoRA), layer freezing, and knowledge distillation, along with support for distributed training through both Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP). DINO-MX is designed to work with both natural and specialized data types, including single- and multi-channel images. Experimental results on diverse datasets show that DINO-MX achieves competitive performance while significantly reducing computational costs. Additionally, it offers interpretability tools and a label-guided data augmentation method that improves attention-based localization without the need for extra detection or segmentation heads. DINO-MX provides a reproducible and scalable foundation for developing, adapting, and benchmarking self-supervised vision models across a range of research and real-world applications.
☆ Lite ENSAM: a lightweight cancer segmentation model for 3D Computed Tomography
Accurate tumor size measurement is a cornerstone of evaluating cancer treatment response. The most widely adopted standard for this purpose is the Response Evaluation Criteria in Solid Tumors (RECIST) v1.1, which relies on measuring the longest tumor diameter in a single plane. However, volumetric measurements have been shown to provide a more reliable assessment of treatment effect. Their clinical adoption has been limited, though, due to the labor-intensive nature of manual volumetric annotation. In this paper, we present Lite ENSAM, a lightweight adaptation of the ENSAM architecture designed for efficient volumetric tumor segmentation from CT scans annotated with RECIST annotations. Lite ENSAM was submitted to the MICCAI FLARE 2025 Task 1: Pan-cancer Segmentation in CT Scans, Subtask 2, where it achieved a Dice Similarity Coefficient (DSC) of 60.7% and a Normalized Surface Dice (NSD) of 63.6% on the hidden test set, and an average total RAM time of 50.6 GBs and an average inference time of 14.4 s on CPU on the public validation dataset.
☆ MARS: Multi-Agent Robotic System with Multimodal Large Language Models for Assistive Intelligence
Multimodal large language models (MLLMs) have shown remarkable capabilities in cross-modal understanding and reasoning, offering new opportunities for intelligent assistive systems, yet existing systems still struggle with risk-aware planning, user personalization, and grounding language plans into executable skills in cluttered homes. We introduce MARS - a Multi-Agent Robotic System powered by MLLMs for assistive intelligence and designed for smart home robots supporting people with disabilities. The system integrates four agents: a visual perception agent for extracting semantic and spatial features from environment images, a risk assessment agent for identifying and prioritizing hazards, a planning agent for generating executable action sequences, and an evaluation agent for iterative optimization. By combining multimodal perception with hierarchical multi-agent decision-making, the framework enables adaptive, risk-aware, and personalized assistance in dynamic indoor environments. Experiments on multiple datasets demonstrate the superior overall performance of the proposed system in risk-aware planning and coordinated multi-agent execution compared with state-of-the-art multimodal models. The proposed approach also highlights the potential of collaborative AI for practical assistive scenarios and provides a generalizable methodology for deploying MLLM-enabled multi-agent systems in real-world environments.
comment: 3 figures, 1 table; under review at Multimedia Systems (Springer)
☆ Wave-Particle (Continuous-Discrete) Dualistic Visual Tokenization for Unified Understanding and Generation
The unification of understanding and generation within a single multi-modal large model (MLLM) remains one significant challenge, largely due to the dichotomy between continuous and discrete visual tokenizations. Continuous tokenizer (CT) achieves strong performance by bridging multiple independently-trained understanding modules and generation modules, but suffers from complex multi-stage pipelines and substantial engineering overhead. Conversely, discrete tokenizers (DT) offer a conceptually elegant idea by quantizing each image into a primitive, but inevitably leading to information loss and performance degradation. To resolve this tension, we question the binary choice between CT and DT, inspired by the wave-particle duality of light, and propose the Continuous-Discrete Dualistic Visual Tokenizer (CDD-VT). We treat visual data as a flexible composition of image primitives derived from quantized codebooks, with the crucial insight that the primitive number assigned to each visual sample is adaptively determined according to its complexity: simple instances use a few primitives, emulating discrete tokenization, while complex instances use many, approximating continuous tokenization. Two core components are designed: Diverse Quantitative Primitives, which encourage primitives orthogonality to better populate information space, and Dynamic Primitive Allocator, which assesses sample complexity to determine the optimal set of primitives. Extensive experiments on reconstruction, retrieval and classification show that CDD-VT achieves superior performance over to specialized CT and DT, effectively getting strong result within a concise and scalable MLLM.
☆ Explore More, Learn Better: Parallel MLLM Embeddings under Mutual Information Minimization
Embedding models are a cornerstone of modern AI. Driven by Multimodal Large Language Models (MLLMs), they have made great progress in architecture and data curation, while the holistic paradigm is still limited to SSC, i.e., single input, singular embedding, contrastive supervision, which collapses rich, multifaceted inputs into monolithic embeddings and fails to fully exploit MLLM capabilities. In this paper, we tailor one Parallel Decoupling Framework (PDF) for multimodal embedding learning, by utilizing the proprietary steerability of MLLMs, i.e., their ability to flexibly generate quite differentiated response under explicit instructions. Concretely, PDF conditions a shared MLLM backbone on distinct, learnable prefixes to roll out multiple parallel paths for one input, then relies on these paths to obtain parallel embeddings. To promote full parallel diversity, we employ Mutual Information Minimization (MIM) as an explicit constraint, coupled with per-path contrastive supervision to maintain semantic alignment. Such dual-objectives force PDF to yield robust semantic coverage and a generalizable embedding space. Ultimately, the remarkable embedding space are accessible at inference via one single forward pass, incurring negligible computational overhead. We instantiate PDF on multiple MLLM backbones and prove its effectiveness on MMEB benchmark. Significant gains are consistently achieved across various resolutions and model sizes, e.g., boosting the VLM2Vec-LLaVA-1.6-LR model by a remarkable +8.9% (7B), while the VLM2Vec-Qwen2VL models by +4.2% (2B) and +3.1% (7B). In terms of efficiency, our 2B model surpasses its baseline by +2.6% using only half the computational budget.
☆ Generative Adversarial Synthesis and Deep Feature Discrimination of Brain Tumor MRI Images
Compared to traditional methods, Deep Learning (DL) becomes a key technology for computer vision tasks. Synthetic data generation is an interesting use case for DL, especially in the field of medical imaging such as Magnetic Resonance Imaging (MRI). The need for this task since the original MRI data is limited. The generation of realistic medical images is completely difficult and challenging. Generative Adversarial Networks (GANs) are useful for creating synthetic medical images. In this paper, we propose a DL based methodology for creating synthetic MRI data using the Deep Convolutional Generative Adversarial Network (DC-GAN) to address the problem of limited data. We also employ a Convolutional Neural Network (CNN) classifier to classify the brain tumor using synthetic data and real MRI data. CNN is used to evaluate the quality and utility of the synthetic images. The classification result demonstrates comparable performance on real and synthetic images, which validates the effectiveness of GAN-generated images for downstream tasks.
comment: 9 pagers, 8 Figures
☆ PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model
Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. However, current VLAs are mostly trained on large-scale image-text-action data and remain limited in two key ways: (i) they struggle with pixel-level scene understanding, and (ii) they rely heavily on textual prompts, which reduces their flexibility in real-world settings. To address these challenges, we introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompting encoder. To train PixelVLA effectively, we further propose a two-stage automated annotation pipeline that generates Pixel-160K, a large-scale dataset with pixel-level annotations derived from existing robot data. Experiments on three standard VLA benchmarks and two VLA model variants show that PixelVLA improves manipulation success rates by 10.1%-17.8% over OpenVLA, while requiring only 1.5% of its pretraining cost. These results demonstrate that PixelVLA can be integrated into existing VLAs to enable more accurate, efficient, and versatile robot control in complex environments. The dataset and code will be released as open source.
comment: 17pages,7 figures, 5 tabels
☆ NOA: a versatile, extensible tool for AI-based organoid analysis
AI tools can greatly enhance the analysis of organoid microscopy images, from detection and segmentation to feature extraction and classification. However, their limited accessibility to biologists without programming experience remains a major barrier, resulting in labor-intensive and largely manual workflows. Although a few AI models for organoid analysis have been developed, most existing tools remain narrowly focused on specific tasks. In this work, we introduce the Napari Organoid Analyzer (NOA), a general purpose graphical user interface to simplify AI-based organoid analysis. NOA integrates modules for detection, segmentation, tracking, feature extraction, custom feature annotation and ML-based feature prediction. It interfaces multiple state-of-the-art algorithms and is implemented as an open-source napari plugin for maximal flexibility and extensibility. We demonstrate the versatility of NOA through three case studies, involving the quantification of morphological changes during organoid differentiation, assessment of phototoxicity effects, and prediction of organoid viability and differentiation state. Together, these examples illustrate how NOA enables comprehensive, AI-driven organoid image analysis within an accessible and extensible framework.
☆ PCD-ReID: Occluded Person Re-Identification for Base Station Inspection
Occluded pedestrian re-identification (ReID) in base station environments is a critical task in computer vision, particularly for surveillance and security applications. This task faces numerous challenges, as occlusions often obscure key body features, increasing the complexity of identification. Traditional ResNet-based ReID algorithms often fail to address occlusions effectively, necessitating new ReID methods. We propose the PCD-ReID (Pedestrian Component Discrepancy) algorithm to address these issues. The contributions of this work are as follows: To tackle the occlusion problem, we design a Transformer-based PCD network capable of extracting shared component features, such as helmets and uniforms. To mitigate overfitting on public datasets, we collected new real-world patrol surveillance images for model training, covering six months, 10,000 individuals, and over 50,000 images. Comparative experiments with existing ReID algorithms demonstrate that our model achieves a mean Average Precision (mAP) of 79.0% and a Rank-1 accuracy of 82.7%, marking a 15.9% Rank-1 improvement over ResNet50-based methods. Experimental evaluations indicate that PCD-ReID effectively achieves occlusion-aware ReID performance for personnel in tower inspection scenarios, highlighting its potential for practical deployment in surveillance and security applications.
comment: 11 pages, 7 figures
☆ Driving scenario generation and evaluation using a structured layer representation and foundational models
Rare and challenging driving scenarios are critical for autonomous vehicle development. Since they are difficult to encounter, simulating or generating them using generative models is a popular approach. Following previous efforts to structure driving scenario representations in a layer model, we propose a structured five-layer model to improve the evaluation and generation of rare scenarios. We use this model alongside large foundational models to generate new driving scenarios using a data augmentation strategy. Unlike previous representations, our structure introduces subclasses and characteristics for every agent of the scenario, allowing us to compare them using an embedding specific to our layer-model. We study and adapt two metrics to evaluate the relevance of a synthetic dataset in the context of a structured representation: the diversity score estimates how different the scenarios of a dataset are from one another, while the originality score calculates how similar a synthetic dataset is from a real reference set. This paper showcases both metrics in different generation setup, as well as a qualitative evaluation of synthetic videos generated from structured scenario descriptions. The code and extended results can be found at https://github.com/Valgiz/5LMSG.
☆ NSYNC: Negative Synthetic Image Generation for Contrastive Training to Improve Stylized Text-To-Image Translation
Current text conditioned image generation methods output realistic looking images, but they fail to capture specific styles. Simply finetuning them on the target style datasets still struggles to grasp the style features. In this work, we present a novel contrastive learning framework to improve the stylization capability of large text-to-image diffusion models. Motivated by the astonishing advance in image generation models that makes synthetic data an intrinsic part of model training in various computer vision tasks, we exploit synthetic image generation in our approach. Usually, the generated synthetic data is dependent on the task, and most of the time it is used to enlarge the available real training dataset. With NSYNC, alternatively, we focus on generating negative synthetic sets to be used in a novel contrastive training scheme along with real positive images. In our proposed training setup, we forward negative data along with positive data and obtain negative and positive gradients, respectively. We then refine the positive gradient by subtracting its projection onto the negative gradient to get the orthogonal component, based on which the parameters are updated. This orthogonal component eliminates the trivial attributes that are present in both positive and negative data and directs the model towards capturing a more unique style. Experiments on various styles of painters and illustrators show that our approach improves the performance over the baseline methods both quantitatively and qualitatively. Our code is available at https://github.com/giddyyupp/NSYNC.
comment: Under review
☆ Example-Based Feature Painting on Textures
In this work, we propose a system that covers the complete workflow for achieving controlled authoring and editing of textures that present distinctive local characteristics. These include various effects that change the surface appearance of materials, such as stains, tears, holes, abrasions, discoloration, and more. Such alterations are ubiquitous in nature, and including them in the synthesis process is crucial for generating realistic textures. We introduce a novel approach for creating textures with such blemishes, adopting a learning-based approach that leverages unlabeled examples. Our approach does not require manual annotations by the user; instead, it detects the appearance-altering features through unsupervised anomaly detection. The various textural features are then automatically clustered into semantically coherent groups, which are used to guide the conditional generation of images. Our pipeline as a whole goes from a small image collection to a versatile generative model that enables the user to interactively create and paint features on textures of arbitrary size. Notably, the algorithms we introduce for diffusion-based editing and infinite stationary texture generation are generic and should prove useful in other contexts as well. Project page: https://reality.tf.fau.de/pub/ardelean2025examplebased.html
comment: "\c{opyright} 2025 Andrei-Timotei Ardelean, Tim Weyrich. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Trans. Graph., Vol. 44, No. 6, https://doi.org/10.1145/3763301
☆ Luminance-Aware Statistical Quantization: Unsupervised Hierarchical Learning for Illumination Enhancement NeurIPS 2025
Low-light image enhancement (LLIE) faces persistent challenges in balancing reconstruction fidelity with cross-scenario generalization. While existing methods predominantly focus on deterministic pixel-level mappings between paired low/normal-light images, they often neglect the continuous physical process of luminance transitions in real-world environments, leading to performance drop when normal-light references are unavailable. Inspired by empirical analysis of natural luminance dynamics revealing power-law distributed intensity transitions, this paper introduces Luminance-Aware Statistical Quantification (LASQ), a novel framework that reformulates LLIE as a statistical sampling process over hierarchical luminance distributions. Our LASQ re-conceptualizes luminance transition as a power-law distribution in intensity coordinate space that can be approximated by stratified power functions, therefore, replacing deterministic mappings with probabilistic sampling over continuous luminance layers. A diffusion forward process is designed to autonomously discover optimal transition paths between luminance layers, achieving unsupervised distribution emulation without normal-light references. In this way, it considerably improves the performance in practical situations, enabling more adaptable and versatile light restoration. This framework is also readily applicable to cases with normal-light references, where it achieves superior performance on domain-specific datasets alongside better generalization-ability across non-reference datasets.
comment: Accepted at NeurIPS 2025
☆ Discriminately Treating Motion Components Evolves Joint Depth and Ego-Motion Learning
Unsupervised learning of depth and ego-motion, two fundamental 3D perception tasks, has made significant strides in recent years. However, most methods treat ego-motion as an auxiliary task, either mixing all motion types or excluding depth-independent rotational motions in supervision. Such designs limit the incorporation of strong geometric constraints, reducing reliability and robustness under diverse conditions. This study introduces a discriminative treatment of motion components, leveraging the geometric regularities of their respective rigid flows to benefit both depth and ego-motion estimation. Given consecutive video frames, network outputs first align the optical axes and imaging planes of the source and target cameras. Optical flows between frames are transformed through these alignments, and deviations are quantified to impose geometric constraints individually on each ego-motion component, enabling more targeted refinement. These alignments further reformulate the joint learning process into coaxial and coplanar forms, where depth and each translation component can be mutually derived through closed-form geometric relationships, introducing complementary constraints that improve depth robustness. DiMoDE, a general depth and ego-motion joint learning framework incorporating these designs, achieves state-of-the-art performance on multiple public datasets and a newly collected diverse real-world dataset, particularly under challenging conditions. Our source code will be publicly available at mias.group/DiMoDE upon publication.
comment: 18 pages, 14 figures
☆ SE(3)-PoseFlow: Estimating 6D Pose Distributions for Uncertainty-Aware Robotic Manipulation
Object pose estimation is a fundamental problem in robotics and computer vision, yet it remains challenging due to partial observability, occlusions, and object symmetries, which inevitably lead to pose ambiguity and multiple hypotheses consistent with the same observation. While deterministic deep networks achieve impressive performance under well-constrained conditions, they are often overconfident and fail to capture the multi-modality of the underlying pose distribution. To address these challenges, we propose a novel probabilistic framework that leverages flow matching on the SE(3) manifold for estimating 6D object pose distributions. Unlike existing methods that regress a single deterministic output, our approach models the full pose distribution with a sample-based estimate and enables reasoning about uncertainty in ambiguous cases such as symmetric objects or severe occlusions. We achieve state-of-the-art results on Real275, YCB-V, and LM-O, and demonstrate how our sample-based pose estimates can be leveraged in downstream robotic manipulation tasks such as active perception for disambiguating uncertain viewpoints or guiding grasp synthesis in an uncertainty-aware manner.
☆ EPAN: Robust Pedestrian Re-Identification via Enhanced Alignment Network for IoT Surveillance
Person re-identification (ReID) plays a pivotal role in computer vision, particularly in surveillance and security applications within IoT-enabled smart environments. This study introduces the Enhanced Pedestrian Alignment Network (EPAN), tailored for robust ReID across diverse IoT surveillance conditions. EPAN employs a dual-branch architecture to mitigate the impact of perspective and environmental changes, extracting alignment information under varying scales and viewpoints. Here, we demonstrate EPAN's strong feature extraction capabilities, achieving outstanding performance on the Inspection-Personnel dataset with a Rank-1 accuracy of 90.09% and a mean Average Precision (mAP) of 78.82%. This highlights EPAN's potential for real-world IoT applications, enabling effective and reliable person ReID across diverse cameras in surveillance and security systems. The code and data are available at: https://github.com/ggboy2580/EPAN
comment: 12 page, 5 figures
☆ SecDiff: Diffusion-Aided Secure Deep Joint Source-Channel Coding Against Adversarial Attacks
Deep joint source-channel coding (JSCC) has emerged as a promising paradigm for semantic communication, delivering significant performance gains over conventional separate coding schemes. However, existing JSCC frameworks remain vulnerable to physical-layer adversarial threats, such as pilot spoofing and subcarrier jamming, compromising semantic fidelity. In this paper, we propose SecDiff, a plug-and-play, diffusion-aided decoding framework that significantly enhances the security and robustness of deep JSCC under adversarial wireless environments. Different from prior diffusion-guided JSCC methods that suffer from high inference latency, SecDiff employs pseudoinverse-guided sampling and adaptive guidance weighting, enabling flexible step-size control and efficient semantic reconstruction. To counter jamming attacks, we introduce a power-based subcarrier masking strategy and recast recovery as a masked inpainting problem, solved via diffusion guidance. For pilot spoofing, we formulate channel estimation as a blind inverse problem and develop an expectation-minimization (EM)-driven reconstruction algorithm, guided jointly by reconstruction loss and a channel operator. Notably, our method alternates between pilot recovery and channel estimation, enabling joint refinement of both variables throughout the diffusion process. Extensive experiments over orthogonal frequency-division multiplexing (OFDM) channels under adversarial conditions show that SecDiff outperforms existing secure and generative JSCC baselines by achieving a favorable trade-off between reconstruction quality and computational cost. This balance makes SecDiff a promising step toward practical, low-latency, and attack-resilient semantic communications.
comment: 13 pages, 6 figures
☆ HMVLM: Human Motion-Vision-Lanuage Model via MoE LoRA
The expansion of instruction-tuning data has enabled foundation language models to exhibit improved instruction adherence and superior performance across diverse downstream tasks. Semantically-rich 3D human motion is being progressively integrated with these foundation models to enhance multimodal understanding and cross-modal generation capabilities. However, the modality gap between human motion and text raises unresolved concerns about catastrophic forgetting during this integration. In addition, developing autoregressive-compatible pose representations that preserve generalizability across heterogeneous downstream tasks remains a critical technical barrier. To address these issues, we propose the Human Motion-Vision-Language Model (HMVLM), a unified framework based on the Mixture of Expert Low-Rank Adaption(MoE LoRA) strategy. The framework leverages the gating network to dynamically allocate LoRA expert weights based on the input prompt, enabling synchronized fine-tuning of multiple tasks. To mitigate catastrophic forgetting during instruction-tuning, we introduce a novel zero expert that preserves the pre-trained parameters for general linguistic tasks. For pose representation, we implement body-part-specific tokenization by partitioning the human body into different joint groups, enhancing the spatial resolution of the representation. Experiments show that our method effectively alleviates knowledge forgetting during instruction-tuning and achieves remarkable performance across diverse human motion downstream tasks.
comment: 10 pages, 5figures. The Thirty-Ninth Annual Conference on Neural Information Processing Systems
☆ Efficiently Training A Flat Neural Network Before It has been Quantizated
Post-training quantization (PTQ) for vision transformers (ViTs) has garnered significant attention due to its efficiency in compressing models. However, existing methods typically overlook the relationship between a well-trained NN and the quantized model, leading to considerable quantization error for PTQ. However, it is unclear how to efficiently train a model-agnostic neural network which is tailored for a predefined precision low-bit model. In this paper, we firstly discover that a flat full precision neural network is crucial for low-bit quantization. To achieve this, we propose a framework that proactively pre-conditions the model by measuring and disentangling the error sources. Specifically, both the Activation Quantization Error (AQE) and the Weight Quantization Error (WQE) are statistically modeled as independent Gaussian noises. We study several noise injection optimization methods to obtain a flat minimum. Experimental results attest to the effectiveness of our approach. These results open novel pathways for obtaining low-bit PTQ models.
comment: ongoing work, more results would be added
☆ When to Trust the Answer: Question-Aligned Semantic Nearest Neighbor Entropy for Safer Surgical VQA
Safety and reliability are essential for deploying Visual Question Answering (VQA) in surgery, where incorrect or ambiguous responses can harm the patient. Most surgical VQA research focuses on accuracy or linguistic quality while overlooking safety behaviors such as ambiguity awareness, referral to human experts, or triggering a second opinion. Inspired by Automatic Failure Detection (AFD), we study uncertainty estimation as a key enabler of safer decision making. We introduce Question Aligned Semantic Nearest Neighbor Entropy (QA-SNNE), a black box uncertainty estimator that incorporates question semantics into prediction confidence. It measures semantic entropy by comparing generated answers with nearest neighbors in a medical text embedding space, conditioned on the question. We evaluate five models, including domain specific Parameter-Efficient Fine-Tuned (PEFT) models and zero-shot Large Vision-Language Models (LVLMs), on EndoVis18-VQA and PitVQA. PEFT models degrade under mild paraphrasing, while LVLMs are more resilient. Across three LVLMs and two PEFT baselines, QA-SNNE improves AUROC in most in-template settings and enhances hallucination detection. The Area Under the ROC Curve (AUROC) increases by 15-38% for zero-shot models, with gains maintained under out-of-template stress. QA-SNNE offers a practical and interpretable step toward AFD in surgical VQA by linking semantic uncertainty to question context. Combining LVLM backbones with question aligned uncertainty estimation can improve safety and clinician trust. The code and model are available at https://github.com/DennisPierantozzi/QASNNE
☆ Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation
Recent studies have identified Direct Preference Optimization (DPO) as an efficient and reward-free approach to improving video generation quality. However, existing methods largely follow image-domain paradigms and are mainly developed on small-scale models (approximately 2B parameters), limiting their ability to address the unique challenges of video tasks, such as costly data construction, unstable training, and heavy memory consumption. To overcome these limitations, we introduce a GT-Pair that automatically builds high-quality preference pairs by using real videos as positives and model-generated videos as negatives, eliminating the need for any external annotation. We further present Reg-DPO, which incorporates the SFT loss as a regularization term into the DPO objective to enhance training stability and generation fidelity. Additionally, by combining the FSDP framework with multiple memory optimization techniques, our approach achieves nearly three times higher training capacity than using FSDP alone. Extensive experiments on both I2V and T2V tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches, delivering superior video generation quality.
☆ Privacy Preserving Ordinal-Meta Learning with VLMs for Fine-Grained Fruit Quality Prediction
To effectively manage the wastage of perishable fruits, it is crucial to accurately predict their freshness or shelf life using non-invasive methods that rely on visual data. In this regard, deep learning techniques can offer a viable solution. However, obtaining fine-grained fruit freshness labels from experts is costly, leading to a scarcity of data. Closed proprietary Vision Language Models (VLMs), such as Gemini, have demonstrated strong performance in fruit freshness detection task in both zero-shot and few-shot settings. Nonetheless, food retail organizations are unable to utilize these proprietary models due to concerns related to data privacy, while existing open-source VLMs yield sub-optimal performance for the task. Fine-tuning these open-source models with limited data fails to achieve the performance levels of proprietary models. In this work, we introduce a Model-Agnostic Ordinal Meta-Learning (MAOML) algorithm, designed to train smaller VLMs. This approach utilizes meta-learning to address data sparsity and leverages label ordinality, thereby achieving state-of-the-art performance in the fruit freshness classification task under both zero-shot and few-shot settings. Our method achieves an industry-standard accuracy of 92.71%, averaged across all fruits. Keywords: Fruit Quality Prediction, Vision Language Models, Meta Learning, Ordinal Regression
comment: 9 pages, 1 figure, 4 tables
☆ Contrast-Guided Cross-Modal Distillation for Thermal Object Detection
Robust perception at night remains challenging for thermal-infrared detection: low contrast and weak high-frequency cues lead to duplicate, overlapping boxes, missed small objects, and class confusion. Prior remedies either translate TIR to RGB and hope pixel fidelity transfers to detection -- making performance fragile to color or structure artifacts -- or fuse RGB and TIR at test time, which requires extra sensors, precise calibration, and higher runtime cost. Both lines can help in favorable conditions, but do not directly shape the thermal representation used by the detector. We keep mono-modality inference and tackle the root causes during training. Specifically, we introduce training-only objectives that sharpen instance-level decision boundaries by pulling together features of the same class and pushing apart those of different classes -- suppressing duplicate and confusing detections -- and that inject cross-modal semantic priors by aligning the student's multi-level pyramid features with an RGB-trained teacher, thereby strengthening texture-poor thermal features without visible input at test time. In experiments, our method outperformed prior approaches and achieved state-of-the-art performance.
☆ Terrain-Enhanced Resolution-aware Refinement Attention for Off-Road Segmentation
Off-road semantic segmentation suffers from thick, inconsistent boundaries, sparse supervision for rare classes, and pervasive label noise. Designs that fuse only at low resolution blur edges and propagate local errors, whereas maintaining high-resolution pathways or repeating high-resolution fusions is costly and fragile to noise. We introduce a resolutionaware token decoder that balances global semantics, local consistency, and boundary fidelity under imperfect supervision. Most computation occurs at a low-resolution bottleneck; a gated cross-attention injects fine-scale detail, and only a sparse, uncertainty-selected set of pixels is refined. The components are co-designed and tightly integrated: global self-attention with lightweight dilated depthwise refinement restores local coherence; a gated cross-attention integrates fine-scale features from a standard high-resolution encoder stream without amplifying noise; and a class-aware point refinement corrects residual ambiguities with negligible overhead. During training, we add a boundary-band consistency regularizer that encourages coherent predictions in a thin neighborhood around annotated edges, with no inference-time cost. Overall, the results indicate competitive performance and improved stability across transitions.
☆ UniSOT: A Unified Framework for Multi-Modality Single Object Tracking
Single object tracking aims to localize target object with specific reference modalities (bounding box, natural language or both) in a sequence of specific video modalities (RGB, RGB+Depth, RGB+Thermal or RGB+Event.). Different reference modalities enable various human-machine interactions, and different video modalities are demanded in complex scenarios to enhance tracking robustness. Existing trackers are designed for single or several video modalities with single or several reference modalities, which leads to separate model designs and limits practical applications. Practically, a unified tracker is needed to handle various requirements. To the best of our knowledge, there is still no tracker that can perform tracking with these above reference modalities across these video modalities simultaneously. Thus, in this paper, we present a unified tracker, UniSOT, for different combinations of three reference modalities and four video modalities with uniform parameters. Extensive experimental results on 18 visual tracking, vision-language tracking and RGB+X tracking benchmarks demonstrate that UniSOT shows superior performance against modality-specific counterparts. Notably, UniSOT outperforms previous counterparts by over 3.0\% AUC on TNL2K across all three reference modalities and outperforms Un-Track by over 2.0\% main metric across all three RGB+X video modalities.
comment: The paper has been accepted by TPAMI
☆ Learning to Seek Evidence: A Verifiable Reasoning Agent with Causal Faithfulness Analysis CVPR
Explanations for AI models in high-stakes domains like medicine often lack verifiability, which can hinder trust. To address this, we propose an interactive agent that produces explanations through an auditable sequence of actions. The agent learns a policy to strategically seek external visual evidence to support its diagnostic reasoning. This policy is optimized using reinforcement learning, resulting in a model that is both efficient and generalizable. Our experiments show that this action-based reasoning process significantly improves calibrated accuracy, reducing the Brier score by 18\% compared to a non-interactive baseline. To validate the faithfulness of the agent's explanations, we introduce a causal intervention method. By masking the visual evidence the agent chooses to use, we observe a measurable degradation in its performance ($\Delta$Brier=+0.029), confirming that the evidence is integral to its decision-making process. Our work provides a practical framework for building AI systems with verifiable and faithful reasoning capabilities.
comment: 12 pages, 3 figures. Under review at the Conference on Computer Vision and Pattern Recognition (CVPR) 2026
☆ Towards One-step Causal Video Generation via Adversarial Self-Distillation
Recent hybrid video generation models combine autoregressive temporal dynamics with diffusion-based spatial denoising, but their sequential, iterative nature leads to error accumulation and long inference times. In this work, we propose a distillation-based framework for efficient causal video generation that enables high-quality synthesis with extremely limited denoising steps. Our approach builds upon the Distribution Matching Distillation (DMD) framework and proposes a novel Adversarial Self-Distillation (ASD) strategy, which aligns the outputs of the student model's n-step denoising process with its (n+1)-step version at the distribution level. This design provides smoother supervision by bridging small intra-student gaps and more informative guidance by combining teacher knowledge with locally consistent student behavior, substantially improving training stability and generation quality in extremely few-step scenarios (e.g., 1-2 steps). In addition, we present a First-Frame Enhancement (FFE) strategy, which allocates more denoising steps to the initial frames to mitigate error propagation while applying larger skipping steps to later frames. Extensive experiments on VBench demonstrate that our method surpasses state-of-the-art approaches in both one-step and two-step video generation. Notably, our framework produces a single distilled model that flexibly supports multiple inference-step settings, eliminating the need for repeated re-distillation and enabling efficient, high-quality video synthesis.
comment: Under double-blind review as a conference paper
☆ Extremal Contours: Gradient-driven contours for compact visual attribution
Faithful yet compact explanations for vision models remain a challenge, as commonly used dense perturbation masks are often fragmented and overfitted, needing careful post-processing. Here, we present a training-free explanation method that replaces dense masks with smooth tunable contours. A star-convex region is parameterized by a truncated Fourier series and optimized under an extremal preserve/delete objective using the classifier gradients. The approach guarantees a single, simply connected mask, cuts the number of free parameters by orders of magnitude, and yields stable boundary updates without cleanup. Restricting solutions to low-dimensional, smooth contours makes the method robust to adversarial masking artifacts. On ImageNet classifiers, it matches the extremal fidelity of dense masks while producing compact, interpretable regions with improved run-to-run consistency. Explicit area control also enables importance contour maps, yielding a transparent fidelity-area profiles. Finally, we extend the approach to multi-contour and show how it can localize multiple objects within the same framework. Across benchmarks, the method achieves higher relevance mass and lower complexity than gradient and perturbation based baselines, with especially strong gains on self-supervised DINO models where it improves relevance mass by over 15% and maintains positive faithfulness correlations.
☆ Semantic BIM enrichment for firefighting assets: Fire-ART dataset and panoramic image-based 3D reconstruction
Inventory management of firefighting assets is crucial for emergency preparedness, risk assessment, and on-site fire response. However, conventional methods are inefficient due to limited capabilities in automated asset recognition and reconstruction. To address the challenge, this research introduces the Fire-ART dataset and develops a panoramic image-based reconstruction approach for semantic enrichment of firefighting assets into BIM models. The Fire-ART dataset covers 15 fundamental assets, comprising 2,626 images and 6,627 instances, making it an extensive and publicly accessible dataset for asset recognition. In addition, the reconstruction approach integrates modified cube-map conversion and radius-based spherical camera projection to enhance recognition and localization accuracy. Through validations with two real-world case studies, the proposed approach achieves F1-scores of 73% and 88% and localization errors of 0.620 and 0.428 meters, respectively. The Fire-ART dataset and the reconstruction approach offer valuable resources and robust technical solutions to enhance the accurate digital management of fire safety equipment.
☆ SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment
Fine-grained cross-modal alignment aims to establish precise local correspondences between vision and language, forming a cornerstone for visual question answering and related multimodal applications. Current approaches face challenges in addressing patch redundancy and ambiguity, which arise from the inherent information density disparities across modalities. Recently, Multimodal Large Language Models (MLLMs) have emerged as promising solutions to bridge this gap through their robust semantic generation capabilities. However, the dense textual outputs from MLLMs may introduce conflicts with the original sparse captions. Furthermore, accurately quantifying semantic relevance between rich visual patches and concise textual descriptions remains a core challenge. To overcome these limitations, we introduce the Semantic-Enhanced Patch Slimming (SEPS) framework, which systematically addresses patch redundancy and ambiguity. Our approach employs a two-stage mechanism to integrate unified semantics from both dense and sparse texts, enabling the identification of salient visual patches. Additionally, it leverages relevance-aware selection with mean value computation to highlight crucial patch-word correspondences, thereby improving cross-modal similarity assessment. Comprehensive experiments on Flickr30K and MS-COCO datasets validate that SEPS achieves superior performance, surpassing existing approaches by 23\%-86\% in rSum across diverse model architectures, with notable enhancements in text-to-image retrieval scenarios. Our implementation is available at https://github.com/Sweet4tars/seps.git.
☆ EREBUS: End-to-end Robust Event Based Underwater Simulation ICRA
The underwater domain presents a vast array of challenges for roboticists and computer vision researchers alike, such as poor lighting conditions and high dynamic range scenes. In these adverse conditions, traditional vision techniques struggle to adapt and lead to suboptimal performance. Event-based cameras present an attractive solution to this problem, mitigating the issues of traditional cameras by tracking changes in the footage on a frame-by-frame basis. In this paper, we introduce a pipeline which can be used to generate realistic synthetic data of an event-based camera mounted to an AUV (Autonomous Underwater Vehicle) in an underwater environment for training vision models. We demonstrate the effectiveness of our pipeline using the task of rock detection with poor visibility and suspended particulate matter, but the approach can be generalized to other underwater tasks.
comment: Accepted to ICRA AQUA2SIM Workshop 2025, 6 pages, 3 figures, conference paper
☆ CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering
Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to effectively handle cross-modal semantic alignments between vision and language. Moreover, classification-based methods rely on predefined answer sets. Treating this task as a simple classification problem may make it unable to adapt to the diversity of free-form answers and overlook the detailed semantic information of free-form answers. In order to tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs through fine-grained visual-text feature alignment. CIFR captures cross-modal sequential interactions via cross-modal interleaved feature representation. FFAE leverages auxiliary knowledge from open-ended questions through free-form answer-enhanced multi-task learning, improving the model's capability for open-ended Med-VQA. Experimental results show that CMI-MTL outperforms the existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct more interpretability experiments to prove the effectiveness. The code is publicly available at https://github.com/BioMedIA-repo/CMI-MTL.
comment: The paper has been accepted by the 33rd Pacific Conference on Computer Graphics and Applications (Pacific Graphics 2025)
☆ Expanding the Content-Style Frontier: a Balanced Subspace Blending Approach for Content-Style LoRA Fusion
Recent advancements in text-to-image diffusion models have significantly improved the personalization and stylization of generated images. However, previous studies have only assessed content similarity under a single style intensity. In our experiments, we observe that increasing style intensity leads to a significant loss of content features, resulting in a suboptimal content-style frontier. To address this, we propose a novel approach to expand the content-style frontier by leveraging Content-Style Subspace Blending and a Content-Style Balance loss. Our method improves content similarity across varying style intensities, significantly broadening the content-style frontier. Extensive experiments demonstrate that our approach outperforms existing techniques in both qualitative and quantitative evaluations, achieving superior content-style trade-off with significantly lower Inverted Generational Distance (IGD) and Generational Distance (GD) scores compared to current methods.
☆ MIQ-SAM3D: From Single-Point Prompt to Multi-Instance Segmentation via Competitive Query Refinement
Accurate segmentation of medical images is fundamental to tumor diagnosis and treatment planning. SAM-based interactive segmentation has gained attention for its strong generalization, but most methods follow a single-point-to-single-object paradigm, which limits multi-lesion segmentation. Moreover, ViT backbones capture global context but often miss high-fidelity local details. We propose MIQ-SAM3D, a multi-instance 3D segmentation framework with a competitive query optimization strategy that shifts from single-point-to-single-mask to single-point-to-multi-instance. A prompt-conditioned instance-query generator transforms a single point prompt into multiple specialized queries, enabling retrieval of all semantically similar lesions across the 3D volume from a single exemplar. A hybrid CNN-Transformer encoder injects CNN-derived boundary saliency into ViT self-attention via spatial gating. A competitively optimized query decoder then enables end-to-end, parallel, multi-instance prediction through inter-query competition. On LiTS17 and KiTS21 dataset, MIQ-SAM3D achieved comparable levels and exhibits strong robustness to prompts, providing a practical solution for efficient annotation of clinically relevant multi-lesion cases.
☆ $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles
Understanding Rebus Puzzles (Rebus Puzzles use pictures, symbols, and letters to represent words or phrases creatively) requires a variety of skills such as image recognition, cognitive skills, commonsense reasoning, multi-step reasoning, image-based wordplay, etc., making this a challenging task for even current Vision-Language Models. In this paper, we present $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$, a large and diverse benchmark of $1,333$ English Rebus Puzzles containing different artistic styles and levels of difficulty, spread across 18 categories such as food, idioms, sports, finance, entertainment, etc. We also propose $RebusDescProgICE$, a model-agnostic framework which uses a combination of an unstructured description and code-based, structured reasoning, along with better, reasoning-based in-context example selection, improving the performance of Vision-Language Models on $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$ by $2.1-4.1\%$ and $20-30\%$ using closed-source and open-source models respectively compared to Chain-of-Thought Reasoning.
comment: 7 pages, 5 figures, 4 tables
☆ RDTE-UNet: A Boundary and Detail Aware UNet for Precise Medical Image Segmentation
Medical image segmentation is essential for computer-assisted diagnosis and treatment planning, yet substantial anatomical variability and boundary ambiguity hinder reliable delineation of fine structures. We propose RDTE-UNet, a segmentation network that unifies local modeling with global context to strengthen boundary delineation and detail preservation. RDTE-UNet employs a hybrid ResBlock detail-aware Transformer backbone and three modules: ASBE for adaptive boundary enhancement, HVDA for fine-grained feature modeling, and EulerFF for fusion weighting guided by Euler's formula. Together, these components improve structural consistency and boundary accuracy across morphology, orientation, and scale. On Synapse and BUSI dataset, RDTE-UNet has achieved a comparable level in terms of segmentation accuracy and boundary quality.
☆ A Generative Adversarial Approach to Adversarial Attacks Guided by Contrastive Language-Image Pre-trained Model
The rapid growth of deep learning has brought about powerful models that can handle various tasks, like identifying images and understanding language. However, adversarial attacks, an unnoticed alteration, can deceive models, leading to inaccurate predictions. In this paper, a generative adversarial attack method is proposed that uses the CLIP model to create highly effective and visually imperceptible adversarial perturbations. The CLIP model's ability to align text and image representation helps incorporate natural language semantics with a guided loss to generate effective adversarial examples that look identical to the original inputs. This integration allows extensive scene manipulation, creating perturbations in multi-object environments specifically designed to deceive multilabel classifiers. Our approach integrates the concentrated perturbation strategy from Saliency-based Auto-Encoder (SSAE) with the dissimilar text embeddings similar to Generative Adversarial Multi-Object Scene Attacks (GAMA), resulting in perturbations that both deceive classification models and maintain high structural similarity to the original images. The model was tested on various tasks across diverse black-box victim models. The experimental results show that our method performs competitively, achieving comparable or superior results to existing techniques, while preserving greater visual fidelity.
comment: 18 pages, 3 figures
☆ MVSMamba: Multi-View Stereo with State Space Model NeurIPS 2025
Robust feature representations are essential for learning-based Multi-View Stereo (MVS), which relies on accurate feature matching. Recent MVS methods leverage Transformers to capture long-range dependencies based on local features extracted by conventional feature pyramid networks. However, the quadratic complexity of Transformer-based MVS methods poses challenges to balance performance and efficiency. Motivated by the global modeling capability and linear complexity of the Mamba architecture, we propose MVSMamba, the first Mamba-based MVS network. MVSMamba enables efficient global feature aggregation with minimal computational overhead. To fully exploit Mamba's potential in MVS, we propose a Dynamic Mamba module (DM-module) based on a novel reference-centered dynamic scanning strategy, which enables: (1) Efficient intra- and inter-view feature interaction from the reference to source views, (2) Omnidirectional multi-view feature representations, and (3) Multi-scale global feature aggregation. Extensive experimental results demonstrate MVSMamba outperforms state-of-the-art MVS methods on the DTU dataset and the Tanks-and-Temples benchmark with both superior performance and efficiency. The source code is available at https://github.com/JianfeiJ/MVSMamba.
comment: Accepted by NeurIPS 2025
☆ Perturb a Model, Not an Image: Towards Robust Privacy Protection via Anti-Personalized Diffusion Models NeurIPS 2025
Recent advances in diffusion models have enabled high-quality synthesis of specific subjects, such as identities or objects. This capability, while unlocking new possibilities in content creation, also introduces significant privacy risks, as personalization techniques can be misused by malicious users to generate unauthorized content. Although several studies have attempted to counter this by generating adversarially perturbed samples designed to disrupt personalization, they rely on unrealistic assumptions and become ineffective in the presence of even a few clean images or under simple image transformations. To address these challenges, we shift the protection target from the images to the diffusion model itself to hinder the personalization of specific subjects, through our novel framework called Anti-Personalized Diffusion Models (APDM). We first provide a theoretical analysis demonstrating that a naive approach of existing loss functions to diffusion models is inherently incapable of ensuring convergence for robust anti-personalization. Motivated by this finding, we introduce Direct Protective Optimization (DPO), a novel loss function that effectively disrupts subject personalization in the target model without compromising generative quality. Moreover, we propose a new dual-path optimization strategy, coined Learning to Protect (L2P). By alternating between personalization and protection paths, L2P simulates future personalization trajectories and adaptively reinforces protection at each step. Experimental results demonstrate that our framework outperforms existing methods, achieving state-of-the-art performance in preventing unauthorized personalization. The code is available at https://github.com/KU-VGI/APDM.
comment: 26 pages, 9 figures, 16 tables, NeurIPS 2025
☆ REASON: Probability map-guided dual-branch fusion framework for gastric content assessment
Accurate assessment of gastric content from ultrasound is critical for stratifying aspiration risk at induction of general anesthesia. However, traditional methods rely on manual tracing of gastric antra and empirical formulas, which face significant limitations in both efficiency and accuracy. To address these challenges, a novel two-stage probability map-guided dual-branch fusion framework (REASON) for gastric content assessment is proposed. In stage 1, a segmentation model generates probability maps that suppress artifacts and highlight gastric anatomy. In stage 2, a dual-branch classifier fuses information from two standard views, right lateral decubitus (RLD) and supine (SUP), to improve the discrimination of learned features. Experimental results on a self-collected dataset demonstrate that the proposed framework outperforms current state-of-the-art approaches by a significant margin. This framework shows great promise for automated preoperative aspiration risk assessment, offering a more robust, efficient, and accurate solution for clinical practice.
comment: Under Review. 12 pages, 10 figures, 6 tables
☆ UniREditBench: A Unified Reasoning-based Image Editing Benchmark
Recent advances in multi-modal generative models have driven substantial improvements in image editing. However, current generative models still struggle with handling diverse and complex image editing tasks that require implicit reasoning, underscoring the need for a comprehensive benchmark to systematically assess their performance across various reasoning scenarios. Existing benchmarks primarily focus on single-object attribute transformation in realistic scenarios, which, while effective, encounter two key challenges: (1) they largely overlook multi-object interactions as well as game-world scenarios that involve human-defined rules, which are common in real-life applications; (2) they only rely on textual references to evaluate the generated images, potentially leading to systematic misjudgments, especially in complex reasoning scenarios. To this end, this work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. To improve evaluation reliability, we introduce multimodal dual-reference evaluation, providing both textual and ground-truth image references for each sample assessment. Furthermore, we design an automated multi-scenario data synthesis pipeline and construct UniREdit-Data-100K, a large-scale synthetic dataset with high-quality chain-of-thought (CoT) reasoning annotations. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings. Through thorough benchmarking of both open-source and closed-source image editing models, we reveal their strengths and weaknesses across various aspects.
comment: Project page: https://maplebb.github.io/UniREditBench
☆ Detecting Generated Images by Fitting Natural Image Distributions NeurIPS 2025
The increasing realism of generated images has raised significant concerns about their potential misuse, necessitating robust detection methods. Current approaches mainly rely on training binary classifiers, which depend heavily on the quantity and quality of available generated images. In this work, we propose a novel framework that exploits geometric differences between the data manifolds of natural and generated images. To exploit this difference, we employ a pair of functions engineered to yield consistent outputs for natural images but divergent outputs for generated ones, leveraging the property that their gradients reside in mutually orthogonal subspaces. This design enables a simple yet effective detection method: an image is identified as generated if a transformation along its data manifold induces a significant change in the loss value of a self-supervised model pre-trained on natural images. Further more, to address diminishing manifold disparities in advanced generative models, we leverage normalizing flows to amplify detectable differences by extruding generated images away from the natural image manifold. Extensive experiments demonstrate the efficacy of this method. Code is available at https://github.com/tmlr-group/ConV.
comment: 25 pages, 9 figures, NeurIPS 2025 spotlight
☆ Adaptation of Foundation Models for Medical Image Analysis: Strategies, Challenges, and Future Directions
Foundation models (FMs) have emerged as a transformative paradigm in medical image analysis, offering the potential to provide generalizable, task-agnostic solutions across a wide range of clinical tasks and imaging modalities. Their capacity to learn transferable representations from large-scale data has the potential to address the limitations of conventional task-specific models. However, adaptation of FMs to real-world clinical practice remains constrained by key challenges, including domain shifts, limited availability of high-quality annotated data, substantial computational demands, and strict privacy requirements. This review presents a comprehensive assessment of strategies for adapting FMs to the specific demands of medical imaging. We examine approaches such as supervised fine-tuning, domain-specific pretraining, parameter-efficient fine-tuning, self-supervised learning, hybrid methods, and multimodal or cross-modal frameworks. For each, we evaluate reported performance gains, clinical applicability, and limitations, while identifying trade-offs and unresolved challenges that prior reviews have often overlooked. Beyond these established techniques, we also highlight emerging directions aimed at addressing current gaps. These include continual learning to enable dynamic deployment, federated and privacy-preserving approaches to safeguard sensitive data, hybrid self-supervised learning to enhance data efficiency, data-centric pipelines that combine synthetic generation with human-in-the-loop validation, and systematic benchmarking to assess robust generalization under real-world clinical variability. By outlining these strategies and associated research gaps, this review provides a roadmap for developing adaptive, trustworthy, and clinically integrated FMs capable of meeting the demands of real-world medical imaging.
☆ PRevivor: Reviving Ancient Chinese Paintings using Prior-Guided Color Transformers
Ancient Chinese paintings are a valuable cultural heritage that is damaged by irreversible color degradation. Reviving color-degraded paintings is extraordinarily difficult due to the complex chemistry mechanism. Progress is further slowed by the lack of comprehensive, high-quality datasets, which hampers the creation of end-to-end digital restoration tools. To revive colors, we propose PRevivor, a prior-guided color transformer that learns from recent paintings (e.g., Ming and Qing Dynasty) to restore ancient ones (e.g., Tang and Song Dynasty). To develop PRevivor, we decompose color restoration into two sequential sub-tasks: luminance enhancement and hue correction. For luminance enhancement, we employ two variational U-Nets and a multi-scale mapping module to translate faded luminance into restored counterparts. For hue correction, we design a dual-branch color query module guided by localized hue priors extracted from faded paintings. Specifically, one branch focuses attention on regions guided by masked priors, enforcing localized hue correction, whereas the other branch remains unconstrained to maintain a global reasoning capability. To evaluate PRevivor, we conduct extensive experiments against state-of-the-art colorization methods. The results demonstrate superior performance both quantitatively and qualitatively.
☆ MotionStream: Real-Time Video Generation with Interactive Motion Controls
Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction. We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance, but does not perform inference on the fly. As such, we distill this bidirectional teacher into a causal student through Self Forcing with Distribution Matching Distillation, enabling real-time streaming inference. Several key challenges arise when generating videos of long, potentially infinite time-horizons: (1) bridging the domain gap from training on finite length and extrapolating to infinite horizons, (2) sustaining high quality by preventing error accumulation, and (3) maintaining fast inference, without incurring growth in computational cost due to increasing context windows. A key to our approach is introducing carefully designed sliding-window causal attention, combined with attention sinks. By incorporating self-rollout with attention sinks and KV cache rolling during training, we properly simulate inference-time extrapolations with a fixed context window, enabling constant-speed generation of arbitrarily long videos. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming. With MotionStream, users can paint trajectories, control cameras, or transfer motion, and see results unfold in real-time, delivering a truly interactive experience.
comment: Project webpage: https://joonghyuk.com/motionstream-web/
☆ Source-Only Cross-Weather LiDAR via Geometry-Aware Point Drop
LiDAR semantic segmentation degrades in adverse weather because refraction, scattering, and point dropouts corrupt geometry. Prior work in weather simulation, mixing-based augmentation, domain randomization, and uncertainty or boundary regularization improves robustness but still overlooks structural vulnerabilities near boundaries, corners, and sparse regions. We present a Light Geometry-aware adapter. The module aligns azimuth and applies horizontal circular padding to preserve neighbor continuity across the 0~360 degree wrap-around boundary. A local-window K-Nearest Neighbors gathers nearby points and computes simple local statistics, which are compressed into compact geometry-aware cues. During training, these cues drive region-aware regularization that stabilizes predictions in structurally fragile areas. The adapter is plug and play, complements augmentation, and can be enabled only during training with negligible inference cost. We adopt a source-only cross-weather setup where models train on SemanticKITTI and are evaluated on SemanticSTF without target labels or fine-tuning. The adapter improves mIoU by 7.9 percentage points over the data-centric augmentation baseline and by 0.6 points over the class-centric regularization baseline. These results indicate that geometry-driven regularization is a key direction for all-weather LiDAR segmentation.
☆ CenterMamba-SAM: Center-Prioritized Scanning and Temporal Prototypes for Brain Lesion Segmentation
Brain lesion segmentation remains challenging due to small, low-contrast lesions, anisotropic sampling, and cross-slice discontinuities. We propose CenterMamba-SAM, an end-to-end framework that freezes a pretrained backbone and trains only lightweight adapters for efficient fine-tuning. At its core is the CenterMamba encoder, which employs a novel 3x3 corner-axis-center short-sequence scanning strategy to enable center-prioritized, axis-reinforced, and diagonally compensated information aggregation. This design enhances sensitivity to weak boundaries and tiny foci while maintaining sparse yet effective feature representation. A memory-driven structural prompt generator maintains a prototype bank across neighboring slices, enabling automatic synthesis of reliable prompts without user interaction, thereby improving inter-slice coherence. The memory-augmented multi-scale decoder integrates memory attention modules at multiple levels, combining deep supervision with progressive refinement to restore fine details while preserving global consistency. Extensive experiments on public benchmarks demonstrate that CenterMamba-SAM achieves state-of-the-art performance in brain lesion segmentation.
☆ Beyond Deceptive Flatness: Dual-Order Solution for Strengthening Adversarial Transferability
Transferable attacks generate adversarial examples on surrogate models to fool unknown victim models, posing real-world threats and growing research interest. Despite focusing on flat losses for transferable adversarial examples, recent studies still fall into suboptimal regions, especially the flat-yet-sharp areas, termed as deceptive flatness. In this paper, we introduce a novel black-box gradient-based transferable attack from a perspective of dual-order information. Specifically, we feasibly propose Adversarial Flatness (AF) to the deceptive flatness problem and a theoretical assurance for adversarial transferability. Based on this, using an efficient approximation of our objective, we instantiate our attack as Adversarial Flatness Attack (AFA), addressing the altered gradient sign issue. Additionally, to further improve the attack ability, we devise MonteCarlo Adversarial Sampling (MCAS) by enhancing the inner-loop sampling efficiency. The comprehensive results on ImageNet-compatible dataset demonstrate superiority over six baselines, generating adversarial examples in flatter regions and boosting transferability across model architectures. When tested on input transformation attacks or the Baidu Cloud API, our method outperforms baselines.
comment: Accepted by Pattern Recognition in Nov 01,2025
☆ Eyes on Target: Gaze-Aware Object Detection in Egocentric Video
Human gaze offers rich supervisory signals for understanding visual attention in complex visual environments. In this paper, we propose Eyes on Target, a novel depth-aware and gaze-guided object detection framework designed for egocentric videos. Our approach injects gaze-derived features into the attention mechanism of a Vision Transformer (ViT), effectively biasing spatial feature selection toward human-attended regions. Unlike traditional object detectors that treat all regions equally, our method emphasises viewer-prioritised areas to enhance object detection. We validate our method on an egocentric simulator dataset where human visual attention is critical for task assessment, illustrating its potential in evaluating human performance in simulation scenarios. We evaluate the effectiveness of our gaze-integrated model through extensive experiments and ablation studies, demonstrating consistent gains in detection accuracy over gaze-agnostic baselines on both the custom simulator dataset and public benchmarks, including Ego4D Ego-Motion and Ego-CH-Gaze datasets. To interpret model behaviour, we also introduce a gaze-aware attention head importance metric, revealing how gaze cues modulate transformer attention dynamics.
comment: Accepted at RAAI 2025
☆ Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark
We review human evaluation practices in automated, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models -- each trained by its original authors -- across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results provide strong evidence that 1) newer models do not consistently outperform earlier approaches; 2) published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. Finally, in order to drive standardisation and enable new evaluation research, we will release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies -- enabling new evaluations without model reimplementation required -- alongside our open-source rendering script, and the 16,000 pairwise human preference votes collected for our benchmark.
comment: 23 pages, 10 figures. The last two authors made equal contributions
☆ Saliency-Guided Domain Adaptation for Left-Hand Driving in Autonomous Steering
Domain adaptation is required for automated driving models to generalize well across diverse road conditions. This paper explores a training method for domain adaptation to adapt PilotNet, an end-to-end deep learning-based model, for left-hand driving conditions using real-world Australian highway data. Four training methods were evaluated: (1) a baseline model trained on U.S. right-hand driving data, (2) a model trained on flipped U.S. data, (3) a model pretrained on U.S. data and then fine-tuned on Australian highways, and (4) a model pretrained on flipped U.S. data and then finetuned on Australian highways. This setup examines whether incorporating flipped data enhances the model adaptation by providing an initial left-hand driving alignment. The paper compares model performance regarding steering prediction accuracy and attention, using saliency-based analysis to measure attention shifts across significant road regions. Results show that pretraining on flipped data alone worsens prediction stability due to misaligned feature representations, but significantly improves adaptation when followed by fine-tuning, leading to lower prediction error and stronger focus on left-side cues. To validate this approach across different architectures, the same experiments were done on ResNet, which confirmed similar adaptation trends. These findings emphasize the importance of preprocessing techniques, such as flipped-data pretraining, followed by fine-tuning to improve model adaptation with minimal retraining requirements.
☆ Thought-For-Food: Reasoning Chain Induced Food Visual Question Answering
The immense diversity in the culture and culinary of Indian cuisines calls attention to the major shortcoming of the existing Visual Question Answering(VQA) systems which are inclined towards the foods from Western region. Recent attempt towards building a VQA dataset for Indian food is a step towards addressing this challenge. However, their approach towards VQA follows a two-step process in which the answer is generated first, followed by the explanation of the expected answer. In this work, we claim that food VQA requires to follow a multi-step reasoning process to arrive at an accurate answer, especially in the context of India food, which involves understanding complex culinary context and identifying relationships between various food items. With this hypothesis we create reasoning chains upon the QA with minimal human intervention. We fine-tune smaller LLMs and VLMs with auto-validated reasoning chains and further train them using reinforcement learning with larger data. With augmentation of reasoning chains, we observed accuracy improvement of an average 10 percentage points on the baseline. We provide detailed analysis in terms the effect of addition of reasoning chains for the Indian Food VQA task. Index Terms - FoodVQA, Reasoning Chains, Reinforcement Learning, Knowledge Graph.
comment: 10 pages, 11 figures, 6 tables
☆ OmniVLA: Unifiying Multi-Sensor Perception for Physically-Grounded Multimodal VLA
Vision-language-action (VLA) models have shown strong generalization for action prediction through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities for physically-grounded spatial intelligence beyond RGB perception. The core of our approach is the sensor-masked image, a unified representation that overlays spatially grounded and physically meaningful masks onto the RGB images, derived from sensors including an infrared camera, a mmWave radar, and a microphone array. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Built on this, we present a multisensory vision-language-action model architecture and train the model based on an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks where sensor-modality perception is needed to guide the manipulation. OmniVLA achieves an average task success rate of 84%, significantly outperforms both RGB-only and raw-sensor-input baseline models by 59% and 28% respectively, meanwhile showing higher learning efficiency and stronger generalization capability.
☆ MoSa: Motion Generation with Scalable Autoregressive Modeling
We introduce MoSa, a novel hierarchical motion generation framework for text-driven 3D human motion generation that enhances the Vector Quantization-guided Generative Transformers (VQ-GT) paradigm through a coarse-to-fine scalable generation process. In MoSa, we propose a Multi-scale Token Preservation Strategy (MTPS) integrated into a hierarchical residual vector quantization variational autoencoder (RQ-VAE). MTPS employs interpolation at each hierarchical quantization to effectively retain coarse-to-fine multi-scale tokens. With this, the generative transformer supports Scalable Autoregressive (SAR) modeling, which predicts scale tokens, unlike traditional methods that predict only one token at each step. Consequently, MoSa requires only 10 inference steps, matching the number of RQ-VAE quantization layers. To address potential reconstruction degradation from frequent interpolation, we propose CAQ-VAE, a lightweight yet expressive convolution-attention hybrid VQ-VAE. CAQ-VAE enhances residual block design and incorporates attention mechanisms to better capture global dependencies. Extensive experiments show that MoSa achieves state-of-the-art generation quality and efficiency, outperforming prior methods in both fidelity and speed. On the Motion-X dataset, MoSa achieves an FID of 0.06 (versus MoMask's 0.20) while reducing inference time by 27 percent. Moreover, MoSa generalizes well to downstream tasks such as motion editing, requiring no additional fine-tuning. The code is available at https://mosa-web.github.io/MoSa-web
☆ A Topology-Aware Graph Convolutional Network for Human Pose Similarity and Action Quality Assessment
Action Quality Assessment (AQA) requires fine-grained understanding of human motion and precise evaluation of pose similarity. This paper proposes a topology-aware Graph Convolutional Network (GCN) framework, termed GCN-PSN, which models the human skeleton as a graph to learn discriminative, topology-sensitive pose embeddings. Using a Siamese architecture trained with a contrastive regression objective, our method outperforms coordinate-based baselines and achieves competitive performance on AQA-7 and FineDiving benchmarks. Experimental results and ablation studies validate the effectiveness of leveraging skeletal topology for pose similarity and action quality assessment.
comment: 10 pages, 5 figures. Submitted as a computer vision paper in the cs.CV category
☆ LiDAR-VGGT: Cross-Modal Coarse-to-Fine Fusion for Globally Consistent and Metric-Scale Dense Mapping
Reconstructing large-scale colored point clouds is an important task in robotics, supporting perception, navigation, and scene understanding. Despite advances in LiDAR inertial visual odometry (LIVO), its performance remains highly sensitive to extrinsic calibration. Meanwhile, 3D vision foundation models, such as VGGT, suffer from limited scalability in large environments and inherently lack metric scale. To overcome these limitations, we propose LiDAR-VGGT, a novel framework that tightly couples LiDAR inertial odometry with the state-of-the-art VGGT model through a two-stage coarse- to-fine fusion pipeline: First, a pre-fusion module with robust initialization refinement efficiently estimates VGGT poses and point clouds with coarse metric scale within each session. Then, a post-fusion module enhances cross-modal 3D similarity transformation, using bounding-box-based regularization to reduce scale distortions caused by inconsistent FOVs between LiDAR and camera sensors. Extensive experiments across multiple datasets demonstrate that LiDAR-VGGT achieves dense, globally consistent colored point clouds and outperforms both VGGT-based methods and LIVO baselines. The implementation of our proposed novel color point cloud evaluation toolkit will be released as open source.
☆ Web-Scale Collection of Video Data for 4D Animal Reconstruction NeurIPS 2025
Computer vision for animals holds great promise for wildlife research but often depends on large-scale data, while existing collection methods rely on controlled capture setups. Recent data-driven approaches show the potential of single-view, non-invasive analysis, yet current animal video datasets are limited--offering as few as 2.4K 15-frame clips and lacking key processing for animal-centric 3D/4D tasks. We introduce an automated pipeline that mines YouTube videos and processes them into object-centric clips, along with auxiliary annotations valuable for downstream tasks like pose estimation, tracking, and 3D/4D reconstruction. Using this pipeline, we amass 30K videos (2M frames)--an order of magnitude more than prior works. To demonstrate its utility, we focus on the 4D quadruped animal reconstruction task. To support this task, we present Animal-in-Motion (AiM), a benchmark of 230 manually filtered sequences with 11K frames showcasing clean, diverse animal motions. We evaluate state-of-the-art model-based and model-free methods on Animal-in-Motion, finding that 2D metrics favor the former despite unrealistic 3D shapes, while the latter yields more natural reconstructions but scores lower--revealing a gap in current evaluation. To address this, we enhance a recent model-free approach with sequence-level optimization, establishing the first 4D animal reconstruction baseline. Together, our pipeline, benchmark, and baseline aim to advance large-scale, markerless 4D animal reconstruction and related tasks from in-the-wild videos. Code and datasets are available at https://github.com/briannlongzhao/Animal-in-Motion.
comment: NeurIPS 2025 Datasets and Benchmarks
☆ ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation, such that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning, i.e., textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. We introduce ROVER to address this pressing need to test reciprocal cross-modal reasoning, the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, which contains 1312 tasks grounded in 1876 images, spanning two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes for question answering. Experiments on 17 unified models reveal two key findings: (i) Cross-modal reasoning determines visual generation quality, with interleaved models significantly outperforming non-interleaved ones; notably, combining strong unimodal models fails to achieve comparable reasoning. (ii) Models show dissociation between physical and symbolic reasoning: they succeed at interpreting perceptual concepts literally but fail to construct visual abstractions for symbolic tasks, where faulty reasoning harms performance. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation.
comment: Project Page: https://roverbench.github.io/
☆ MicroAUNet: Boundary-Enhanced Multi-scale Fusion with Knowledge Distillation for Colonoscopy Polyp Image Segmentation
Early and accurate segmentation of colorectal polyps is critical for reducing colorectal cancer mortality, which has been extensively explored by academia and industry. However, current deep learning-based polyp segmentation models either compromise clinical decision-making by providing ambiguous polyp margins in segmentation outputs or rely on heavy architectures with high computational complexity, resulting in insufficient inference speeds for real-time colorectal endoscopic applications. To address this problem, we propose MicroAUNet, a light-weighted attention-based segmentation network that combines depthwise-separable dilated convolutions with a single-path, parameter-shared channel-spatial attention block to strengthen multi-scale boundary features. On the basis of it, a progressive two-stage knowledge-distillation scheme is introduced to transfer semantic and boundary cues from a high-capacity teacher. Extensive experiments on benchmarks also demonstrate the state-of-the-art accuracy under extremely low model complexity, indicating that MicroAUNet is suitable for real-time clinical polyp segmentation. The code is publicly available at https://github.com/JeremyXSC/MicroAUNet.
comment: Work in progress
☆ Few-Shot Multimodal Medical Imaging: A Theoretical Framework
Medical imaging relies heavily on large, labeled datasets. But, unfortunately, they are not always easily accessible in clinical settings. Additionally, many practitioners often face various structural obstacles like limited data availability, fragmented data systems, and unbalanced datasets. These barriers often lead to the increased diagnostic uncertainty, underrepresentation of certain conditions, reduced model robustness, and biased diagnostic decisions. In response to these challenges, approaches such as transfer learning, meta-learning, and multimodal fusion have made great strides. However, they still need a solid theoretical justification for why they succeed or fail in situations where data is scarce. To address this gap, we propose a unified theoretical framework that characterizes learning and inference under low-resource medical imaging conditions. We first formalize the learning objective under few-shot conditions and compute sample complexity constraints to estimate the smallest quantity of data needed to achieve clinically reliable accuracy. Then based on ideas from PAC-learning and PAC-Bayesian theory, we explain how multimodal integration encourages generalization and quantifies uncertainty under sparse supervision. We further propose a formal metric for explanation stability, offering interpretability guarantees under low-data conditions. Taken together, the proposed framework establishes a principled foundation for constructing dependable, data-efficient diagnostic systems by jointly characterizing sample efficiency, uncertainty quantification, and interpretability in a unified theoretical setting.
comment: 6 Pages
☆ Weakly Supervised Concept Learning with Class-Level Priors for Interpretable Medical Diagnosis
Human-interpretable predictions are essential for deploying AI in medical imaging, yet most interpretable-by-design (IBD) frameworks require concept annotations for training data, which are costly and impractical to obtain in clinical contexts. Recent attempts to bypass annotation, such as zero-shot vision-language models or concept-generation frameworks, struggle to capture domain-specific medical features, leading to poor reliability. In this paper, we propose a novel Prior-guided Concept Predictor (PCP), a weakly supervised framework that enables concept answer prediction without explicit supervision or reliance on language models. PCP leverages class-level concept priors as weak supervision and incorporates a refinement mechanism with KL divergence and entropy regularization to align predictions with clinical reasoning. Experiments on PH2 (dermoscopy) and WBCatt (hematology) show that PCP improves concept-level F1-score by over 33% compared to zero-shot baselines, while delivering competitive classification performance on four medical datasets (PH2, WBCatt, HAM10000, and CXR4) relative to fully supervised concept bottleneck models (CBMs) and V-IP.
☆ Boosting performance of computer vision applications through embedded GPUs on the edge
Computer vision applications, especially those using augmented reality technology, are becoming quite popular in mobile devices. However, this type of application is known as presenting significant demands regarding resources. In order to enable its utilization in devices with more modest resources, edge computing can be used to offload certain high intensive tasks. Still, edge computing is usually composed of devices with limited capacity, which may impact in users quality of experience when using computer vision applications. This work proposes the use of embedded devices with graphics processing units (GPUs) to overcome such limitation. Experiments performed shown that GPUs can attain a performance gain when compared to using only CPUs, which guarantee a better experience to users using such kind of application.
comment: 4 pages, 6 figures
♻ ☆ WildCAT3D: Appearance-Aware Multi-View Diffusion in the Wild NeurIPS 2025
Despite recent advances in sparse novel view synthesis (NVS) applied to object-centric scenes, scene-level NVS remains a challenge. A central issue is the lack of available clean multi-view training data, beyond manually curated datasets with limited diversity, camera variation, or licensing issues. On the other hand, an abundance of diverse and permissively-licensed data exists in the wild, consisting of scenes with varying appearances (illuminations, transient occlusions, etc.) from sources such as tourist photos. To this end, we present WildCAT3D, a framework for generating novel views of scenes learned from diverse 2D scene image data captured in the wild. We unlock training on these data sources by explicitly modeling global appearance conditions in images, extending the state-of-the-art multi-view diffusion paradigm to learn from scene views of varying appearances. Our trained model generalizes to new scenes at inference time, enabling the generation of multiple consistent novel views. WildCAT3D provides state-of-the-art results on single-view NVS in object- and scene-level settings, while training on strictly less data sources than prior methods. Additionally, it enables novel applications by providing global appearance control during generation.
comment: Accepted to NeurIPS 2025. Project page: https://wildcat3d.github.io
♻ ☆ RareFlow: Physics-Aware Flow-Matching for Cross-Sensor Super-Resolution of Rare-Earth Features
Super-resolution (SR) for remote sensing imagery often fails under out-of-distribution (OOD) conditions, such as rare geomorphic features captured by diverse sensors, producing visually plausible but physically inaccurate results. We present RareFlow, a physics-aware SR framework designed for OOD robustness. RareFlow's core is a dual-conditioning architecture. A Gated ControlNet preserves fine-grained geometric fidelity from the low-resolution input, while textual prompts provide semantic guidance for synthesizing complex features. To ensure physically sound outputs, we introduce a multifaceted loss function that enforces both spectral and radiometric consistency with sensor properties. Furthermore, the framework quantifies its own predictive uncertainty by employing a stochastic forward pass approach; the resulting output variance directly identifies unfamiliar inputs, mitigating feature hallucination. We validate RareFlow on a new, curated benchmark of multi-sensor satellite imagery. In blind evaluations, geophysical experts rated our model's outputs as approaching the fidelity of ground truth imagery, significantly outperforming state-of-the-art baselines. This qualitative superiority is corroborated by quantitative gains in perceptual metrics, including a nearly 40\% reduction in FID. RareFlow provides a robust framework for high-fidelity synthesis in data-scarce scientific domains and offers a new paradigm for controlled generation under severe domain shift.
♻ ☆ Non-Contact Health Monitoring During Daily Personal Care Routines
Remote photoplethysmography (rPPG) enables non-contact, continuous monitoring of physiological signals and offers a practical alternative to traditional health sensing methods. Although rPPG is promising for daily health monitoring, its application in long-term personal care scenarios, such as mirror-facing routines in high-altitude environments, remains challenging due to ambient lighting variations, frequent occlusions from hand movements, and dynamic facial postures. To address these challenges, we present LADH (Long-term Altitude Daily Health), the first long-term rPPG dataset containing 240 synchronized RGB and infrared (IR) facial videos from 21 participants across five common personal care scenarios, along with ground-truth PPG, respiration, and blood oxygen signals. Our experiments demonstrate that combining RGB and IR video inputs improves the accuracy and robustness of non-contact physiological monitoring, achieving a mean absolute error (MAE) of 4.99 BPM in heart rate estimation. Furthermore, we find that multi-task learning enhances performance across multiple physiological indicators simultaneously. Dataset and code are open at https://github.com/McJackTang/FusionVitals.
comment: IEEE BSN 2025
♻ ☆ ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers ICCV 2025
Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual token in approximately 60\% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50\% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV
comment: Published as a conference paper at ICCV 2025. Project page: https://github.com/icip-cas/ShortV
♻ ☆ Rethinking Visual Intelligence: Insights from Video Pretraining
Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
comment: Updated version from preprint arXiv:2506.07280 (Gen2Gen) focused on visual intelligence. This work can be considered as v2
♻ ☆ FlexEvent: Towards Flexible Event-Frame Object Detection at Varying Operational Frequencies NeurIPS 2025
Event cameras offer unparalleled advantages for real-time perception in dynamic environments, thanks to the microsecond-level temporal resolution and asynchronous operation. Existing event detectors, however, are limited by fixed-frequency paradigms and fail to fully exploit the high-temporal resolution and adaptability of event data. To address these limitations, we propose FlexEvent, a novel framework that enables detection at varying frequencies. Our approach consists of two key components: FlexFuse, an adaptive event-frame fusion module that integrates high-frequency event data with rich semantic information from RGB frames, and FlexTune, a frequency-adaptive fine-tuning mechanism that generates frequency-adjusted labels to enhance model generalization across varying operational frequencies. This combination allows our method to detect objects with high accuracy in both fast-moving and static scenarios, while adapting to dynamic environments. Extensive experiments on large-scale event camera datasets demonstrate that our approach surpasses state-of-the-art methods, achieving significant improvements in both standard and high-frequency settings. Notably, our method maintains robust performance when scaling from 20 Hz to 90 Hz and delivers accurate detection up to 180 Hz, proving its effectiveness in extreme conditions. Our framework sets a new benchmark for event-based object detection and paves the way for more adaptable, real-time vision systems.
comment: NeurIPS 2025; 28 pages, 14 figures, 10 tables; Code at https://flexevent.github.io/
♻ ☆ SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation and Understanding NeurIPS 2025
Leveraging recent diffusion models, LiDAR-based large-scale 3D scene generation has achieved great success. While recent voxel-based approaches can generate both geometric structures and semantic labels, existing range-view methods are limited to producing unlabeled LiDAR scenes. Relying on pretrained segmentation models to predict the semantic maps often results in suboptimal cross-modal consistency. To address this limitation while preserving the advantages of range-view representations, such as computational efficiency and simplified network design, we propose Spiral, a novel range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps. Furthermore, we introduce novel semantic-aware metrics to evaluate the quality of the generated labeled range-view data. Experiments on the SemanticKITTI and nuScenes datasets demonstrate that Spiral achieves state-of-the-art performance with the smallest parameter size, outperforming two-step methods that combine the generative and segmentation models. Additionally, we validate that range images generated by Spiral can be effectively used for synthetic data augmentation in the downstream segmentation training, significantly reducing the labeling effort on LiDAR data.
comment: NeurIPS 2025; 24 pages, 10 figures, 9 tables; Code at https://dekai21.github.io/SPIRAL/
♻ ☆ Efficient Remote Sensing Change Detection with Change State Space Models
Despite their frequent use for change detection, both ConvNets and Vision transformers (ViT) exhibit well-known limitations, namely the former struggle to model long-range dependencies while the latter are computationally inefficient, rendering them challenging to train on large-scale datasets. Vision Mamba, an architecture based on State Space Models has emerged as an alternative addressing the aforementioned deficiencies and has been already applied to remote sensing change detection, though mostly as a feature extracting backbone. In this article the Change State Space Model is introduced, that has been specifically designed for change detection by focusing on the relevant changes between bi-temporal images, effectively filtering out irrelevant information. By concentrating solely on the changed features, the number of network parameters is reduced, enhancing significantly computational efficiency while maintaining high detection performance and robustness against input degradation. The proposed model has been evaluated via three benchmark datasets, where it outperformed ConvNets, ViTs, and Mamba-based counterparts at a fraction of their computational complexity. The implementation will be made available at https://github.com/Elman295/CSSM upon acceptance.
♻ ☆ Double Descent Meets Out-of-Distribution Detection: Theoretical Insights and Empirical Analysis on the role of model complexity NeurIPS 2025
Out-of-distribution (OOD) detection is essential for ensuring the reliability and safety of machine learning systems. In recent years, it has received increasing attention, particularly through post-hoc detection and training-based methods. In this paper, we focus on post-hoc OOD detection, which enables identifying OOD samples without altering the model's training procedure or objective. Our primary goal is to investigate the relationship between model capacity and its OOD detection performance. Specifically, we aim to answer the following question: Does the Double Descent phenomenon manifest in post-hoc OOD detection? This question is crucial, as it can reveal whether overparameterization, which is already known to benefit generalization, can also enhance OOD detection. Despite the growing interest in these topics by the classic supervised machine learning community, this intersection remains unexplored for OOD detection. We empirically demonstrate that the Double Descent effect does indeed appear in post-hoc OOD detection. Furthermore, we provide theoretical insights to explain why this phenomenon emerges in such setting. Finally, we show that the overparameterized regime does not yield superior results consistently, and we propose a method to identify the optimal regime for OOD detection based on our observations.
comment: Accepted at NeurIPS 2025 (Conference on Neural Information Processing Systems)
♻ ☆ SonarSplat: Novel View Synthesis of Imaging Sonar via Gaussian Splatting
In this paper, we present SonarSplat, a novel Gaussian splatting framework for imaging sonar that demonstrates realistic novel view synthesis and models acoustic streaking phenomena. Our method represents the scene as a set of 3D Gaussians with acoustic reflectance and saturation properties. We develop a novel method to efficiently rasterize Gaussians to produce a range/azimuth image that is faithful to the acoustic image formation model of imaging sonar. In particular, we develop a novel approach to model azimuth streaking in a Gaussian splatting framework. We evaluate SonarSplat using real-world datasets of sonar images collected from an underwater robotic platform in a controlled test tank and in a real-world river environment. Compared to the state-of-the-art, SonarSplat offers improved image synthesis capabilities (+3.2 dB PSNR) and more accurate 3D reconstruction (77% lower Chamfer Distance). We also demonstrate that SonarSplat can be leveraged for azimuth streak removal.
♻ ☆ A Comprehensive Evaluation of YOLO-based Deer Detection Performance on Edge Devices
The escalating economic losses in agriculture due to deer intrusion, estimated to be in the hundreds of millions of dollars annually in the U.S., highlight the inadequacy of traditional mitigation strategies such as hunting, fencing, use of repellents, and scare tactics. This underscores a critical need for intelligent, autonomous solutions capable of real-time deer detection and deterrence. But the progress in this field is impeded by a significant gap in the literature, mainly the lack of a domain-specific, practical dataset and limited study on the viability of deer detection systems on edge devices. To address this gap, this study presents a comprehensive evaluation of state-of-the-art deep learning models for deer detection in challenging real-world scenarios. We introduce a curated, publicly available dataset of 3,095 annotated images with bounding box annotations of deer. Then, we provide an extensive comparative analysis of 12 model variants across four recent YOLO architectures (v8 to v11). Finally, we evaluated their performance on two representative edge computing platforms: the CPU-based Raspberry Pi 5 and the GPU-accelerated NVIDIA Jetson AGX Xavier to assess feasibility for real-world field deployment. Results show that the real-time detection performance is not feasible on Raspberry Pi without hardware-specific model optimization, while NVIDIA Jetson provides greater than 30 frames per second (FPS) with 's' and 'n' series models. This study also reveals that smaller, architecturally advanced models such as YOLOv11n, YOLOv8s, and YOLOv9s offer the optimal balance of high accuracy (Average Precision (AP) > 0.85) and computational efficiency (Inference Time < 34 milliseconds).
comment: 13 pages, 7 figures
♻ ☆ TinyDef-DETR: A Transformer-Based Framework for Defect Detection in Transmission Lines from UAV Imagery
Automated defect detection from UAV imagery of transmission lines is a challenging task due to the small size, ambiguity, and complex backgrounds of defects. This paper proposes TinyDef-DETR, a DETR-based framework designed to achieve accurate and efficient detection of transmission line defects from UAV-acquired images. The model integrates four major components: an edge-enhanced ResNet backbone to strengthen boundary-sensitive representations, a stride-free space-to-depth module to enable detail-preserving downsampling, a cross-stage dual-domain multi-scale attention mechanism to jointly model global context and local cues, and a Focaler-Wise-SIoU regression loss to improve the localization of small and difficult objects. Together, these designs effectively mitigate the limitations of conventional detectors. Extensive experiments on both public and real-world datasets demonstrate that TinyDef-DETR achieves superior detection performance and strong generalization capability, while maintaining modest computational overhead. The accuracy and efficiency of TinyDef-DETR make it a suitable method for UAV-based transmission line defect detection, particularly in scenarios involving small and ambiguous objects.
♻ ☆ New multimodal similarity measure for image registration via modeling local functional dependence with linear combination of learned basis functions
The deformable registration of images of different modalities, essential in many medical imaging applications, remains challenging. The main challenge is developing a robust measure for image overlap despite the compared images capturing different aspects of the underlying tissue. Here, we explore similarity metrics based on functional dependence between intensity values of registered images. Although functional dependence is too restrictive on the global scale, earlier work has shown competitive performance in deformable registration when such measures are applied over small enough contexts. We confirm this finding and further develop the idea by modeling local functional dependence via the linear basis function model with the basis functions learned jointly with the deformation. The measure can be implemented via convolutions, making it efficient to compute on GPUs. We release the method as an easy-to-use tool and show good performance on three datasets compared to well-established baseline and earlier functional dependence-based methods.
comment: Improved experimental setup
♻ ☆ GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering
Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.
♻ ☆ Phys4DGen: Physics-Compliant 4D Generation with Multi-Material Composition Perception ACM MM 2025
4D content generation aims to create dynamically evolving 3D content that responds to specific input objects such as images or 3D representations. Current approaches typically incorporate physical priors to animate 3D representations, but these methods suffer from significant limitations: they not only require users lacking physics expertise to manually specify material properties but also struggle to effectively handle the generation of multi-material composite objects. To address these challenges, we propose Phys4DGen, a novel 4D generation framework that integrates multi-material composition perception with physical simulation. The framework achieves automated, physically plausible 4D generation through three innovative modules: first, the 3D Material Grouping module partitions heterogeneous material regions on 3D representations' surfaces via semantic segmentation; second, the Internal Physical Structure Discovery module constructs the mechanical structure of object interiors; finally, we distill physical prior knowledge from multimodal large language models to enable rapid and automatic material properties identification for both objects' surfaces and interiors. Experiments on both synthetic and real-world datasets demonstrate that Phys4DGen can generate high-fidelity 4D content with physical realism in open-world scenarios, significantly outperforming state-of-the-art methods.
comment: Accepted by ACM MM 2025. Project Page: https://jiajinglin.github.io/Phys4DGen
♻ ☆ Understanding Ice Crystal Habit Diversity with Self-Supervised Learning NeurIPS 2025
Ice-containing clouds strongly impact climate, but they are hard to model due to ice crystal habit (i.e., shape) diversity. We use self-supervised learning (SSL) to learn latent representations of crystals from ice crystal imagery. By pre-training a vision transformer with many cloud particle images, we learn robust representations of crystal morphology, which can be used for various science-driven tasks. Our key contributions include (1) validating that our SSL approach can be used to learn meaningful representations, and (2) presenting a relevant application where we quantify ice crystal diversity with these latent representations. Our results demonstrate the power of SSL-driven representations to improve the characterization of ice crystals and subsequently constrain their role in Earth's climate system.
comment: Accepted to NeurIPS 2025 Workshop: Tackling Climate Change with Machine Learning
♻ ☆ FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility. Please check our website for more details: https://czzzzh.github.io/FreeArt3D
comment: Project Page: https://czzzzh.github.io/FreeArt3D Code: https://github.com/CzzzzH/FreeArt3D
♻ ☆ MediQ-GAN: Quantum-Inspired GAN for High Resolution Medical Image Generation
Machine learning-assisted diagnosis shows promise, yet medical imaging datasets are often scarce, imbalanced, and constrained by privacy, making data augmentation essential. Classical generative models typically demand extensive computational and sample resources. Quantum computing offers a promising alternative, but existing quantum-based image generation methods remain limited in scale and often face barren plateaus. We present MediQ-GAN, a quantum-inspired GAN with prototype-guided skip connections and a dual-stream generator that fuses classical and quantum-inspired branches. Its variational quantum circuits inherently preserve full-rank mappings, avoid rank collapse, and are theory-guided to balance expressivity with trainability. Beyond generation quality, we provide the first latent-geometry and rank-based analysis of quantum-inspired GANs, offering theoretical insight into their performance. Across three medical imaging datasets, MediQ-GAN outperforms state-of-the-art GANs and diffusion models. While validated on IBM hardware for robustness, our contribution is hardware-agnostic, offering a scalable and data-efficient framework for medical image generation and augmentation.
♻ ☆ Real World Federated Learning with a Knowledge Distilled Transformer for Cardiac CT Imaging
Federated learning is a renowned technique for utilizing decentralized data while preserving privacy. However, real-world applications often face challenges like partially labeled datasets, where only a few locations have certain expert annotations, leaving large portions of unlabeled data unused. Leveraging these could enhance transformer architectures ability in regimes with small and diversely annotated sets. We conduct the largest federated cardiac CT analysis to date (n=8,104) in a real-world setting across eight hospitals. Our two-step semi-supervised strategy distills knowledge from task-specific CNNs into a transformer. First, CNNs predict on unlabeled data per label type and then the transformer learns from these predictions with label-specific heads. This improves predictive accuracy and enables simultaneous learning of all partial labels across the federation, and outperforms UNet-based models in generalizability on downstream tasks. Code and model weights are made openly available for leveraging future cardiac CT analysis.
♻ ☆ GeoSDF: Plane Geometry Diagram Synthesis via Signed Distance Field
Plane Geometry Diagram Synthesis has been a crucial task in computer graphics, with applications ranging from educational tools to AI-driven mathematical reasoning. Traditionally, we rely on manual tools (e.g., Matplotlib and GeoGebra) to generate precise diagrams, but this usually requires huge, complicated calculations. Recently, researchers start to work on model-based methods (e.g., Stable Diffusion and GPT5) to automatically generate diagrams, saving operational cost but usually suffering from limited realism and insufficient accuracy. In this paper, we propose a novel framework GeoSDF, to automatically generate diagrams efficiently and accurately with Signed Distance Field (SDF). Specifically, we first represent geometric elements (e.g., points, segments, and circles) in the SDF, then construct a series of constraint functions to represent geometric relationships. Next, we optimize those constructed constraint functions to get an optimized field of both elements and constraints. Finally, by rendering the optimized field, we can obtain the synthesized diagram. In our GeoSDF, we define a symbolic language to represent geometric elements and constraints, and our synthesized geometry diagrams can be self-verified in the SDF, ensuring both mathematical accuracy and visual plausibility. In experiments, through both qualitative and quantitative analysis, GeoSDF synthesized both normal high-school level and IMO-level geometry diagrams. We achieve 88.67\% synthesis accuracy by human evaluation in the IMO problem set. Furthermore, we obtain a very high accuracy of solving geometry problems (over 95\% while the current SOTA accuracy is around 75%) by leveraging our self-verification property. All of these demonstrate the advantage of GeoSDF, paving the way for more sophisticated, accurate, and flexible generation of geometric diagrams for a wide array of applications.
♻ ☆ What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?
Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stereo matching performance using standard benchmarks. We validate our findings by collecting the best settings and creating a large-scale dataset. Training only on this dataset achieves better performance than training on a mixture of widely used datasets, and is competitive with training on the FoundationStereo dataset, with the additional benefit of open-source generation code and an accompanying parameter analysis to enable further research. We open-source our system at https://github.com/princeton-vl/InfinigenStereo to enable further research on procedural stereo datasets.
♻ ☆ A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions
When we train models on biased datasets, they not only reproduce data biases, but can worsen them at test time - a phenomenon called bias amplification. Many of the current bias amplification metrics (e.g., BA (MALS), DPA) measure bias amplification only in classification datasets. These metrics are ineffective for image captioning datasets, as they cannot capture the language semantics of a caption. Recent work introduced Leakage in Captioning (LIC), a language-aware bias amplification metric that understands caption semantics. However, LIC has a crucial limitation: it cannot identify the source of bias amplification in captioning models. We propose Directional Bias Amplification in Captioning (DBAC), a language-aware and directional metric that can identify when captioning models amplify biases. DBAC has two more improvements over LIC: (1) it is less sensitive to sentence encoders (a hyperparameter in language-aware metrics), and (2) it provides a more accurate estimate of bias amplification in captions. Our experiments on gender and race attributes in the COCO captions dataset show that DBAC is the only reliable metric to measure bias amplification in captions.
♻ ☆ Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
This work presents Sa2VA, the first comprehensive, unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with MLLM, the advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves strong performance across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications. In addition, Sa2VA can be easily extended into various VLMs, including Qwen-VL and Intern-VL, which can be updated with rapid process in current open-sourced VLMs. Code and models have been provided to the community.
comment: Code: https://github.com/Bytedance/Sa2VA
♻ ☆ Detailed Aerial Mapping of Photovoltaic Power Plants Through Semantically Significant Keypoints
An accurate and up-to-date model of a photovoltaic (PV) power plant is essential for its optimal operation and maintenance. However, such a model may not be easily available. This work introduces a novel approach for PV power plant mapping based on aerial overview images. It enables the automation of the mapping process while removing the reliance on third-party data. The presented mapping method takes advantage of the structural layout of the power plants to achieve detailed modeling down to the level of individual PV modules. The approach relies on visual segmentation of PV modules in overview images and the inference of structural information in each image, assigning modules to individual benches, rows, and columns. We identify visual keypoints related to the layout and use these to merge detections from multiple images while maintaining their structural integrity. The presented method was experimentally verified and evaluated on two different power plants. The final fusion of 3D positions and semantic structures results in a compact georeferenced model suitable for power plant maintenance.
comment: 11 pages, 18 figures. Accepted version
♻ ☆ Preliminary study on artificial intelligence methods for cybersecurity threat detection in computer networks based on raw data packets
Most of the intrusion detection methods in computer networks are based on traffic flow characteristics. However, this approach may not fully exploit the potential of deep learning algorithms to directly extract features and patterns from raw packets. Moreover, it impedes real-time monitoring due to the necessity of waiting for the processing pipeline to complete and introduces dependencies on additional software components. In this paper, we investigate deep learning methodologies capable of detecting attacks in real-time directly from raw packet data within network traffic. We propose a novel approach where packets are stacked into windows and separately recognised, with a 2D image representation suitable for processing with computer vision models. Our investigation utilizes the CIC IDS-2017 dataset, which includes both benign traffic and prevalent real-world attacks, providing a comprehensive foundation for our research.
comment: Submitted to Computer Science Journal. Version with updated acknowledgments
♻ ☆ Kineo: Calibration-Free Metric Motion Capture From Sparse RGB Cameras
Markerless multiview motion capture is often constrained by the need for precise camera calibration, limiting accessibility for non-experts and in-the-wild captures. Existing calibration-free approaches mitigate this requirement but suffer from high computational cost and reduced reconstruction accuracy. We present Kineo, a fully automatic, calibration-free pipeline for markerless motion capture from videos captured by unsynchronized, uncalibrated, consumer-grade RGB cameras. Kineo leverages 2D keypoints from off-the-shelf detectors to simultaneously calibrate cameras, including Brown-Conrady distortion coefficients, and reconstruct 3D keypoints and dense scene point maps at metric scale. A confidence-driven spatio-temporal keypoint sampling strategy, combined with graph-based global optimization, ensures robust calibration at a fixed computational cost independent of sequence length. We further introduce a pairwise reprojection consensus score to quantify 3D reconstruction reliability for downstream tasks. Evaluations on EgoHumans and Human3.6M demonstrate substantial improvements over prior calibration-free methods. Compared to previous state-of-the-art approaches, Kineo reduces camera translation error by approximately 83-85%, camera angular error by 86-92%, and world mean-per-joint error (W-MPJPE) by 83-91%. Kineo is also efficient in real-world scenarios, processing multi-view sequences faster than their duration in specific configuration (e.g., 36min to process 1h20min of footage). The full pipeline and evaluation code are openly released to promote reproducibility and practical adoption at https://liris-xr.github.io/kineo/.
♻ ☆ ReviveDiff: A Universal Diffusion Model for Restoring Images in Adverse Weather Conditions
Images captured in challenging environments--such as nighttime, smoke, rainy weather, and underwater--often suffer from significant degradation, resulting in a substantial loss of visual quality. The effective restoration of these degraded images is critical for the subsequent vision tasks. While many existing approaches have successfully incorporated specific priors for individual tasks, these tailored solutions limit their applicability to other degradations. In this work, we propose a universal network architecture, dubbed ``ReviveDiff'', which can address various degradations and bring images back to life by enhancing and restoring their quality. Our approach is inspired by the observation that, unlike degradation caused by movement or electronic issues, quality degradation under adverse conditions primarily stems from natural media (such as fog, water, and low luminance), which generally preserves the original structures of objects. To restore the quality of such images, we leveraged the latest advancements in diffusion models and developed ReviveDiff to restore image quality from both macro and micro levels across some key factors determining image quality, such as sharpness, distortion, noise level, dynamic range, and color accuracy. We rigorously evaluated ReviveDiff on seven benchmark datasets covering five types of degrading conditions: Rainy, Underwater, Low-light, Smoke, and Nighttime Hazy. Our experimental results demonstrate that ReviveDiff outperforms the state-of-the-art methods both quantitatively and visually.
♻ ☆ Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning NeurIPS 2025
Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to $\textbf{15.97%}$ in accuracy. The code is available at https://github.com/yaxinhou/CPG.
comment: The paper is accepted by NeurIPS 2025
♻ ☆ Geospatial Foundation Models to Enable Progress on Sustainable Development Goals
Foundation Models (FMs) are large-scale, pre-trained artificial intelligence (AI) systems that have revolutionized natural language processing and computer vision, and are now advancing geospatial analysis and Earth Observation (EO). They promise improved generalization across tasks, scalability, and efficient adaptation with minimal labeled data. However, despite the rapid proliferation of geospatial FMs, their real-world utility and alignment with global sustainability goals remain underexplored. We introduce SustainFM, a comprehensive benchmarking framework grounded in the 17 Sustainable Development Goals with extremely diverse tasks ranging from asset wealth prediction to environmental hazard detection. This study provides a rigorous, interdisciplinary assessment of geospatial FMs and offers critical insights into their role in attaining sustainability goals. Our findings show: (1) While not universally superior, FMs often outperform traditional approaches across diverse tasks and datasets. (2) Evaluating FMs should go beyond accuracy to include transferability, generalization, and energy efficiency as key criteria for their responsible use. (3) FMs enable scalable, SDG-grounded solutions, offering broad utility for tackling complex sustainability challenges. Critically, we advocate for a paradigm shift from model-centric development to impact-driven deployment, and emphasize metrics such as energy efficiency, robustness to domain shifts, and ethical considerations.
♻ ☆ Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context
We propose a novel approach to improve action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and previous actions, to reflect the action's temporal context. To achieve this, we introduce a transformer architecture tailored for action recognition that employs both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function to simultaneously train the model for both coarse- and fine-grained action recognition, effectively exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset by incorporating action hierarchies, resulting in the Hierarchical TSU dataset, a hierarchical dataset designed for monitoring activities of the elderly in home environments. An ablation study assesses the performance impact of different strategies for integrating contextual and hierarchical data. Experimental results demonstrate that the proposed method consistently outperforms SOTA methods on the Hierarchical TSU dataset, Assembly101 and IkeaASM, achieving over a 17% improvement in top-1 accuracy.
♻ ☆ VidText: Towards Comprehensive Evaluation for Video Text Understanding
Visual texts embedded in videos carry rich semantic information, which is crucial for both holistic video understanding and fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook textual information, while OCR-specific benchmarks are constrained to static images, limiting their ability to capture the interaction between text and dynamic visual contexts. To address this gap, we propose VidText, a new benchmark designed for comprehensive and in-depth evaluation of video text understanding. VidText offers the following key features: 1) It covers a wide range of real-world scenarios and supports multilingual content, encompassing diverse settings where video text naturally appears. 2) It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks, enabling assessment of both global summarization and local retrieval capabilities. 3) The benchmark also introduces a set of paired perception reasoning tasks, ranging from visual text perception to cross-modal reasoning between textual and visual information. Extensive experiments on 18 state-of-the-art Large Multimodal Models (LMMs) reveal that current models struggle across most tasks, with significant room for improvement. Further analysis highlights the impact of both model-intrinsic factors, such as input resolution and OCR capability, and external factors, including the use of auxiliary information and Chain-of-Thought reasoning strategies. We hope VidText will fill the current gap in video understanding benchmarks and serve as a foundation for future research on multimodal reasoning with video text in dynamic environments.
♻ ☆ Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it particularly well-for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
♻ ☆ Finite element-based space-time total variation-type regularization of the inverse problem in electrocardiographic imaging
Reconstructing cardiac electrical activity from body surface electric potential measurements results in the severely ill-posed inverse problem in electrocardiography. Many different regularization approaches have been proposed to improve numerical results and provide unique results. This work presents a novel approach for reconstructing the epicardial potential from body surface potential maps based on a space-time total variation-type regularization using finite elements, where a first-order primal-dual algorithm solves the underlying convex optimization problem. In several numerical experiments, the superior performance of this method and the benefit of space-time regularization for the reconstruction of epicardial potential on two-dimensional torso data and a three-dimensional rabbit heart compared to state-of-the-art methods are demonstrated.
♻ ☆ VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation
In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point clouds feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6% on par with DP3 64.0% and far higher than DP 34.8%, while in real-world tasks, it reaches 87.9%, outperforming both DP3 67.5% and DP 11.2% by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.
♻ ☆ Transforming Hyperspectral Images Into Chemical Maps: A Novel End-to-End Deep Learning Approach
Current approaches to chemical map generation from hyperspectral images are based on models such as partial least squares (PLS) regression, generating pixel-wise predictions that do not consider spatial context and suffer from a high degree of noise. This study proposes an end-to-end deep learning approach using a modified version of U-Net and a custom loss function to directly obtain chemical maps from hyperspectral images, skipping all intermediate steps required for traditional pixel-wise analysis. This study compares the U-Net with the traditional PLS regression on a real dataset of pork belly samples with associated mean fat reference values. The U-Net obtains a test set root mean squared error that is 7% lower than that of PLS regression on the task of mean fat prediction. At the same time, U-Net generates fine detail chemical maps where 99.91% of the variance is spatially correlated. Conversely, only 2.37% of the variance in the PLS-generated chemical maps is spatially correlated, indicating that each pixel-wise prediction is largely independent of neighboring pixels. Additionally, while the PLS-generated chemical maps contain predictions far beyond the physically possible range of 0-100%, U-Net learns to stay inside this range. Thus, the findings of this study indicate that U-Net is superior to PLS for chemical map generation.
♻ ☆ Surgical Scene Understanding in the Era of Foundation AI Models: A Comprehensive Review
Recent advancements in machine learning (ML) and deep learning (DL), particularly through the introduction of Foundation Models (FMs), have significantly enhanced surgical scene understanding within minimally invasive surgery (MIS). This paper surveys the integration of state-of-the-art ML and DL technologies, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Foundation Models like the Segment Anything Model (SAM), into surgical workflows. These technologies improve segmentation accuracy, instrument tracking, and phase recognition in surgical scene understanding. The paper explores the challenges these technologies face, such as data variability and computational demands, and discusses ethical considerations and integration hurdles in clinical settings. Highlighting the roles of FMs, we bridge the technological capabilities with clinical needs and outline future research directions to enhance the adaptability, efficiency, and ethical alignment of AI applications in surgery. Our findings suggest that substantial progress has been made; however, more focused efforts are required to achieve seamless integration of these technologies into clinical workflows, ensuring they complement surgical practice by enhancing precision, reducing risks, and optimizing patient outcomes.
♻ ☆ Adjustable Spatio-Spectral Hyperspectral Image Compression Network
With the rapid growth of hyperspectral data archives in remote sensing (RS), the need for efficient storage has become essential, driving significant attention toward learning-based hyperspectral image (HSI) compression. However, a comprehensive investigation of the individual and joint effects of spectral and spatial compression on learning-based HSI compression has not been thoroughly examined yet. Conducting such an analysis is crucial for understanding how the exploitation of spectral, spatial, and joint spatio-spectral redundancies affects HSI compression. To address this issue, we propose Adjustable Spatio-Spectral Hyperspectral Image Compression Network (HyCASS), a learning-based model designed for adjustable HSI compression in both spectral and spatial dimensions. HyCASS consists of six main modules: 1) spectral encoder module; 2) spatial encoder module; 3) compression ratio (CR) adapter encoder module; 4) CR adapter decoder module; 5) spatial decoder module; and 6) spectral decoder module. The modules employ convolutional layers and transformer blocks to capture both short-range and long-range redundancies. Experimental results on three HSI benchmark datasets demonstrate the effectiveness of our proposed adjustable model compared to existing learning-based compression models, surpassing the state of the art by up to 2.36 dB in terms of PSNR. Based on our results, we establish a guideline for effectively balancing spectral and spatial compression across different CRs, taking into account the spatial resolution of the HSIs. Our code and pre-trained model weights are publicly available at https://git.tu-berlin.de/rsim/hycass .
♻ ☆ Class Agnostic Instance-level Descriptor for Visual Instance Search
Despite the great success of the deep features in content-based image retrieval, the visual instance search remains challenging due to the lack of effective instance-level feature representation. Supervised or weakly supervised object detection methods are not the appropriate solutions due to their poor performance on the unknown object categories. In this paper, based on the feature set output from self-supervised ViT, the instance-level region discovery is modeled as detecting the compact feature subsets in a hierarchical fashion. The hierarchical decomposition results in a hierarchy of instance regions. On the one hand, this kind of hierarchical decomposition well addresses the problem of object embedding and occlusions, which are widely observed in real scenarios. On the other hand, the non-leaf nodes and leaf nodes on the hierarchy correspond to the instance regions in different granularities within an image. Therefore, features in uniform length are produced for these instance regions, which may cover across a dominant image region, an integral of multiple instances, or various individual instances. Such a collection of features allows us to unify the image retrieval, multi-instance search, and instance search into one framework. The empirical studies on three benchmarks show that such an instance-level descriptor remains effective on both the known and unknown object categories. Moreover, the superior performance is achieved on single-instance and multi-instance search, as well as image retrieval tasks.
♻ ☆ SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning NeurIPS 2025
Deciphering how visual stimuli are transformed into cortical responses is a fundamental challenge in computational neuroscience. This visual-to-neural mapping is inherently a one-to-many relationship, as identical visual inputs reliably evoke variable hemodynamic responses across trials, contexts, and subjects. However, existing deterministic methods struggle to simultaneously model this biological variability while capturing the underlying functional consistency that encodes stimulus information. To address these limitations, we propose SynBrain, a generative framework that simulates the transformation from visual semantics to neural responses in a probabilistic and biologically interpretable manner. SynBrain introduces two key components: (i) BrainVAE models neural representations as continuous probability distributions via probabilistic learning while maintaining functional consistency through visual semantic constraints; (ii) A Semantic-to-Neural Mapper acts as a semantic transmission pathway, projecting visual semantics into the neural response manifold to facilitate high-fidelity fMRI synthesis. Experimental results demonstrate that SynBrain surpasses state-of-the-art methods in subject-specific visual-to-fMRI encoding performance. Furthermore, SynBrain adapts efficiently to new subjects with few-shot data and synthesizes high-quality fMRI signals that are effective in improving data-limited fMRI-to-image decoding performance. Beyond that, SynBrain reveals functional consistency across trials and subjects, with synthesized signals capturing interpretable patterns shaped by biological neural variability. Our code is available at https://github.com/MichaelMaiii/SynBrain.
comment: Accepted by NeurIPS 2025
♻ ☆ 50 Years of Water Body Monitoring: The Case of Qaraaoun Reservoir, Lebanon
The sustainable management of the Qaraaoun Reservoir, the largest surface water body in Lebanon located in the Bekaa Plain, depends on reliable monitoring of its storage volume despite frequent sensor malfunctions and limited maintenance capacity. This study introduces a sensor-free approach that integrates open-source satellite imagery, advanced water-extent segmentation, and machine learning to estimate the reservoir's surface area and, subsequently, its volume in near real time. Sentinel-2 and Landsat 1-9 images are processed, where surface water is delineated using a newly proposed water segmentation index. A machine learning model based on Support Vector Regression (SVR) is trained on a curated dataset that includes water surface area, water level, and water volume derived from a reservoir bathymetric survey. The model is then able to estimate the water body's volume solely from the extracted water surface, without the need for any ground-based measurements. Water segmentation using the proposed index aligns with ground truth for over 95% of the shoreline. Hyperparameter tuning with GridSearchCV yields an optimized SVR performance, with an error below 1.5% of the full reservoir capacity and coefficients of determination exceeding 0.98. These results demonstrate the method's robustness and cost-effectiveness, offering a practical solution for continuous, sensor-independent monitoring of reservoir storage. The proposed methodology is applicable to other water bodies and generates over five decades of time-series data, offering valuable insights into climate change and environmental dynamics, with an emphasis on capturing temporal trends rather than exact water volume measurements.
♻ ☆ Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding NeurIPS 2025
Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery (DVD) agent to leverage an agentic search strategy over segmented video clips. Unlike previous video agents that rely on predefined workflows applied uniformly across different queries, our approach emphasizes the autonomous and adaptive nature of agents. By providing a set of search-centric tools on multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLM to plan on its current observation state, strategically selects tools to orchestrate adaptive workflow for different queries in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates our advantage. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of 74.2%, which substantially surpasses all prior works, and further improves to 76.0% with transcripts. The code has been released at https://github.com/microsoft/DeepVideoDiscovery.
comment: Accepted to NeurIPS 2025
♻ ☆ FIRE: Robust Detection of Diffusion-Generated Images via Frequency-Guided Reconstruction Error CVPR 2025
The rapid advancement of diffusion models has significantly improved high-quality image generation, making generated content increasingly challenging to distinguish from real images and raising concerns about potential misuse. In this paper, we observe that diffusion models struggle to accurately reconstruct mid-band frequency information in real images, suggesting the limitation could serve as a cue for detecting diffusion model generated images. Motivated by this observation, we propose a novel method called Frequency-guided Reconstruction Error (FIRE), which, to the best of our knowledge, is the first to investigate the influence of frequency decomposition on reconstruction error. FIRE assesses the variation in reconstruction error before and after the frequency decomposition, offering a robust method for identifying diffusion model generated images. Extensive experiments show that FIRE generalizes effectively to unseen diffusion models and maintains robustness against diverse perturbations.
comment: 14 pages, 14 figures. Accepted to CVPR 2025
♻ ☆ TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
Text-conditioned image generation has gained significant attention in recent years and are processing increasingly longer and comprehensive text prompt. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of both text and visuals is essential for conveying complex information. However, despite these advances, the generation of images containing long-form text remains a persistent challenge, largely due to the limitations of existing datasets, which often focus on shorter and simpler text. To address this gap, we introduce TextAtlas5M, a novel dataset specifically designed to evaluate long-text rendering in text-conditioned image generation. Our dataset consists of 5 million long-text generated and collected images across diverse data types, enabling comprehensive evaluation of large-scale generative models on long-text image generation. We further curate 3000 human-improved test set TextAtlasEval across 3 data domains, establishing one of the most extensive benchmarks for text-conditioned generation. Evaluations suggest that the TextAtlasEval benchmarks present significant challenges even for the most advanced proprietary models (e.g. GPT4o with DallE-3), while their open-source counterparts show an even larger performance gap. These evidences position TextAtlas5M as a valuable dataset for training and evaluating future-generation text-conditioned image generation models.
comment: 27 pages, 15 figures. Dataset Website: https://textatlas5m.github.io
♻ ☆ Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large-language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model's adaptability and effectiveness across various downstream tasks.
♻ ☆ Generative diffusion modeling protocols for improving the Kikuchi pattern indexing in electron back-scatter diffraction
Electron back-scatter diffraction (EBSD) has traditionally relied upon methods such as the Hough transform and dictionary Indexing to interpret diffraction patterns and extract crystallographic orientation. However, these methods encounter significant limitations, particularly when operating at high scanning speeds, where the exposure time per pattern is decreased beyond the operating sensitivity of CCD camera. Hence the signal to noise ratio decreases for the observed pattern which makes the pattern noisy, leading to reduced indexing accuracy. This research work aims to develop generative machine learning models for the post-processing or on-the-fly processing of Kikuchi patterns which are capable of restoring noisy EBSD patterns obtained at high scan speeds. These restored patterns can be used for the determination of crystal orientations to provide reliable indexing results. We compare the performance of such generative models in enhancing the quality of patterns captured at short exposure times (high scan speeds). An interesting observation is that the methodology is not data-hungry as typical machine learning methods.
♻ ☆ Epistemic Uncertainty for Generated Image Detection NeurIPS 2025
We introduce a novel framework for AI-generated image detection through epistemic uncertainty, aiming to address critical security concerns in the era of generative models. Our key insight stems from the observation that distributional discrepancies between training and testing data manifest distinctively in the epistemic uncertainty space of machine learning models. In this context, the distribution shift between natural and generated images leads to elevated epistemic uncertainty in models trained on natural images when evaluating generated ones. Hence, we exploit this phenomenon by using epistemic uncertainty as a proxy for detecting generated images. This converts the challenge of generated image detection into the problem of uncertainty estimation, underscoring the generalization performance of the model used for uncertainty estimation. Fortunately, advanced large-scale vision models pre-trained on extensive natural images have shown excellent generalization performance for various scenarios. Thus, we utilize these pre-trained models to estimate the epistemic uncertainty of images and flag those with high uncertainty as generated. Extensive experiments demonstrate the efficacy of our method. Code is available at https://github.com/tmlr-group/WePe.
comment: 28 pages, 10 figures, NeurIPS 2025
♻ ☆ Aligning Effective Tokens with Video Anomaly in Large Language Models
Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vison Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). These modules enable our model to effectively capture and analyze both spatial and temporal information associated with abnormal events, resulting in more accurate responses and interactions. Furthermore, we construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs, and introduce a cross-domain evaluation benchmark based on XD-Violence dataset. Our proposed method outperforms existing state-of-the-art methods on various benchmarks.
♻ ☆ ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.
♻ ☆ Coarse Attribute Prediction with Task Agnostic Distillation for Real World Clothes Changing ReID BMVC
This work focuses on Clothes Changing Re-IDentification (CC-ReID) for the real world. Existing works perform well with high-quality (HQ) images, but struggle with low-quality (LQ) where we can have artifacts like pixelation, out-of-focus blur, and motion blur. These artifacts introduce noise to not only external biometric attributes (e.g. pose, body shape, etc.) but also corrupt the model's internal feature representation. Models usually cluster LQ image features together, making it difficult to distinguish between them, leading to incorrect matches. We propose a novel framework Robustness against Low-Quality (RLQ) to improve CC-ReID model on real-world data. RLQ relies on Coarse Attributes Prediction (CAP) and Task Agnostic Distillation (TAD) operating in alternate steps in a novel training mechanism. CAP enriches the model with external fine-grained attributes via coarse predictions, thereby reducing the effect of noisy inputs. On the other hand, TAD enhances the model's internal feature representation by bridging the gap between HQ and LQ features, via an external dataset through task-agnostic self-supervision and distillation. RLQ outperforms the existing approaches by 1.6%-2.9% Top-1 on real-world datasets like LaST, and DeepChange, while showing consistent improvement of 5.3%-6% Top-1 on PRCC with competitive performance on LTCC. *The code will be made public soon.*
comment: The 36th British Machine Vision Conference (BMVC)
♻ ☆ Multi-Focused Video Group Activities Hashing
With the explosive growth of video data in various complex scenarios, quickly retrieving group activities has become an urgent problem. However, many tasks can only retrieve videos focusing on an entire video, not the activity granularity. To solve this problem, we propose a new STVH (spatiotemporal interleaved video hashing) technique for the first time. Through a unified framework, the STVH simultaneously models individual object dynamics and group interactions, capturing the spatiotemporal evolution on both group visual features and positional features. Moreover, in real-life video retrieval scenarios, it may sometimes require activity features, while at other times, it may require visual features of objects. We then further propose a novel M-STVH (multi-focused spatiotemporal video hashing) as an enhanced version to handle this difficult task. The advanced method incorporates hierarchical feature integration through multi-focused representation learning, allowing the model to jointly focus on activity semantics features and object visual features. We conducted comparative experiments on publicly available datasets, and both STVH and M-STVH can achieve excellent results.
♻ ☆ MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification
Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and fine-tunes with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares interactions between learnable prompts with individual image patches and groups of them. This approach improves the model's ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; thereby, we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP.
comment: Published in Transactions on Machine Learning Research (09/2025)
♻ ☆ Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
♻ ☆ EgoBlind: Towards Egocentric Visual Assistance for the Blind NeurIPS'25
We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,392 first-person videos from the daily lives of blind and visually impaired individuals. It also features 5,311 questions directly posed or verified by the blind to reflect their in-situation needs for visual assistance. Each question has an average of 3 manually annotated reference answers to reduce subjectiveness. Using EgoBlind, we comprehensively evaluate 16 advanced MLLMs and find that all models struggle. The best performers achieve an accuracy near 60\%, which is far behind human performance of 87.4\%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and explore heuristic solutions for improvement. With these efforts, we hope that EgoBlind will serve as a foundation for developing effective AI assistants to enhance the independence of the blind and visually impaired. Data and code are available at https://github.com/doc-doc/EgoBlind.
comment: NeurIPS'25 (D&B Track)
♻ ☆ Accelerating Volumetric Medical Image Annotation via Short-Long Memory SAM 2
Manual annotation of volumetric medical images, such as magnetic resonance imaging (MRI) and computed tomography (CT), is a labor-intensive and time-consuming process. Recent advancements in foundation models for video object segmentation, such as Segment Anything Model 2 (SAM 2), offer a potential opportunity to significantly speed up the annotation process by manually annotating one or a few slices and then propagating target masks across the entire volume. However, the performance of SAM 2 in this context varies. Our experiments show that relying on a single memory bank and attention module is prone to error propagation, particularly at boundary regions where the target is present in the previous slice but absent in the current one. To address this problem, we propose Short-Long Memory SAM 2 (SLM-SAM 2), a novel architecture that integrates distinct short-term and long-term memory banks with separate attention modules to improve segmentation accuracy. We evaluate SLM-SAM 2 on four public datasets covering organs, bones, and muscles across MRI, CT, and ultrasound videos. We show that the proposed method markedly outperforms the default SAM 2, achieving an average Dice Similarity Coefficient improvement of 0.14 and 0.10 in the scenarios when 5 volumes and 1 volume are available for the initial adaptation, respectively. SLM-SAM 2 also exhibits stronger resistance to over-propagation, reducing the time required to correct propagated masks by 60.575% per volume compared to SAM 2, making a notable step toward more accurate automated annotation of medical images for segmentation model development.
comment: Accepted for publication in IEEE Transactions on Medical Imaging (IEEE TMI)
♻ ☆ From Drone Imagery to Livability Mapping: AI-powered Environment Perception in Rural China
The high cost of acquiring rural street view images has constrained comprehensive environmental perception in rural areas. Drone photographs, with their advantages of easy acquisition, broad coverage, and high spatial resolution, offer a viable approach for large-scale rural environmental perception. However, a systematic methodology for identifying key environmental elements from drone photographs and quantifying their impact on environmental perception remains lacking. To address this gap, a Vision-Language Contrastive Ranking Framework (VLCR) is designed for rural livability assessment in China. The framework employs chain-of-thought prompting strategies to guide multimodal large language models (MLLMs) in identifying visual features related to quality of life and ecological habitability from drone photographs. Subsequently, to address the instability in pairwise village comparison, a text description-constrained drone photograph comparison strategy is proposed. Finally, to overcome the efficiency bottleneck in nationwide pairwise village comparisons, an innovation ranking algorithm based on binary search interpolation is developed, which reduces the number of comparisons through automated selection of comparison targets. The proposed framework achieves superior performance with a Spearman Footrule distance of 0.74, outperforming mainstream commercial MLLMs by approximately 0.1. Moreover, the mechanism of concurrent comparison and ranking demonstrates a threefold enhancement in computational efficiency. Our framework has achieved data innovation and methodological breakthroughs in village livability assessment, providing strong support for large-scale village livability analysis. Keywords: Drone photographs, Environmental perception, Rural livability assessment, Multimodal large language models, Chain-of-thought prompting.
♻ ☆ Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the proposed SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
comment: v2 note: Corrected numerical values in Table 2 and Figure 4 due to a minor calculation error in v1. The overall conclusions remain unchanged. Code and models available at: https://github.com/tianciB/VFM-VAE
♻ ☆ V2X-Radar: A Multi-modal Dataset with 4D Radar for Cooperative Perception NeurIPS 2025
Modern autonomous vehicle perception systems often struggle with occlusions and limited perception range. Previous studies have demonstrated the effectiveness of cooperative perception in extending the perception range and overcoming occlusions, thereby enhancing the safety of autonomous driving. In recent years, a series of cooperative perception datasets have emerged; however, these datasets primarily focus on cameras and LiDAR, neglecting 4D Radar, a sensor used in single-vehicle autonomous driving to provide robust perception in adverse weather conditions. In this paper, to bridge the gap created by the absence of 4D Radar datasets in cooperative perception, we present V2X-Radar, the first large-scale, real-world multi-modal dataset featuring 4D Radar. V2X-Radar dataset is collected using a connected vehicle platform and an intelligent roadside unit equipped with 4D Radar, LiDAR, and multi-view cameras. The collected data encompasses sunny and rainy weather conditions, spanning daytime, dusk, and nighttime, as well as various typical challenging scenarios. The dataset consists of 20K LiDAR frames, 40K camera images, and 20K 4D Radar data, including 350K annotated boxes across five categories. To support various research domains, we have established V2X-Radar-C for cooperative perception, V2X-Radar-I for roadside perception, and V2X-Radar-V for single-vehicle perception. Furthermore, we provide comprehensive benchmarks across these three sub-datasets. We will release all datasets and benchmark codebase at https://huggingface.co/datasets/yanglei18/V2X-Radar and https://github.com/yanglei18/V2X-Radar.
comment: NeurIPS 2025 Spotlight
♻ ☆ SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding
Recent advancements in image editing have utilized large-scale multimodal models to enable intuitive, natural instruction-driven interactions. However, conventional methods still face significant challenges, particularly in spatial reasoning, precise region segmentation, and maintaining semantic consistency, especially in complex scenes. To overcome these challenges, we introduce SmartFreeEdit, a novel end-to-end framework that integrates a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture, enabling precise, mask-free image editing guided exclusively by natural language instructions. The key innovations of SmartFreeEdit include:(1)the introduction of region aware tokens and a mask embedding paradigm that enhance the spatial understanding of complex scenes;(2) a reasoning segmentation pipeline designed to optimize the generation of editing masks based on natural language instructions;and (3) a hypergraph-augmented inpainting module that ensures the preservation of both structural integrity and semantic coherence during complex edits, overcoming the limitations of local-based image generation. Extensive experiments on the Reason-Edit benchmark demonstrate that SmartFreeEdit surpasses current state-of-the-art methods across multiple evaluation metrics, including segmentation accuracy, instruction adherence, and visual quality preservation, while addressing the issue of local information focus and improving global consistency in the edited image. Our project will be available at https://github.com/smileformylove/SmartFreeEdit.
♻ ☆ Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound
Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully-supervised learning, weakly-supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. In this study, we introduce a novel multi-agent reinforcement learning-based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel-based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents' erasing process precisely, thereby avoiding both under- and over-segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in-house BUS and ABUS datasets, our Flip Learning method outperforms state-of-the-art WSS methods and foundation models, and achieves comparable performance as fully-supervised learning algorithms.
comment: Accepted by Medical Image Analysis. 24 pages, 13 figures, 20 tabels
♻ ☆ Pragmatic Heterogeneous Collaborative Perception via Generative Communication Mechanism NeurIPS 2025
Multi-agent collaboration enhances the perception capabilities of individual agents through information sharing. However, in real-world applications, differences in sensors and models across heterogeneous agents inevitably lead to domain gaps during collaboration. Existing approaches based on adaptation and reconstruction fail to support pragmatic heterogeneous collaboration due to two key limitations: (1) Intrusive retraining of the encoder or core modules disrupts the established semantic consistency among agents; and (2) accommodating new agents incurs high computational costs, limiting scalability. To address these challenges, we present a novel Generative Communication mechanism (GenComm) that facilitates seamless perception across heterogeneous multi-agent systems through feature generation, without altering the original network, and employs lightweight numerical alignment of spatial information to efficiently integrate new agents at minimal cost. Specifically, a tailored Deformable Message Extractor is designed to extract spatial message for each collaborator, which is then transmitted in place of intermediate features. The Spatial-Aware Feature Generator, utilizing a conditional diffusion model, generates features aligned with the ego agent's semantic space while preserving the spatial information of the collaborators. These generated features are further refined by a Channel Enhancer before fusion. Experiments conducted on the OPV2V-H, DAIR-V2X and V2X-Real datasets demonstrate that GenComm outperforms existing state-of-the-art methods, achieving an 81% reduction in both computational cost and parameter count when incorporating new agents. Our code is available at https://github.com/jeffreychou777/GenComm.
comment: 26 pages, 10 figures, accepted to NeurIPS 2025
♻ ☆ DepthVanish: Optimizing Adversarial Interval Structures for Stereo-Depth-Invisible Patches
Stereo depth estimation is a critical task in autonomous driving and robotics, where inaccuracies (such as misidentifying nearby objects as distant) can lead to dangerous situations. Adversarial attacks against stereo depth estimation can help reveal vulnerabilities before deployment. Previous works have shown that repeating optimized textures can effectively mislead stereo depth estimation in digital settings. However, our research reveals that these naively repeated textures perform poorly in physical implementations, i.e., when deployed as patches, limiting their practical utility for stress-testing stereo depth estimation systems. In this work, for the first time, we discover that introducing regular intervals among the repeated textures, creating a grid structure, significantly enhances the patch's attack performance. Through extensive experimentation, we analyze how variations of this novel structure influence the adversarial effectiveness. Based on these insights, we develop a novel stereo depth attack that jointly optimizes both the interval structure and texture elements. Our generated adversarial patches can be inserted into any scenes and successfully attack advanced stereo depth estimation methods of different paradigms, i.e., RAFT-Stereo and STTR. Most critically, our patch can also attack commercial RGB-D cameras (Intel RealSense) in real-world conditions, demonstrating their practical relevance for security assessment of stereo systems. The code is officially released at: https://github.com/WiWiN42/DepthVanish
♻ ☆ DMVFC: Deep Learning Based Functionally Consistent Tractography Fiber Clustering Using Multimodal Diffusion MRI and Functional MRI
Tractography fiber clustering using diffusion MRI (dMRI) is a crucial method for white matter (WM) parcellation to enable analysis of brains structural connectivity in health and disease. Current fiber clustering strategies primarily use the fiber geometric characteristics (i.e., the spatial trajectories) to group similar fibers into clusters, while neglecting the functional and microstructural information of the fiber tracts. There is increasing evidence that neural activity in the WM can be measured using functional MRI (fMRI), providing potentially valuable multimodal information for fiber clustering to enhance its functional coherence. Furthermore, microstructural features such as fractional anisotropy (FA) can be computed from dMRI as additional information to ensure the anatomical coherence of the clusters. In this paper, we develop a novel deep learning fiber clustering framework, namely Deep Multi-view Fiber Clustering (DMVFC), which uses joint multi-modal dMRI and fMRI data to enable functionally consistent WM parcellation. DMVFC can effectively integrate the geometric and microstructural characteristics of the WM fibers with the fMRI BOLD signals along the fiber tracts. DMVFC includes two major components: (1) a multi-view pretraining module to compute embedding features from each source of information separately, including fiber geometry, microstructure measures, and functional signals, and (2) a collaborative fine-tuning module to simultaneously refine the differences of embeddings. In the experiments, we compare DMVFC with two state-of-the-art fiber clustering methods and demonstrate superior performance in achieving functionally meaningful and consistent WM parcellation results.
comment: 14 pages
♻ ☆ MotionGPT3: Human Motion as a Second Modality
With the rapid progress of large language models (LLMs), multimodal frameworks that unify understanding and generation have become promising, yet they face increasing complexity as the number of modalities and tasks grows. We observe that motion quantization introduces approximation errors that cap motion quality, and that unifying discrete text and continuous motion within a single-stream backbone amplifies cross-modal interference. Motivated by recent multi-branch Transformer designs that separate signals from different modalities, we propose MotionGPT3, a bimodal motion-language model for both understanding and generation. MotionGPT3 encodes raw motion into a continuous latent space using a variational autoencoder (VAE), thereby avoiding quantization-induced artifacts, while leveraging the semantic prior of pretrained language models. A dual-stream Transformer with shared attention preserves modality-specific routes while enabling controlled, bidirectional information flow, which reduces interference, stabilizing optimization, and empirically accelerates convergence without degrading fidelity. For multimodal joint training, a generate-then-align three-stage schedule further improves stability and limits cross-task interference. Experiments show that MotionGPT3 achieves 2x faster convergence in training loss and up to 4x faster convergence in validation, while maintaining state-of-the-art performance on standard motion understanding and motion generation benchmarks.
comment: 26 pages, 11 figures
♻ ☆ Risk-adaptive Activation Steering for Safe Multimodal Large Language Models
One of the key challenges of modern AI models is ensuring that they provide helpful responses to benign queries while refusing malicious ones. But often, the models are vulnerable to multimodal queries with harmful intent embedded in images. One approach for safety alignment is training with extensive safety datasets at the significant costs in both dataset curation and training. Inference-time alignment mitigates these costs, but introduces two drawbacks: excessive refusals from misclassified benign queries and slower inference speed due to iterative output adjustments. To overcome these limitations, we propose to reformulate queries to strengthen cross-modal attention to safety-critical image regions, enabling accurate risk assessment at the query level. Using the assessed risk, it adaptively steers activations to generate responses that are safe and helpful without overhead from iterative output adjustments. We call this Risk-adaptive Activation Steering (RAS). Extensive experiments across multiple benchmarks on multimodal safety and utility demonstrate that the RAS significantly reduces attack success rates, preserves general task performance, and improves inference speed over prior inference-time defenses.
♻ ☆ Where and How to Perturb: On the Design of Perturbation Guidance in Diffusion and Flow Models NeurIPS 2025
Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
comment: Accepted at NeurIPS 2025. Project page: https://cvlab-kaist.github.io/HeadHunter/
♻ ☆ AIM: Adaptive Intra-Network Modulation for Balanced Multimodal Learning
Multimodal learning has significantly enhanced machine learning performance but still faces numerous challenges and limitations. Imbalanced multimodal learning is one of the problems extensively studied in recent works and is typically mitigated by modulating the learning of each modality. However, we find that these methods typically hinder the dominant modality's learning to promote weaker modalities, which affects overall multimodal performance. We analyze the cause of this issue and highlight a commonly overlooked problem: optimization bias within networks. To address this, we propose Adaptive Intra-Network Modulation (AIM) to improve balanced modality learning. AIM accounts for differences in optimization state across parameters and depths within the network during modulation, achieving balanced multimodal learning without hindering either dominant or weak modalities for the first time. Specifically, AIM decouples the dominant modality's under-optimized parameters into Auxiliary Blocks and encourages reliance on these performance-degraded blocks for joint training with weaker modalities. This approach effectively prevents suppression of weaker modalities while enabling targeted optimization of under-optimized parameters to improve the dominant modality. Additionally, AIM assesses modality imbalance level across network depths and adaptively adjusts modulation strength at each depth. Experimental results demonstrate that AIM outperforms state-of-the-art imbalanced modality learning methods across multiple benchmarks and exhibits strong generalizability across different backbones, fusion strategies, and optimizers.
comment: 13pages,7 figures
Artificial Intelligence 85
♻ ☆ GTAlign: Game-Theoretic Alignment of LLM Assistants for Social Welfare
Large Language Models (LLMs) have achieved remarkable progress in reasoning, yet sometimes produce responses that are suboptimal for users in tasks such as writing, information seeking, or providing practical guidance. Conventional alignment practices typically assume that maximizing model reward also maximizes user welfare, but this assumption frequently fails in practice: models may over-clarify or generate overly verbose reasoning when users prefer concise answers. Such behaviors resemble the prisoner's dilemma, where individually rational choices lead to socially suboptimal outcomes. The fundamental challenge is the lack of a principled decision making mechanism that mutually benefits both the LLM and the user. We propose Game-Theoretic Alignment (GTAlign), an alignment framework that integrates game-theoretic decision making into both reasoning and training. During reasoning, the model explicitly treats user-LLM interaction as a strategic game: it constructs payoff matrices within its reasoning chain to estimate welfare for both itself and the user, and then selects actions that are mutually beneficial. During training, we introduce a social welfare reward that reinforces cooperative responses, aligning model behavior with socially efficient outcomes. In addition, we introduce an inference technique that leverages game-theoretic reasoning to dynamically adapt LLM's response when pricing policies of LLM service change. Extensive experiments demonstrate that GTAlign substantially improves reasoning efficiency, answer quality, and social welfare compared to baselines across diverse tasks. The code is available at https://github.com/ulab-uiuc/GTAlign .
comment: 31 pages, 6 figures
♻ ☆ SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents
Large Language Model (LLM)-based agents have recently shown impressive capabilities in complex reasoning and tool use via multi-step interactions with their environments. While these agents have the potential to tackle complicated tasks, their problem-solving process, i.e., agents' interaction trajectory leading to task completion, remains underexploited. These trajectories contain rich feedback that can navigate agents toward the right directions for solving problems correctly. Although prevailing approaches, such as Monte Carlo Tree Search (MCTS), can effectively balance exploration and exploitation, they ignore the interdependence among various trajectories and lack the diversity of search spaces, which leads to redundant reasoning and suboptimal outcomes. To address these challenges, we propose SE-Agent, a Self-Evolution framework that enables Agents to optimize their reasoning processes iteratively. Our approach revisits and enhances former pilot trajectories through three key operations: revision, recombination, and refinement. This evolutionary mechanism enables two critical advantages: (1) it expands the search space beyond local optima by intelligently exploring diverse solution paths guided by previous trajectories, and (2) it leverages cross-trajectory inspiration to efficiently enhance performance while mitigating the impact of suboptimal reasoning paths. Through these mechanisms, SE-Agent achieves continuous self-evolution that incrementally improves reasoning quality. We evaluate SE-Agent on SWE-bench Verified to resolve real-world GitHub issues. Experimental results across five strong LLMs show that integrating SE-Agent delivers up to 55% relative improvement, achieving state-of-the-art performance among all open-source agents on SWE-bench Verified. Our code and demonstration materials are publicly available at https://github.com/JARVIS-Xs/SE-Agent.
♻ ☆ TabArena: A Living Benchmark for Machine Learning on Tabular Data NeurIPS 2025
With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.
comment: Accepted (spotlight) at NeurIPS 2025 Datasets and Benchmarks Track. v4: fixed links in comments. v3: NeurIPS camera-ready version. v2: fixed author list. 51 pages. Code available at https://tabarena.ai/code and examples at https://tabarena.ai/code-examples and dataset curation at https://tabarena.ai/data-tabular-ml-iid-study and https://tabarena.ai/dataset-curation
♻ ☆ RELATE: A Schema-Agnostic Perceiver Encoder for Multimodal Relational Graphs
Relational multi-table data is common in domains such as e-commerce, healthcare, and scientific research, and can be naturally represented as heterogeneous temporal graphs with multi-modal node attributes. Existing graph neural networks (GNNs) rely on schema-specific feature encoders, requiring separate modules for each node type and feature column, which hinders scalability and parameter sharing. We introduce RELATE (Relational Encoder for Latent Aggregation of Typed Entities), a schema-agnostic, plug-and-play feature encoder that can be used with any general purpose GNN. RELATE employs shared modality-specific encoders for categorical, numerical, textual, and temporal attributes, followed by a Perceiver-style cross-attention module that aggregates features into a fixed-size, permutation-invariant node representation. We evaluate RELATE on ReLGNN and HGT in the RelBench benchmark, where it achieves performance within 3% of schema-specific encoders while reducing parameter counts by up to 5x. This design supports varying schemas and enables multi-dataset pretraining for general-purpose GNNs, paving the way toward foundation models for relational graph data.
comment: 6 pages
♻ ☆ Automotive Crash Dynamics Modeling Accelerated with Machine Learning
Crashworthiness assessment is a critical aspect of automotive design, traditionally relying on high-fidelity finite element (FE) simulations that are computationally expensive and time-consuming. This work presents an exploratory comparative study on developing machine learning-based surrogate models for efficient prediction of structural deformation in crash scenarios using the NVIDIA PhysicsNeMo framework. Given the limited prior work applying machine learning to structural crash dynamics, the primary contribution lies in demonstrating the feasibility and engineering utility of the various modeling approaches explored in this work. We investigate two state-of-the-art neural network architectures for modeling crash dynamics: MeshGraphNet, and Transolver. Additionally, we examine three strategies for modeling transient dynamics: time-conditional, the standard Autoregressive approach, and a stability-enhanced Autoregressive scheme incorporating rollout-based training. The models are evaluated on a comprehensive Body-in-White (BIW) crash dataset comprising 150 detailed FE simulations using LS-DYNA. The dataset represents a structurally rich vehicle assembly with over 200 components, including 38 key components featuring variable thickness distributions to capture realistic manufacturing variability. Each model utilizes the undeformed mesh geometry and component characteristics as inputs to predict the spatiotemporal evolution of the deformed mesh during the crash sequence. Evaluation results show that the models capture the overall deformation trends with reasonable fidelity, demonstrating the feasibility of applying machine learning to structural crash dynamics. Although not yet matching full FE accuracy, the models achieve orders-of-magnitude reductions in computational cost, enabling rapid design exploration and early-stage optimization in crashworthiness evaluation.
♻ ☆ Non-Contact Health Monitoring During Daily Personal Care Routines
Remote photoplethysmography (rPPG) enables non-contact, continuous monitoring of physiological signals and offers a practical alternative to traditional health sensing methods. Although rPPG is promising for daily health monitoring, its application in long-term personal care scenarios, such as mirror-facing routines in high-altitude environments, remains challenging due to ambient lighting variations, frequent occlusions from hand movements, and dynamic facial postures. To address these challenges, we present LADH (Long-term Altitude Daily Health), the first long-term rPPG dataset containing 240 synchronized RGB and infrared (IR) facial videos from 21 participants across five common personal care scenarios, along with ground-truth PPG, respiration, and blood oxygen signals. Our experiments demonstrate that combining RGB and IR video inputs improves the accuracy and robustness of non-contact physiological monitoring, achieving a mean absolute error (MAE) of 4.99 BPM in heart rate estimation. Furthermore, we find that multi-task learning enhances performance across multiple physiological indicators simultaneously. Dataset and code are open at https://github.com/McJackTang/FusionVitals.
comment: IEEE BSN 2025
♻ ☆ Mixed-Density Diffuser: Efficient Planning with Non-uniform Temporal Resolution
Recent studies demonstrate that diffusion planners benefit from sparse-step planning over single-step planning. Training models to skip steps in their trajectories helps capture long-term dependencies without additional or memory computational cost. However, predicting excessively sparse plans degrades performance. We hypothesize this temporal density threshold is non-uniform across a temporal horizon and that certain parts of a planned trajectory should be more densely planned. We propose Mixed Density Diffuser (MDD), a diffusion planner where the densities throughout the horizon are tunable hyperparameters. MDD achieves a new SOTA across the Maze2D, Franka Kitchen, and Antmaze D4RL task domains.
comment: European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESSAN) (under review)
♻ ☆ AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement
We introduce AnyEnhance, a unified generative model for voice enhancement that processes both speech and singing voices. Based on a masked generative model, AnyEnhance is capable of handling both speech and singing voices, supporting a wide range of enhancement tasks including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all simultaneously and without fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning, which allows the model to natively accept a reference speaker's timbre. In this way, it could boost enhancement performance when a reference audio is available and enable the target speaker extraction task without altering the underlying architecture. Moreover, we also introduce a self-critic mechanism into the generative process for masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments on various enhancement tasks demonstrate AnyEnhance outperforms existing methods in terms of both objective metrics and subjective listening tests. Demo audios are publicly available at https://amphionspace.github.io/anyenhance. An open-source implementation is provided at https://github.com/viewfinder-annn/anyenhance-v1-ccf-aatc.
comment: Accepted by IEEE TASLP 2025. Demopage: https://amphionspace.github.io/anyenhance. Open-source implementation: https://github.com/viewfinder-annn/anyenhance-v1-ccf-aatc
♻ ☆ Rethinking Visual Intelligence: Insights from Video Pretraining
Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
comment: Updated version from preprint arXiv:2506.07280 (Gen2Gen) focused on visual intelligence. This work can be considered as v2
♻ ☆ Forging Time Series with Language: A Large Language Model Approach to Synthetic Data Generation
SDForger is a flexible and efficient framework for generating high-quality multivariate time series using LLMs. Leveraging a compact data representation, SDForger provides synthetic time series generation from a few samples and low-computation fine-tuning of any autoregressive LLM. Specifically, the framework transforms univariate and multivariate signals into tabular embeddings, which are then encoded into text and used to fine-tune the LLM. At inference, new textual embeddings are sampled and decoded into synthetic time series that retain the original data's statistical properties and temporal dynamics. Across a diverse range of datasets, SDForger outperforms existing generative models in many scenarios, both in similarity-based evaluations and downstream forecasting tasks. By enabling textual conditioning in the generation process, SDForger paves the way for multimodal modeling and the streamlined integration of time series with textual information. The model is open-sourced at https://github.com/IBM/fms-dgt/tree/main/fms_dgt/public/databuilders/time_series.
♻ ☆ Identity Increases Stability in Neural Cellular Automata
Neural Cellular Automata (NCAs) offer a way to study the growth of two-dimensional artificial organisms from a single seed cell. From the outset, NCA-grown organisms have had issues with stability, their natural boundary often breaking down and exhibiting tumour-like growth or failing to maintain the expected shape. In this paper, we present a method for improving the stability of NCA-grown organisms by introducing an 'identity' layer with simple constraints during training. Results show that NCAs grown in close proximity are more stable compared with the original NCA model. Moreover, only a single identity value is required to achieve this increase in stability. We observe emergent movement from the stable organisms, with increasing prevalence for models with multiple identity values. This work lays the foundation for further study of the interaction between NCA-grown organisms, paving the way for studying social interaction at a cellular level in artificial organisms. Code/Videos available at: https://github.com/jstovold/ALIFE2025
comment: Accepted to ALIFE 2025
♻ ☆ Double Descent Meets Out-of-Distribution Detection: Theoretical Insights and Empirical Analysis on the role of model complexity NeurIPS 2025
Out-of-distribution (OOD) detection is essential for ensuring the reliability and safety of machine learning systems. In recent years, it has received increasing attention, particularly through post-hoc detection and training-based methods. In this paper, we focus on post-hoc OOD detection, which enables identifying OOD samples without altering the model's training procedure or objective. Our primary goal is to investigate the relationship between model capacity and its OOD detection performance. Specifically, we aim to answer the following question: Does the Double Descent phenomenon manifest in post-hoc OOD detection? This question is crucial, as it can reveal whether overparameterization, which is already known to benefit generalization, can also enhance OOD detection. Despite the growing interest in these topics by the classic supervised machine learning community, this intersection remains unexplored for OOD detection. We empirically demonstrate that the Double Descent effect does indeed appear in post-hoc OOD detection. Furthermore, we provide theoretical insights to explain why this phenomenon emerges in such setting. Finally, we show that the overparameterized regime does not yield superior results consistently, and we propose a method to identify the optimal regime for OOD detection based on our observations.
comment: Accepted at NeurIPS 2025 (Conference on Neural Information Processing Systems)
♻ ☆ Dynamic Forgetting and Spatio-Temporal Periodic Interest Modeling for Local-Life Service Recommendation
In the context of the booming digital economy, recommendation systems, as a key link connecting users and numerous services, face challenges in modeling user behavior sequences on local-life service platforms, including the sparsity of long sequences and strong spatio-temporal dependence. Such challenges can be addressed by drawing an analogy to the forgetting process in human memory. This is because users' responses to recommended content follow the recency effect and the cyclicality of memory. By exploring this, this paper introduces the forgetting curve and proposes Spatio-Temporal periodic Interest Modeling (STIM) with long sequences for local-life service recommendation. STIM integrates three key components: a dynamic masking module based on the forgetting curve, which is used to extract both recent spatiotemporal features and periodic spatiotemporal features; a query-based mixture of experts (MoE) approach that can adaptively activate expert networks under different dynamic masks, enabling the collaborative modeling of time, location, and items; and a hierarchical multi-interest network unit, which captures multi-interest representations by modeling the hierarchical interactions between the shallow and deep semantics of users' recent behaviors. By introducing the STIM method, we conducted online A/B tests and achieved a 1.54\% improvement in gross transaction volume (GTV). In addition, extended offline experiments also showed improvements. STIM has been deployed in a large-scale local-life service recommendation system, serving hundreds of millions of daily active users in core application scenarios.
♻ ☆ Combinatorial Creativity: A New Frontier in Generalization Abilities
Artificial intelligence (AI) systems, and Large Language Models (LLMs) in particular, are increasingly employed for creative tasks like scientific idea generation, constituting a form of generalization from training data unaddressed by existing conceptual frameworks. Despite its similarities to compositional generalization (CG), combinatorial creativity (CC) is an open-ended ability. Instead of evaluating for accuracy or correctness against fixed targets, which would contradict the open-ended nature of CC, we propose a theoretical framework and algorithmic task for evaluating outputs by their degrees of novelty and utility. From here, we make several important empirical contributions: (1) We obtain the first insights into the scaling behavior of creativity for LLMs. (2) We discover that, for fixed compute budgets, there exist optimal model depths and widths for creative ability. (3) We find that the ideation-execution gap, whereby LLMs excel at generating novel scientific ideas but struggle to ensure their practical feasibility, may be explained by a more fundamental novelty-utility tradeoff characteristic of creativity algorithms in general. Importantly, this tradeoff remains persistent even at scale, casting doubt on the long-term creative potential of LLMs in their current form. Together, our conceptual framework and empirical findings provide a foundation for understanding and improving creativity in modern AI models, bridging the gap between human and machine intelligence.
comment: Preprint. The first two authors contributed equally
♻ ☆ A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design
Generative models, particularly Large Language Models (LLMs), produce fluent outputs yet lack verifiable guarantees. We adapt Design by Contract (DbC) and type-theoretic principles to introduce a contract layer that mediates every LLM call. Contracts stipulate semantic and type requirements on inputs and outputs, coupled with probabilistic remediation to steer generation toward compliance. The layer exposes the dual view of LLMs as semantic parsers and probabilistic black-box components. Contract satisfaction is probabilistic and semantic validation is operationally defined through programmer-specified conditions on well-typed data structures. More broadly, this work postulates that any two agents satisfying the same contracts are \emph{functionally equivalent} with respect to those contracts.
comment: 4 pages, 1 figure
♻ ☆ Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models
Open-weight bio-foundation models present a dual-use dilemma. While holding great promise for accelerating scientific research and drug development, they could also enable bad actors to develop more deadly bioweapons. To mitigate the risk posed by these models, current approaches focus on filtering biohazardous data during pre-training. However, the effectiveness of such an approach remains unclear, particularly against determined actors who might fine-tune these models for malicious use. To address this gap, we propose \eval, a framework to evaluate the robustness of procedures that are intended to reduce the dual-use capabilities of bio-foundation models. \eval assesses models' virus understanding through three lenses, including sequence modeling, mutational effects prediction, and virulence prediction. Our results show that current filtering practices may not be particularly effective: Excluded knowledge can be rapidly recovered in some cases via fine-tuning, and exhibits broader generalizability in sequence modeling. Furthermore, dual-use signals may already reside in the pretrained representations, and can be elicited via simple linear probing. These findings highlight the challenges of data filtering as a standalone procedure, underscoring the need for further research into robust safety and security strategies for open-weight bio-foundation models.
comment: 17 Pages, 5 figures
♻ ☆ TinyDef-DETR: A Transformer-Based Framework for Defect Detection in Transmission Lines from UAV Imagery
Automated defect detection from UAV imagery of transmission lines is a challenging task due to the small size, ambiguity, and complex backgrounds of defects. This paper proposes TinyDef-DETR, a DETR-based framework designed to achieve accurate and efficient detection of transmission line defects from UAV-acquired images. The model integrates four major components: an edge-enhanced ResNet backbone to strengthen boundary-sensitive representations, a stride-free space-to-depth module to enable detail-preserving downsampling, a cross-stage dual-domain multi-scale attention mechanism to jointly model global context and local cues, and a Focaler-Wise-SIoU regression loss to improve the localization of small and difficult objects. Together, these designs effectively mitigate the limitations of conventional detectors. Extensive experiments on both public and real-world datasets demonstrate that TinyDef-DETR achieves superior detection performance and strong generalization capability, while maintaining modest computational overhead. The accuracy and efficiency of TinyDef-DETR make it a suitable method for UAV-based transmission line defect detection, particularly in scenarios involving small and ambiguous objects.
♻ ☆ Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds
In-Context Reinforcement Learning (ICRL) enables agents to learn automatically and on-the-fly from their interactive experiences. However, a major challenge in scaling up ICRL is the lack of scalable task collections. To address this, we propose the procedurally generated tabular Markov Decision Processes, named AnyMDP. Through a carefully designed randomization process, AnyMDP is capable of generating high-quality tasks on a large scale while maintaining relatively low structural biases. To facilitate efficient meta-training at scale, we further introduce decoupled policy distillation and induce prior information in the ICRL framework. Our results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize to tasks that were not considered in the training set through versatile in-context learning paradigms. The scalable task set provided by AnyMDP also enables a more thorough empirical investigation of the relationship between data distribution and ICRL performance. We further show that the generalization of ICRL potentially comes at the cost of increased task diversity and longer adaptation periods. This finding carries critical implications for scaling robust ICRL capabilities, highlighting the necessity of diverse and extensive task design, and prioritizing asymptotic performance over few-shot adaptation.
comment: NeruIPS 2025
♻ ☆ Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies
Reinforcement learning (RL) systems have countless applications, from energy-grid management to protein design. However, such real-world scenarios are often extremely difficult, combinatorial in nature, and require complex coordination between multiple agents. This level of complexity can cause even state-of-the-art RL systems, trained until convergence, to hit a performance ceiling which they are unable to break out of with zero-shot inference. Meanwhile, many digital or simulation-based applications allow for an inference phase that utilises a specific time and compute budget to explore multiple attempts before outputting a final solution. In this work, we show that such an inference phase employed at execution time, and the choice of a corresponding inference strategy, are key to breaking the performance ceiling observed in complex multi-agent RL problems. Our main result is striking: we can obtain up to a 126% and, on average, a 45% improvement over the previous state-of-the-art across 17 tasks, using only a couple seconds of extra wall-clock time during execution. We also demonstrate promising compute scaling properties, supported by over 60k experiments, making it the largest study on inference strategies for complex RL to date. Our experimental data and code are available at https://sites.google.com/view/inference-strategies-rl.
comment: Neurips '25 version
♻ ☆ RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning
Real-world robotic manipulation in homes and factories demands reliability, efficiency, and robustness that approach or surpass skilled human operators. We present RL-100, a real-world reinforcement learning training framework built on diffusion visuomotor policies trained by supervised learning. RL-100 introduces a three-stage pipeline. First, imitation learning leverages human priors. Second, iterative offline reinforcement learning uses an Offline Policy Evaluation procedure, abbreviated OPE, to gate PPO-style updates that are applied in the denoising process for conservative and reliable improvement. Third, online reinforcement learning eliminates residual failure modes. An additional lightweight consistency distillation head compresses the multi-step sampling process in diffusion into a single-step policy, enabling high-frequency control with an order-of-magnitude reduction in latency while preserving task performance. The framework is task-, embodiment-, and representation-agnostic and supports both 3D point clouds and 2D RGB inputs, a variety of robot platforms, and both single-step and action-chunk policies. We evaluate RL-100 on seven real-robot tasks spanning dynamic rigid-body control, such as Push-T and Agile Bowling, fluids and granular pouring, deformable cloth folding, precise dexterous unscrewing, and multi-stage orange juicing. RL-100 attains 100\% success across evaluated trials for a total of 900 out of 900 episodes, including up to 250 out of 250 consecutive trials on one task. The method achieves near-human teleoperation or better time efficiency and demonstrates multi-hour robustness with uninterrupted operation lasting up to two hours.
comment: https://lei-kun.github.io/RL-100/
♻ ☆ Augmenting learning in neuro-embodied systems through neurobiological first principles
Recent progress in artificial intelligence (AI) has been driven by insights from physics and neuroscience, particularly through the development of artificial neural networks (ANNs) capable of complex cognitive tasks such as vision and language processing. Despite these advances, they struggle with continual learning, adaptable knowledge transfer, robustness, and resource efficiency -- capabilities that biological systems handle seamlessly. Specifically, neuromorphic systems and artificial neural networks often overlook two key biophysical properties of neural circuits: neuronal diversity and cell-specific neuromodulation. These mechanisms, essential for regulating dynamic learning across brain scales, allow neuromodulators to introduce degeneracy in biological neural networks, ensuring stability and adaptability under changing conditions. In this article, we summarize recent bioinspired models, learning rules, and architectures, and propose a framework for augmenting ANNs, which has the potential to bridge the gap between neuroscience and AI through neurobiological first principles. Our proposed dual-framework approach leverages spiking neural networks to emulate diverse spiking behaviors and dendritic compartmental dynamics, thereby simulating the morphological and functional diversity of neuronal computations. Finally, we outline how integrating these biophysical principles into task-driven spiking neural networks and neuromorphic systems provides scalable solutions for continual learning, adaptability, robustness, and resource-efficiency. Additionally, this approach will not only provide insights into how emergent behaviors arise in neural networks but also catalyze the development of more efficient, reliable, and intelligent neuromorphic systems and robotic agents.
comment: 26 pages, 4 figures, 3 boxes, 1 table
♻ ☆ Neuro-Symbolic Imitation Learning: Discovering Symbolic Abstractions for Skill Learning ICRA
Imitation learning is a popular method for teaching robots new behaviors. However, most existing methods focus on teaching short, isolated skills rather than long, multi-step tasks. To bridge this gap, imitation learning algorithms must not only learn individual skills but also an abstract understanding of how to sequence these skills to perform extended tasks effectively. This paper addresses this challenge by proposing a neuro-symbolic imitation learning framework. Using task demonstrations, the system first learns a symbolic representation that abstracts the low-level state-action space. The learned representation decomposes a task into easier subtasks and allows the system to leverage symbolic planning to generate abstract plans. Subsequently, the system utilizes this task decomposition to learn a set of neural skills capable of refining abstract plans into actionable robot commands. Experimental results in three simulated robotic environments demonstrate that, compared to baselines, our neuro-symbolic approach increases data efficiency, improves generalization capabilities, and facilitates interpretability.
comment: IEEE International Conference on Robotics and Automation (ICRA) 2025
♻ ☆ What Makes Looped Transformers Perform Better Than Non-Recursive Ones (Provably)
While looped transformers (termed as Looped-Attn) often outperform standard transformers (termed as Single-Attn) on complex reasoning tasks, the theoretical basis for this advantage remains underexplored. In this paper, we explain this phenomenon through the lens of loss landscape geometry, inspired by empirical observations of their distinct dynamics at both sample and Hessian levels. To formalize this, we extend the River-Valley landscape model by distinguishing between U-shaped valleys (flat) and V-shaped valleys (steep). Based on empirical observations, we conjecture that the recursive architecture of Looped-Attn induces a landscape-level inductive bias towards River-V-Valley. Theoretical derivations based on this inductive bias guarantee a better loss convergence along the river due to valley hopping, and further encourage learning about complex patterns compared to the River-U-Valley induced by Single-Attn. Building on this insight, we propose SHIFT (Staged HIerarchical Framework for Progressive Training), a staged training framework that accelerates the training process of Looped-Attn while achieving comparable performances.
♻ ☆ DuSEGO: Dual Second-order Equivariant Graph Ordinary Differential Equation
Graph Neural Networks (GNNs) with equivariant properties have achieved significant success in modeling complex dynamic systems and molecular properties. However, their expressiveness ability is limited by: (1) Existing methods often overlook the over-smoothing issue caused by traditional GNN models, as well as the gradient explosion or vanishing problems in deep GNNs. (2) Most models operate on first-order information, neglecting that the real world often consists of second-order systems, which further limits the model's representation capabilities. To address these issues, we propose the \textbf{Du}al \textbf{S}econd-order \textbf{E}quivariant \textbf{G}raph \textbf{O}rdinary Differential Equation (\method{}) for equivariant representation. Specifically, \method{} apply the dual second-order equivariant graph ordinary differential equations (Graph ODEs) on graph embeddings and node coordinates, simultaneously. Theoretically, we first prove that \method{} maintains the equivariant property. Furthermore, we provide theoretical insights showing that \method{} effectively alleviates the over-smoothing problem in both feature representation and coordinate update. Additionally, we demonstrate that the proposed \method{} mitigates the exploding and vanishing gradients problem, facilitating the training of deep multi-layer GNNs. Extensive experiments on benchmark datasets validate the superiority of the proposed \method{} compared to baselines.
♻ ☆ Flat Channels to Infinity in Neural Loss Landscapes NeurIPS'25
The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm$infinity, and their input weight vectors, $\mathbf{w_i}$ and $\mathbf{w_j}$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w_i} \cdot \mathbf{x}) + a_j\sigma(\mathbf{w_j} \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) \sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
comment: Accepted to NeurIPS'25
♻ ☆ GroupSHAP-Guided Integration of Financial News Keywords and Technical Indicators for Stock Price Prediction
Recent advances in finance-specific language models such as FinBERT have enabled the quantification of public sentiment into index-based measures, yet compressing diverse linguistic signals into single metrics overlooks contextual nuances and limits interpretability. To address this limitation, explainable AI techniques, particularly SHAP (SHapley Additive Explanations), have been employed to identify influential features. However, SHAP's computational cost grows exponentially with input features, making it impractical for large-scale text-based financial data. This study introduces a GRU-based forecasting framework enhanced with GroupSHAP, which quantifies contributions of semantically related keyword groups rather than individual tokens, substantially reducing computational burden while preserving interpretability. We employed FinBERT to embed news articles from 2015 to 2024, clustered them into coherent semantic groups, and applied GroupSHAP to measure each group's contribution to stock price movements. The resulting group-level SHAP variables across multiple topics were used as input features for the prediction model. Empirical results from one-day-ahead forecasting of the S&P 500 index throughout 2024 demonstrate that our approach achieves a 32.2% reduction in MAE and a 40.5% reduction in RMSE compared with benchmark models without the GroupSHAP mechanism. This research presents the first application of GroupSHAP in news-driven financial forecasting, showing that grouped sentiment representations simultaneously enhance interpretability and predictive performance.
comment: 6 pages
♻ ☆ Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models
We reinterpret Kant's Critique of Pure Reason as a theory of feedback stability, viewing reason as a regulator that keeps inference within the bounds of possible experience. We formalize this intuition via a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. In linear-Gaussian simulations, higher H-Risk predicts overconfident errors even under formal stability, revealing a gap between nominal and epistemic stability. Extending to large language models (LLMs), we observe preliminary correlations between internal fragility and miscalibration or hallucination (confabulation), and find that lightweight critique prompts may modestly improve or worsen calibration in small-scale tests. These results suggest a structural bridge between Kantian self-limitation and feedback control, offering a principled lens to diagnose and potentially mitigate overconfidence in reasoning systems.
comment: 21 pages, 2 figures, preliminary version
♻ ☆ Preliminary study on artificial intelligence methods for cybersecurity threat detection in computer networks based on raw data packets
Most of the intrusion detection methods in computer networks are based on traffic flow characteristics. However, this approach may not fully exploit the potential of deep learning algorithms to directly extract features and patterns from raw packets. Moreover, it impedes real-time monitoring due to the necessity of waiting for the processing pipeline to complete and introduces dependencies on additional software components. In this paper, we investigate deep learning methodologies capable of detecting attacks in real-time directly from raw packet data within network traffic. We propose a novel approach where packets are stacked into windows and separately recognised, with a 2D image representation suitable for processing with computer vision models. Our investigation utilizes the CIC IDS-2017 dataset, which includes both benign traffic and prevalent real-world attacks, providing a comprehensive foundation for our research.
comment: Submitted to Computer Science Journal. Version with updated acknowledgments
♻ ☆ How Similar Are Grokipedia and Wikipedia? A Multi-Dimensional Textual and Structural Comparison
The launch of Grokipedia, an AI-generated encyclopedia developed by Elon Musk's xAI, was presented as a response to perceived ideological and structural biases in Wikipedia, aiming to produce "truthful" entries via the large language model Grok. Yet whether an AI-driven alternative can escape the biases and limitations of human-edited platforms remains unclear. This study undertakes a large-scale computational comparison of 1,800 matched article pairs between Grokipedia and Wikipedia, drawn from the 2,000 most-edited Wikipedia pages. Using metrics across lexical richness, readability, structural organization, reference density, and semantic similarity, we assess how closely the two platforms align in form and substance. The results show that while Grokipedia exhibits strong semantic and stylistic alignment with Wikipedia, it typically produces longer but less lexically diverse articles, with fewer references per word and greater structural variability. These findings suggest that AI-generated encyclopedic content currently mirrors Wikipedia's informational scope but diverges in editorial norms, favoring narrative expansion over citation-based verification. The implications highlight new tensions around transparency, provenance, and the governance of knowledge in an era of automated text generation.
comment: 13 pages, 5 figures, 2 tables, updated with larger sample size of 2000 articles, better text cleaning proceedure
♻ ☆ New Encoders for German Trained from Scratch: Comparing ModernGBERT with Converted LLM2Vec Models LREC
Encoders remain essential for efficient German NLP and NLU scenarios despite the rise of decoder-only LLMs. This work studies two routes to high-quality German encoders under identical data and training constraints: 1) training from scratch and 2) converting decoders via LLM2Vec. We introduce two resources: ModernGBERT (134M, 1B), fully transparent German encoders in the ModernBERT style, and LL\"aMmleinVec (120M, 1B, 7B), decoder-to-encoder conversions trained with masked next-token prediction, both undergoing a context extension to 8.192 tokens. Across SuperGLEBer, ModernGBERT 1B sets a new state of the art (avg 0.808), surpassing GBERT Large (+4%) and the seven-times larger converted 7B model (0.787). On German MTEB after supervised fine-tuning, ModernGBERT 1B (0.551) approaches the converted 7B model (0.557). We release all models, checkpoints, datasets, and full training records, and introduce an encoder-adapted QA-NIAH evaluation. All in all, our results provide actionable guidance: when parameter efficiency and latency matter, from-scratch encoders dominate. When a pre-trained decoder exists and compute is a limited, conversion offers an effective alternative. ModernGBERT and LL\"aMmleinVec, including all code, data and intermediary checkpoints are published under a research-only RAIL license.
comment: under review @LREC
♻ ☆ JudgeLRM: Large Reasoning Models as a Judge
Large Language Models (LLMs) are increasingly adopted as evaluators, offering a scalable alternative to human annotation. However, existing supervised fine-tuning (SFT) approaches often fall short in domains that demand complex reasoning. Judgment is inherently reasoning-intensive: beyond surface-level scoring, it requires verifying evidence, identifying errors, and justifying decisions. Through the analysis of evaluation tasks, we find a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, revealing the limits of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs, trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards to activate reasoning capabilities. JudgeLRM consistently outperform SFT-tuned baselines in the same size, as well as other RL and SFT variants, and even surpass state-of-the-art reasoning models: notably, JudgeLRM-3B/4B exceeds GPT-4, while JudgeLRM-7B/8B/14B outperforms DeepSeek-R1 by over 2% in F1 score, with particularly strong gains on reasoning-heavy tasks. Our findings underscore the value of RL in unlocking reasoning-aligned LLM judges.
comment: Preprint
♻ ☆ Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time ICLR 2026
Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.'') teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.
comment: 40 pages, 22 figures. Under review at ICLR 2026
♻ ☆ In Dialogue with Intelligence: Rethinking Large Language Models as Collective Knowledge
Large Language Models (LLMs) can be understood as Collective Knowledge (CK): a condensation of human cultural and technical output, whose apparent intelligence emerges in dialogue. This perspective article, drawing on extended interaction with ChatGPT-4, postulates differential response modes that plausibly trace their origin to distinct model subnetworks. It argues that CK has no persistent internal state or ``spine'': it drifts, it complies, and its behaviour is shaped by the user and by fine-tuning. It develops the notion of co-augmentation, in which human judgement and CK's representational reach jointly produce forms of analysis that neither could generate alone. Finally, it suggests that CK offers a tractable object for neuroscience: unlike biological brains, these systems expose their architecture, training history, and activation dynamics, making the human--CK loop itself an experimental target.
comment: 7 pages, 1 table
♻ ☆ AI-Guided Molecular Simulations in VR: Exploring Strategies for Imitation Learning in Hyperdimensional Molecular Systems ECAI24
Molecular dynamics (MD) simulations are a crucial computational tool for researchers to understand and engineer molecular structure and function in areas such as drug discovery, protein engineering, and material design. Despite their utility, MD simulations are expensive, owing to the high dimensionality of molecular systems. Interactive molecular dynamics in virtual reality (iMD-VR) has recently emerged as a "human-in-the-loop" strategy for efficiently navigating hyper-dimensional molecular systems. By providing an immersive 3D environment that enables visualization and manipulation of real-time molecular simulations running on high-performance computing architectures, iMD-VR enables researchers to reach out and guide molecular conformational dynamics, in order to efficiently explore complex, high-dimensional molecular systems. Moreover, iMD-VR simulations generate rich datasets that capture human experts' spatial insight regarding molecular structure and function. This paper explores the use of researcher-generated iMD-VR datasets to train AI agents via imitation learning (IL). IL enables agents to mimic complex behaviours from expert demonstrations, circumventing the need for explicit programming or intricate reward design. In this article, we review IL across robotics and Multi-agents systems domains which are comparable to iMD-VR, and discuss how iMD-VR recordings could be used to train IL models to interact with MD simulations. We then illustrate the applications of these ideas through a proof-of-principle study where iMD-VR data was used to train a CNN network on a simple molecular manipulation task; namely, threading a small molecule through a nanotube pore. Finally, we outline future research directions and potential challenges of using AI agents to augment human expertise in navigating vast molecular conformational spaces.
comment: (First presented at the First Workshop on "eXtended Reality \& Intelligent Agents" (XRIA24) @ ECAI24, Santiago De Compostela (Spain), 20 October 2024)
♻ ☆ Open Agent Specification (Agent Spec) Technical Report
Open Agent Specification (Agent Spec) is a declarative language for defining AI agents and workflows in a way that is compatible across different AI frameworks, promoting portability and interoperability within AI Agent frameworks. Agent Spec aims to resolve the challenges of fragmented agent development by providing a common unified specification that allows AI agents to be designed once and deployed across various frameworks, improving interoperability and reusability, while reducing redundant efforts. Additionally, Agent Spec facilitates development tools and portability, allowing AI agents to be defined independently of their execution environment and enabling teams to exchange solutions without implementation-specific limitations. Agent Spec benefits four key groups: (i) Agent developers, who gain a superset of reusable components and design patterns, enabling them to leverage a broader range of functionalities; (ii) Agent framework and tool developers, who can use Agent Spec as an interchange format and therefore benefit from cross-framework and tool support; (iii) Researchers, who can achieve reproducible results and comparability, facilitating more reliable and consistent outcomes; (iv) Enterprises, which see faster prototype-to-deployment, increased productivity, and greater scalability and maintainability for their AI agent solutions. This technical report provides an overview of the technical foundations of Agent Spec, including motivation, benefits, and future work. We also introduce a standardized Evaluation harness to assess agent behavior and agentic workflows across runtimes (LangGraph, CrewAI, AutoGen, and WayFlow), using three different benchmarks (SimpleQA Verified, $\tau^2$-Bench and BIRD-SQL) - analogous to how HELM and related harnesses standardized LLM evaluation - so that performance, robustness, and efficiency can be compared consistently across frameworks.
♻ ☆ UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of pretraining compute and 1/10 of downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.
comment: Accepted to RSS 2025. Code is available at https://github.com/OpenDriveLab/UniVLA
♻ ☆ The Limits of AI Explainability: An Algorithmic Information Theory Approach
This paper establishes a theoretical foundation for understanding the fundamental limits of AI explainability through algorithmic information theory. We formalize explainability as the approximation of complex models by simpler ones, quantifying both approximation error and explanation complexity using Kolmogorov complexity. Our key theoretical contributions include: (1) a complexity gap theorem proving that any explanation significantly simpler than the original model must differ from it on some inputs; (2) precise bounds showing that explanation complexity grows exponentially with input dimension but polynomially with error tolerance for Lipschitz functions; and (3) a characterization of the gap between local and global explainability, demonstrating that local explanations can be significantly simpler while maintaining accuracy in relevant regions. We further establish a regulatory impossibility theorem proving that no governance framework can simultaneously pursue unrestricted AI capabilities, human-interpretable explanations, and negligible error. These results highlight considerations likely to be relevant to the design, evaluation, and oversight of explainable AI systems.
♻ ☆ ConTextTab: A Semantics-Aware Tabular In-Context Learner NeurIPS 2025
Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, recent advances such as TabPFN and TabICL have extended its use to larger datasets. Although current table-native ICL architectures are architecturally efficient and well-adapted to tabular data structures, their exclusive training on synthetic data limits their ability to fully leverage the rich semantics and world knowledge contained in real-world tabular data. At the other end of the spectrum, tabular ICL models based on pretrained large language models such as TabuLa-8B integrate deep semantic understanding and world knowledge but are only able to make use of a small amount of context due to inherent architectural limitations. With the aim to combine the best of both these worlds, we introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark. Code and model checkpoints are available at: https://github.com/SAP-samples/sap-rpt-1-oss.
comment: Accepted as spotlight at NeurIPS 2025
♻ ☆ 3DViT-GAT: A Unified Atlas-Based 3D Vision Transformer and Graph Learning Framework for Major Depressive Disorder Detection Using Structural MRI Data
Major depressive disorder (MDD) is a prevalent mental health condition that negatively impacts both individual well-being and global public health. Automated detection of MDD using structural magnetic resonance imaging (sMRI) and deep learning (DL) methods holds increasing promise for improving diagnostic accuracy and enabling early intervention. Most existing methods employ either voxel-level features or handcrafted regional representations built from predefined brain atlases, limiting their ability to capture complex brain patterns. This paper develops a unified pipeline that utilizes Vision Transformers (ViTs) for extracting 3D region embeddings from sMRI data and Graph Neural Network (GNN) for classification. We explore two strategies for defining regions: (1) an atlas-based approach using predefined structural and functional brain atlases, and (2) an cube-based method by which ViTs are trained directly to identify regions from uniformly extracted 3D patches. Further, cosine similarity graphs are generated to model interregional relationships, and guide GNN-based classification. Extensive experiments were conducted using the REST-meta-MDD dataset to demonstrate the effectiveness of our model. With stratified 10-fold cross-validation, the best model obtained 78.98% accuracy, 76.54% sensitivity, 81.58% specificity, 81.58% precision, and 78.98% F1-score. Further, atlas-based models consistently outperformed the cube-based approach, highlighting the importance of using domain-specific anatomical priors for MDD detection.
comment: 14 pages, 1 figure, 7 tables
♻ ☆ Runtime Analysis of Evolutionary Algorithms for Multi-party Multi-objective Optimization
In scenarios where multiple decision-makers operate within a common decision space, each focusing on their own multi-objective optimization problem (e.g., bargaining games), the problem can be modeled as a multi-party multi-objective optimization problem (MPMOP). While numerous evolutionary algorithms have been proposed to solve MPMOPs, most results remain empirical. This paper presents the first theoretical analysis of the expected runtime of evolutionary algorithms on bi-party multi-objective optimization problems (BPMOPs). Our findings demonstrate that employing traditional multi-objective optimization algorithms to solve MPMOPs is both time-consuming and inefficient, as the resulting population contains many solutions that fail to achieve consensus among decision-makers. An alternative approach involves decision-makers individually solving their respective optimization problems and seeking consensus only in the final stage. While feasible for pseudo-Boolean optimization problems, this method may fail to guarantee approximate performance for one party in NP-hard problems. Finally, we propose evolutionary multi-party multi-objective optimizers (EMPMO) for pseudo-Boolean optimization and shortest path problems within a multi-party multi-objective context, maintain a common solution set among all parties. Theoretical and experimental results demonstrate that the proposed \( \text{EMPMO}_{\text{random}} \) outperforms previous algorithms in terms of the lower bound on the expected runtime for pseudo-Boolean optimization problems. Additionally, the consensus-based evolutionary multi-party multi-objective optimizer( \( \text{EMPMO}_{\text{cons}}^{\text{SP}} \) ) achieves better efficiency and precision in solving shortest path problems compared to existing algorithms.
♻ ☆ InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research
AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end assessment of agents performing Large Language Model (LLM) research. It comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction, which require runnable artifacts and assessment of correctness, performance, output quality, and uncertainty. To support agent operation, we develop ResearchGym, a research environment offering rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving. We also implement a lightweight ReAct agent that couples explicit reasoning with executable planning using frontier models such as Claude-4, GPT-5, GLM-4.5, and Kimi-K2. Our experiments demonstrate that while frontier models show promise in code-driven research tasks, they struggle with fragile algorithm-related tasks and long-horizon decision making, such as impatience, poor resource management, and overreliance on template-based reasoning. Furthermore, agents require over 11 hours to achieve their best performance on InnovatorBench, underscoring the benchmark's difficulty and showing the potential of InnovatorBench to be the next generation of code-based research benchmark.
♻ ☆ Interaction as Intelligence Part II: Asynchronous Human-Agent Rollout for Long-Horizon Task Training
Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation. However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods primarily fall into two categories. The first relies on dense human annotations through behavior cloning, which is prohibitively expensive for long-horizon tasks that can take days or months. The second depends on outcome-driven sampling, which often collapses due to the rarity of valid positive trajectories on domain-specialized tasks. We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo allows them to intervene only when the agent drifts from a promising trajectory, by providing prior knowledge, strategic advice, etc. This lightweight design makes it possible to sustain interactions for over 30 hours and produces valuable trajectories at a lower cost. Apollo then applies supervision control to filter out sub-optimal actions and prevent error propagation. Together, these components enable reliable and effective data collection in long-horizon environments. To demonstrate the effectiveness of Apollo, we evaluate it using InnovatorBench. Our experiments show that when applied to train the GLM-4.5 model on InnovatorBench, Apollo achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo's design in handling long-horizon, domain-specialized tasks.
♻ ☆ Where to Search: Measure the Prior-Structured Search Space of LLM Agents
The generate-filter-refine (iterative paradigm) based on large language models (LLMs) has achieved progress in reasoning, programming, and program discovery in AI+Science. However, the effectiveness of search depends on where to search, namely, how to encode the domain prior into an operationally structured hypothesis space. To this end, this paper proposes a compact formal theory that describes and measures LLM-assisted iterative search guided by domain priors. We represent an agent as a fuzzy relation operator on inputs and outputs to capture feasible transitions; the agent is thereby constrained by a fixed safety envelope. To describe multi-step reasoning/search, we weight all reachable paths by a single continuation parameter and sum them to obtain a coverage generating function; this induces a measure of reachability difficulty; and it provides a geometric interpretation of search on the graph induced by the safety envelope. We further provide the simplest testable inferences and validate them via two instantiation. This theory offers a workable language and operational tools to measure agents and their search spaces, proposing a systematic formal description of iterative search constructed by LLMs.
comment: 11 pages, 4 figures, 1 table
♻ ☆ Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context
We propose a novel approach to improve action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and previous actions, to reflect the action's temporal context. To achieve this, we introduce a transformer architecture tailored for action recognition that employs both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function to simultaneously train the model for both coarse- and fine-grained action recognition, effectively exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset by incorporating action hierarchies, resulting in the Hierarchical TSU dataset, a hierarchical dataset designed for monitoring activities of the elderly in home environments. An ablation study assesses the performance impact of different strategies for integrating contextual and hierarchical data. Experimental results demonstrate that the proposed method consistently outperforms SOTA methods on the Hierarchical TSU dataset, Assembly101 and IkeaASM, achieving over a 17% improvement in top-1 accuracy.
♻ ☆ Follow the Energy, Find the Path: Riemannian Metrics from Energy-Based Models
What is the shortest path between two data points lying in a high-dimensional space? While the answer is trivial in Euclidean geometry, it becomes significantly more complex when the data lies on a curved manifold -- requiring a Riemannian metric to describe the space's local curvature. Estimating such a metric, however, remains a major challenge in high dimensions. In this work, we propose a method for deriving Riemannian metrics directly from pretrained Energy-Based Models (EBMs) -- a class of generative models that assign low energy to high-density regions. These metrics define spatially varying distances, enabling the computation of geodesics -- shortest paths that follow the data manifold's intrinsic geometry. We introduce two novel metrics derived from EBMs and show that they produce geodesics that remain closer to the data manifold and exhibit lower curvature distortion, as measured by alignment with ground-truth trajectories. We evaluate our approach on increasingly complex datasets: synthetic datasets with known data density, rotated character images with interpretable geometry, and high-resolution natural images embedded in a pretrained VAE latent space. Our results show that EBM-derived metrics consistently outperform established baselines, especially in high-dimensional settings. Our work is the first to derive Riemannian metrics from EBMs, enabling data-aware geodesics and unlocking scalable, geometry-driven learning for generative modeling and simulation.
♻ ☆ Low-Rank Adaptation for Foundation Models: A Comprehensive Review
The rapid advancement of foundation modelslarge-scale neural networks trained on diverse, extensive datasetshas revolutionized artificial intelligence, enabling unprecedented advancements across domains such as natural language processing, computer vision, and scientific discovery. However, the substantial parameter count of these models, often reaching billions or trillions, poses significant challenges in adapting them to specific downstream tasks. Low-Rank Adaptation (LoRA) has emerged as a highly promising approach for mitigating these challenges, offering a parameter-efficient mechanism to fine-tune foundation models with minimal computational overhead. This survey provides the first comprehensive review of LoRA techniques beyond large Language Models to general foundation models, including recent techniques foundations, emerging frontiers and applications of low-rank adaptation across multiple domains. Finally, this survey discusses key challenges and future research directions in theoretical understanding, scalability, and robustness. This survey serves as a valuable resource for researchers and practitioners working with efficient foundation model adaptation.
♻ ☆ Memory Assisted LLM for Personalized Recommendation System
Large language models (LLMs) have demonstrated significant potential in solving recommendation tasks. With proven capabilities in understanding user preferences, LLM personalization has emerged as a critical area for providing tailored responses to individuals. Current studies explore personalization through prompt design and fine-tuning, paving the way for further research in personalized LLMs. However, existing approaches are either costly and inefficient in capturing diverse user preferences or fail to account for timely updates to user history. To address these gaps, we propose the Memory-Assisted Personalized LLM (MAP). Through user interactions, we first create a history profile for each user, capturing their preferences, such as ratings for historical items. During recommendation, we extract relevant memory based on similarity, which is then incorporated into the prompts to enhance personalized recommendations. In our experiments, we define a new task that enables testing with varying memory size under two scenarios: single domain where memory and tasks are from the same category and cross-domain (e.g. memory from movies and recommendation tasks in books). The results show that MAP outperforms regular LLM-based recommenders that integrate user history directly through prompt design. Moreover, as user history grows, MAP's advantage increases in both scenarios, making it more suitable for addressing successive personalized user requests.
comment: 8 pages, 7 figures
♻ ☆ A Self-Evolving AI Agent System for Climate Science
Scientific progress in Earth science depends on integrating data across the planet's interconnected spheres. However, the accelerating volume and fragmentation of multi-sphere knowledge and data have surpassed human analytical capacity. This creates a major bottleneck for discovery, especially in climate science. To address this challenge, we introduce EarthLink, the first self-evolving AI agent system designed as an interactive "copilot" for Earth scientists. Through natural language interaction, EarthLink automates the entire research workflow by integrating planning, code execution, data analysis, and physical reasoning into a unified process that directly addresses this limitation. Beyond efficiency, it exhibits human-like cross-disciplinary analytical ability and achieves proficiency comparable to a junior researcher in expert evaluations on core large-scale climate tasks, including model-observation comparison and climate change understanding. When tasked with an open scientific problem, specifically the discovery of precursors of the Atlantic Ni\~no, EarthLink autonomously developed a research strategy, identified sources of predictability, verified its hypotheses with available data, and proposed a physically consistent mechanism. These emerging capabilities enable a new human-AI research paradigm. Scientists can focus on value and result judgments, while AI systems handle complex data analysis and knowledge integration. This accelerates the pace and breadth of discovery in Earth sciences. The system is accessible at our website https://earthlink.intern-ai.org.cn.
♻ ☆ Generative AI and Empirical Software Engineering: A Paradigm Shift
The adoption of large language models (LLMs) and autonomous agents in software engineering marks an enduring paradigm shift. These systems create new opportunities for tool design, workflow orchestration, and empirical observation, while fundamentally reshaping the roles of developers and the artifacts they produce. Although traditional empirical methods remain central to software engineering research, the rapid evolution of AI introduces new data modalities, alters causal assumptions, and challenges foundational constructs such as "developer", "artifact", and "interaction". As humans and AI agents increasingly co-create, the boundaries between social and technical actors blur, and the reproducibility of findings becomes contingent on model updates and prompt contexts. This vision paper examines how the integration of LLMs into software engineering disrupts established research paradigms. We discuss how it transforms the phenomena we study, the methods and theories we rely on, the data we analyze, and the threats to validity that arise in dynamic AI-mediated environments. Our aim is to help the empirical software engineering community adapt its questions, instruments, and validation standards to a future in which AI systems are not merely tools, but active collaborators shaping software engineering and its study.
comment: Published at 2nd IEEE/ACM International Conference on AI-powered Software (AIware 2025)
♻ ☆ Teaching According to Talents! Instruction Tuning LLMs with Competence-Aware Curriculum Learning EMNLP 2025
Efficient instruction tuning aims to enhance the ultimate performance of large language models (LLMs) trained on a given instruction dataset. Curriculum learning as a typical data organization strategy has shown preliminary effectiveness in instruction tuning. However, current curriculum tuning methods suffer from the curriculum rigidity, since they rely solely on static heuristic difficulty metrics. These methods fail to adapt to the evolving capabilities of models during training, resulting in a fixed and potentially sub-optimal learning trajectory. To address the issue, Competence-Aware Multi-Perspective cUrriculum inStruction tuning framework termed CAMPUS is proposed. CAMPUS offers several advantages: (1) Dynamic selection for sub-curriculum. (2) Competency-aware adjustment to the curriculum schedule. (3) Multiple difficulty-based scheduling. Extensive experiments prove the superior performance of CAMPUS, compared to other state-of-the-art baselines for efficient instruction tuning.
comment: EMNLP 2025 Findings
♻ ☆ Localist LLMs -- A Mathematical Framework for Dynamic Locality Control
We present a novel framework for training large language models with continuously adjustable internal representations that span the full spectrum from localist (interpretable, rule-based) to distributed (generalizable, efficient) encodings. The key innovation is a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining. This is achieved through group sparsity penalties on attention mechanisms, information-theoretic anchor design, and dynamic rule injection. We provide rigorous mathematical proofs establishing explicit threshold conditions under which attention provably concentrates on semantically relevant blocks, with exponential bounds on attention entropy and pointer fidelity. Specifically, we prove that when group sparsity penalties exceed certain threshold values, the model's attention mechanisms concentrate on semantically relevant blocks, achieving low entropy and high fidelity with negligible error. This framework enables practitioners to continuously interpolate between interpretable and high-performance modes, supporting applications in regulated domains requiring both transparency and capability.
♻ ☆ Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding NeurIPS 2025
Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery (DVD) agent to leverage an agentic search strategy over segmented video clips. Unlike previous video agents that rely on predefined workflows applied uniformly across different queries, our approach emphasizes the autonomous and adaptive nature of agents. By providing a set of search-centric tools on multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLM to plan on its current observation state, strategically selects tools to orchestrate adaptive workflow for different queries in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates our advantage. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of 74.2%, which substantially surpasses all prior works, and further improves to 76.0% with transcripts. The code has been released at https://github.com/microsoft/DeepVideoDiscovery.
comment: Accepted to NeurIPS 2025
♻ ☆ MARFT: Multi-Agent Reinforcement Fine-Tuning
LLM-based Multi-Agent Systems have demonstrated remarkable capabilities in addressing complex, agentic tasks, from generating high-quality presentation slides to even conducting sophisticated scientific research. Meanwhile, RL has been widely recognized for its effectiveness in enhancing agent intelligence, but limited research has investigated the fine-tuning of LaMAS using foundational RL techniques. Moreover, the direct application of MARL methods to LaMAS introduces significant challenges, stemming from the unique characteristics and mechanisms inherent to LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes a novel paradigm termed Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce a brand-new MG called Flex-MG, which aligns with the LaMAS optimization in real-world applications and a universal algorithmic framework tailored specifically for LaMAS, outlining the conceptual foundations, key distinctions, and practical implementation strategies. We review the evolution from RL to RFT, setting the stage for a parallel analysis in the multi-agent domain. In the context of LaMAS, we elucidate critical differences between MARL and MARFT. These differences motivate a transition toward a LaMAS-oriented formulation of RFT. Central to this work is a robust and scalable MARFT framework. We detail the core algorithm and provide a complete, open-source implementation to facilitate adoption and further research. The latter sections of the paper explore real-world application perspectives and opening challenges in MARFT. By bridging theoretical underpinnings with practical methodologies, this work serves as a roadmap for researchers seeking to advance MARFT toward resilient and adaptive solutions in agentic systems. Our implementation of the proposed framework is publicly available at: https://github.com/jwliao-ai/MARFT.
comment: 42 pages
♻ ☆ MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts
LLMs hold great promise for healthcare applications, but the rapid evolution of medical knowledge and errors in training data often cause them to generate outdated or inaccurate information, limiting their applicability in high-stakes clinical practice. Model editing has emerged as a potential remedy without full retraining. While parameter-based editing often compromises locality and is thus ill-suited for the medical domain, retrieval-based editing offers a more viable alternative. However, it still faces two critical challenges: (1) representation overlap within the medical knowledge space often causes inaccurate retrieval and reduces editing accuracy; (2) existing methods are restricted to single-sample edits, while batch-editing remains largely unexplored despite its importance for real-world medical applications. To address these challenges, we first construct MedVersa, an enhanced benchmark with broader coverage of medical subjects, designed to evaluate both single and batch edits under strict locality constraints. We then propose MedREK, a retrieval-based editing framework that integrates a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. Experimental results on various medical benchmarks demonstrate that our MedREK achieves superior performance across different core metrics and provides the first validated solution for batch-editing in medical LLMs. Our code and dataset are available at https://github.com/mylittleriver/MedREK.
comment: Preprint, work in progress
♻ ☆ Enhancing Reasoning Abilities of Small LLMs with Cognitive Alignment
The reasoning capabilities of large reasoning models (LRMs), such as OpenAI's o1 and DeepSeek-R1, have seen substantial advancements through deep thinking. However, these enhancements come with significant resource demands, underscoring the need for training effective small reasoning models. A critical challenge is that small models possess different reasoning capacities and cognitive trajectories compared with their larger counterparts. Hence, directly distilling chain-of-thought (CoT) rationales from large LRMs to smaller ones can sometimes be ineffective and often requires a substantial amount of annotated data. In this paper, we first introduce a novel Critique-Rethink-Verify (CRV) system, designed for training smaller yet powerful LRMs. Our CRV system consists of multiple LLM agents, each specializing in unique tasks: (i) critiquing the CoT rationales according to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. Building on the CRV system, we further propose the Cognitive Preference Optimization (CogPO) algorithm to continuously enhance the reasoning abilities of smaller models by aligning their reasoning processes with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of our CRV+CogPO framework, which outperforms other methods by a large margin.
comment: emnlp 2025 main conference
♻ ☆ Limits of Safe AI Deployment: Differentiating Oversight and Control
Oversight and control, which we collectively call supervision, are often discussed as ways to ensure that AI systems are accountable, reliable, and able to fulfill governance and management requirements. However, the requirements for "human oversight" risk codifying vague or inconsistent interpretations of key concepts like oversight and control. This ambiguous terminology could undermine efforts to design or evaluate systems that must operate under meaningful human supervision. This matters because the term is used by regulatory texts such as the EU AI Act. This paper undertakes a targeted critical review of literature on supervision outside of AI, along with a brief summary of past work on the topic related to AI. We next differentiate control as ex-ante or real-time and operational rather than policy or governance, and oversight as performed ex-post, or a policy and governance function. Control aims to prevent failures, while oversight focuses on detection, remediation, or incentives for future prevention. Building on this, we make three contributions. 1) We propose a framework to align regulatory expectations with what is technically and organizationally plausible, articulating the conditions under which each mechanism is possible, where they fall short, and what is required to make them meaningful in practice. 2) We outline how supervision methods should be documented and integrated into risk management, and drawing on the Microsoft Responsible AI Maturity Model, we outline a maturity model for AI supervision. 3) We explicitly highlight boundaries of these mechanisms, including where they apply, where they fail, and where it is clear that no existing methods suffice. This foregrounds the question of whether meaningful supervision is possible in a given deployment context, and can support regulators, auditors, and practitioners in identifying both present and future limitations.
comment: Revised to improve table formatting and update draft
♻ ☆ Language-Driven Coordination and Learning in Multi-Agent Simulation Environments
This paper introduces LLM-MARL, a unified framework that incorporates large language models (LLMs) into multi-agent reinforcement learning (MARL) to enhance coordination, communication, and generalization in simulated game environments. The framework features three modular components of Coordinator, Communicator, and Memory, which dynamically generate subgoals, facilitate symbolic inter-agent messaging, and support episodic recall. Training combines PPO with a language-conditioned loss and LLM query gating. LLM-MARL is evaluated in Google Research Football, MAgent Battle, and StarCraft II. Results show consistent improvements over MAPPO and QMIX in win rate, coordination score, and zero-shot generalization. Ablation studies demonstrate that subgoal generation and language-based messaging each contribute significantly to performance gains. Qualitative analysis reveals emergent behaviors such as role specialization and communication-driven tactics. By bridging language modeling and policy learning, this work contributes to the design of intelligent, cooperative agents in interactive simulations. It offers a path forward for leveraging LLMs in multi-agent systems used for training, games, and human-AI collaboration.
♻ ☆ DeepHQ: Learned Hierarchical Quantizer for Progressive Deep Image Coding
Unlike fixed- or variable-rate image coding, progressive image coding (PIC) aims to compress various qualities of images into a single bitstream, increasing the versatility of bitstream utilization and providing high compression efficiency compared to simulcast compression. Research on neural network (NN)-based PIC is in its early stages, mainly focusing on applying varying quantization step sizes to the transformed latent representations in a hierarchical manner. These approaches are designed to compress only the progressively added information as the quality improves, considering that a wider quantization interval for lower-quality compression includes multiple narrower sub-intervals for higher-quality compression. However, the existing methods are based on handcrafted quantization hierarchies, resulting in sub-optimal compression efficiency. In this paper, we propose an NN-based progressive coding method that firstly utilizes learned quantization step sizes via learning for each quantization layer. We also incorporate selective compression with which only the essential representation components are compressed for each quantization layer. We demonstrate that our method achieves significantly higher coding efficiency than the existing approaches with decreased decoding time and reduced model size. The source code is publicly available at https://github.com/JooyoungLeeETRI/DeepHQ
comment: Accepted to ACM TOMM (2025)
♻ ☆ Flight Delay Prediction via Cross-Modality Adaptation of Large Language Models and Aircraft Trajectory Representation
Flight delay prediction has become a key focus in air traffic management, as delays highlight inefficiencies that impact overall network performance. This paper presents a lightweight large language model-based multimodal flight delay prediction, formulated from the perspective of air traffic controllers monitoring aircraft delay after entering the terminal area. The approach integrates trajectory representations with textual aeronautical information, including flight information, weather reports, and aerodrome notices, by adapting trajectory data into the language modality to capture airspace conditions. The experiments show that the model consistently achieves sub-minute prediction error by effectively leveraging contextual information related to the sources of delay, fulfilling the operational standard for minute-level precision. The framework demonstrates that linguistic understanding, when combined with cross-modality adaptation of trajectory data, enhances delay prediction. Moreover, the approach shows practicality and potential scalability for real-world operations, supporting real-time updates that refine predictions upon receiving new operational information.
comment: Preprint submitted to Aerospace Science and Technology (Elsevier) for possible publication
♻ ☆ Epistemic Uncertainty for Generated Image Detection NeurIPS 2025
We introduce a novel framework for AI-generated image detection through epistemic uncertainty, aiming to address critical security concerns in the era of generative models. Our key insight stems from the observation that distributional discrepancies between training and testing data manifest distinctively in the epistemic uncertainty space of machine learning models. In this context, the distribution shift between natural and generated images leads to elevated epistemic uncertainty in models trained on natural images when evaluating generated ones. Hence, we exploit this phenomenon by using epistemic uncertainty as a proxy for detecting generated images. This converts the challenge of generated image detection into the problem of uncertainty estimation, underscoring the generalization performance of the model used for uncertainty estimation. Fortunately, advanced large-scale vision models pre-trained on extensive natural images have shown excellent generalization performance for various scenarios. Thus, we utilize these pre-trained models to estimate the epistemic uncertainty of images and flag those with high uncertainty as generated. Extensive experiments demonstrate the efficacy of our method. Code is available at https://github.com/tmlr-group/WePe.
comment: 28 pages, 10 figures, NeurIPS 2025
♻ ☆ Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization
Cross-view geo-localization (CVGL) between drone and satellite imagery remains challenging due to severe viewpoint gaps and the presence of hard negatives, which are visually similar but geographically mismatched samples. Existing mining or reweighting strategies often use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence. In this paper, we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy. At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. At the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a training-progress signal to attenuate noisy gradients during early optimization and progressively enhance hard-negative mining as training matures. Experiments on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness and robustness of the proposed DPHR, achieving consistent improvements over state-of-the-art methods.
comment: 5 pages, 3 figures
♻ ☆ Multi-Focused Video Group Activities Hashing
With the explosive growth of video data in various complex scenarios, quickly retrieving group activities has become an urgent problem. However, many tasks can only retrieve videos focusing on an entire video, not the activity granularity. To solve this problem, we propose a new STVH (spatiotemporal interleaved video hashing) technique for the first time. Through a unified framework, the STVH simultaneously models individual object dynamics and group interactions, capturing the spatiotemporal evolution on both group visual features and positional features. Moreover, in real-life video retrieval scenarios, it may sometimes require activity features, while at other times, it may require visual features of objects. We then further propose a novel M-STVH (multi-focused spatiotemporal video hashing) as an enhanced version to handle this difficult task. The advanced method incorporates hierarchical feature integration through multi-focused representation learning, allowing the model to jointly focus on activity semantics features and object visual features. We conducted comparative experiments on publicly available datasets, and both STVH and M-STVH can achieve excellent results.
♻ ☆ Complex QA and language models hybrid architectures, Survey
This paper reviews the state-of-the-art of large language models (LLM) architectures and strategies for "complex" question-answering with a focus on hybrid architectures. LLM based chatbot services have allowed anyone to grasp the potential of LLM to solve many common problems, but soon discovered their limitations for complex questions. Addressing more specific, complex questions (e.g., "What is the best mix of power-generation methods to reduce climate change ?") often requires specialized architectures, domain knowledge, new skills, decomposition and multi-step resolution, deep reasoning, sensitive data protection, explainability, and human-in-the-loop processes. Therefore, we review: (1) necessary skills and tasks for handling complex questions and common LLM limits to overcome; (2) dataset, cost functions and evaluation metrics for measuring and improving (e.g. accuracy, explainability, fairness, robustness, groundedness, faithfulness, toxicity...); (3) family of solutions to overcome LLM limitations by (a) training and reinforcement (b) hybridization, (c) prompting, (d) agentic-architectures (agents, tools) and extended reasoning.
♻ ☆ EgoBlind: Towards Egocentric Visual Assistance for the Blind NeurIPS'25
We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,392 first-person videos from the daily lives of blind and visually impaired individuals. It also features 5,311 questions directly posed or verified by the blind to reflect their in-situation needs for visual assistance. Each question has an average of 3 manually annotated reference answers to reduce subjectiveness. Using EgoBlind, we comprehensively evaluate 16 advanced MLLMs and find that all models struggle. The best performers achieve an accuracy near 60\%, which is far behind human performance of 87.4\%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and explore heuristic solutions for improvement. With these efforts, we hope that EgoBlind will serve as a foundation for developing effective AI assistants to enhance the independence of the blind and visually impaired. Data and code are available at https://github.com/doc-doc/EgoBlind.
comment: NeurIPS'25 (D&B Track)
♻ ☆ Accelerating Volumetric Medical Image Annotation via Short-Long Memory SAM 2
Manual annotation of volumetric medical images, such as magnetic resonance imaging (MRI) and computed tomography (CT), is a labor-intensive and time-consuming process. Recent advancements in foundation models for video object segmentation, such as Segment Anything Model 2 (SAM 2), offer a potential opportunity to significantly speed up the annotation process by manually annotating one or a few slices and then propagating target masks across the entire volume. However, the performance of SAM 2 in this context varies. Our experiments show that relying on a single memory bank and attention module is prone to error propagation, particularly at boundary regions where the target is present in the previous slice but absent in the current one. To address this problem, we propose Short-Long Memory SAM 2 (SLM-SAM 2), a novel architecture that integrates distinct short-term and long-term memory banks with separate attention modules to improve segmentation accuracy. We evaluate SLM-SAM 2 on four public datasets covering organs, bones, and muscles across MRI, CT, and ultrasound videos. We show that the proposed method markedly outperforms the default SAM 2, achieving an average Dice Similarity Coefficient improvement of 0.14 and 0.10 in the scenarios when 5 volumes and 1 volume are available for the initial adaptation, respectively. SLM-SAM 2 also exhibits stronger resistance to over-propagation, reducing the time required to correct propagated masks by 60.575% per volume compared to SAM 2, making a notable step toward more accurate automated annotation of medical images for segmentation model development.
comment: Accepted for publication in IEEE Transactions on Medical Imaging (IEEE TMI)
♻ ☆ Bridging Symmetry and Robustness: On the Role of Equivariance in Enhancing Adversarial Robustness NeurIPS 2025
Adversarial examples reveal critical vulnerabilities in deep neural networks by exploiting their sensitivity to imperceptible input perturbations. While adversarial training remains the predominant defense strategy, it often incurs significant computational cost and may compromise clean-data accuracy. In this work, we investigate an architectural approach to adversarial robustness by embedding group-equivariant convolutions-specifically, rotation- and scale-equivariant layers-into standard convolutional neural networks (CNNs). These layers encode symmetry priors that align model behavior with structured transformations in the input space, promoting smoother decision boundaries and greater resilience to adversarial attacks. We propose and evaluate two symmetry-aware architectures: a parallel design that processes standard and equivariant features independently before fusion, and a cascaded design that applies equivariant operations sequentially. Theoretically, we demonstrate that such models reduce hypothesis space complexity, regularize gradients, and yield tighter certified robustness bounds under the CLEVER (Cross Lipschitz Extreme Value for nEtwork Robustness) framework. Empirically, our models consistently improve adversarial robustness and generalization across CIFAR-10, CIFAR-100, and CIFAR-10C under both FGSM and PGD attacks, without requiring adversarial training. These findings underscore the potential of symmetry-enforcing architectures as efficient and principled alternatives to data augmentation-based defenses.
comment: Accepted for the proceedings of 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Computational Basis of LLM's Decision Making in Social Simulation
Large language models (LLMs) increasingly serve as human-like decision-making agents in social science and applied settings. These LLM-agents are typically assigned human-like characters and placed in real-life contexts. However, how these characters and contexts shape an LLM's behavior remains underexplored. This study proposes and tests methods for probing, quantifying, and modifying an LLM's internal representations in a Dictator Game -- a classic behavioral experiment on fairness and prosocial behavior. We extract "vectors of variable variations" (e.g., "male" to "female") from the LLM's internal state. Manipulating these vectors during the model's inference can substantially alter how those variables relate to the model's decision-making. This approach offers a principled way to study and regulate how social concepts can be encoded and engineered within transformer-based models, with implications for alignment, debiasing, and designing AI agents for social simulations in both academic and commercial applications, strengthening sociological theory and measurement.
♻ ☆ Damper-B-PINN: Damper Characteristics-Based Bayesian Physics-Informed Neural Network for Vehicle State Estimation
Accurate state estimation is fundamental to intelligent vehicles. Wheel load, one of the most important chassis states, serves as an essential input for advanced driver assistance systems (ADAS) and exerts a direct influence on vehicle stability and safety. However, wheel load estimation remains challenging due to the complexity of chassis modeling and the susceptibility of nonlinear systems to noise. To address these issues, this paper first introduces a refined suspension linkage-level modeling approach that constructs a nonlinear instantaneous dynamic model by explicitly considering the complex geometric structure of the suspension. Building upon this, we propose a damper characteristics-based Bayesian physics-informed neural network (Damper-B-PINN) framework to estimate dynamic wheel load, which leverages the suspension dynamics as physical guidance of PINN while employing Bayesian inference to mitigate the effects of system noise and uncertainty. Moreover, a damper-characteristic physics conditioning (DPC) module is designed for embedding physical prior. The proposed Damper-B-PINN is evaluated using both high-fidelity simulation datasets generated by CarSim software and real-world datasets collected from a Formula Student race car. Experimental results demonstrate that our Damper-B-PINN consistently outperforms existing methods across various test conditions, particularly extreme ones. These findings highlight the potential of the proposed Damper-B-PINN framework to enhance the accuracy and robustness of dynamic wheel load estimation, thereby improving the reliability and safety of ADAS applications.
♻ ☆ Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound
Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully-supervised learning, weakly-supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. In this study, we introduce a novel multi-agent reinforcement learning-based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel-based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents' erasing process precisely, thereby avoiding both under- and over-segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in-house BUS and ABUS datasets, our Flip Learning method outperforms state-of-the-art WSS methods and foundation models, and achieves comparable performance as fully-supervised learning algorithms.
comment: Accepted by Medical Image Analysis. 24 pages, 13 figures, 20 tabels
♻ ☆ Prevailing Research Areas for Music AI in the Era of Foundation Models
Parallel to rapid advancements in foundation model research, the past few years have witnessed a surge in music AI applications. As AI-generated and AI-augmented music become increasingly mainstream, many researchers in the music AI community may wonder: what research frontiers remain unexplored? This paper outlines several key areas within music AI research that present significant opportunities for further investigation. We begin by examining foundational representation models and highlight emerging efforts toward explainability and interpretability. We then discuss the evolution toward multimodal systems, provide an overview of the current landscape of music datasets and their limitations, and address the growing importance of model efficiency in both training and deployment. Next, we explore applied directions, focusing first on generative models. We review recent systems, their computational constraints, and persistent challenges related to evaluation and controllability. We then examine extensions of these generative approaches to multimodal settings and their integration into artists' workflows, including applications in music editing, captioning, production, transcription, source separation, performance, discovery, and education. Finally, we explore copyright implications of generative music and propose strategies to safeguard artist rights. While not exhaustive, this survey aims to illuminate promising research directions enabled by recent developments in music foundation models.
♻ ☆ The Digital Ecosystem of Beliefs: does evolution favour AI over humans?
As AI systems are integrated into social networks, there are AI safety concerns that AI-generated content may dominate the web, e.g. in popularity or impact on beliefs. To understand such questions, this paper proposes the Digital Ecosystem of Beliefs (Digico), the first evolutionary framework for controlled experimentation with multi-population interactions in simulated social networks. Following a Universal Darwinism approach, the framework models a population of agents which change their messaging strategies due to evolutionary updates. They interact via messages, update their beliefs following a contagion model, and maintain their beliefs through cognitive Lamarckian inheritance. Initial experiments with Digico implement two types of agents, which are modelled to represent AIs vs humans based on higher rates of communication, higher rates of evolution, seeding fixed beliefs with propaganda aims, and higher influence on the recommendation algorithm. These experiments show that: a) when AIs have faster messaging, evolution, and more influence on the recommendation algorithm, they get 80% to 95% of the views; b) AIs designed for propaganda can typically convince 50% of humans to adopt extreme beliefs, and up to 85% when agents believe only a limited number of channels; c) a penalty for content that violates agents' beliefs reduces propaganda effectiveness up to 8%. We further discuss Digico as a tool for systematic experimentation across multi-agent configurations, the implications for legislation, personal use, and platform design, and the use of Digico for studying evolutionary principles.
♻ ☆ DMVFC: Deep Learning Based Functionally Consistent Tractography Fiber Clustering Using Multimodal Diffusion MRI and Functional MRI
Tractography fiber clustering using diffusion MRI (dMRI) is a crucial method for white matter (WM) parcellation to enable analysis of brains structural connectivity in health and disease. Current fiber clustering strategies primarily use the fiber geometric characteristics (i.e., the spatial trajectories) to group similar fibers into clusters, while neglecting the functional and microstructural information of the fiber tracts. There is increasing evidence that neural activity in the WM can be measured using functional MRI (fMRI), providing potentially valuable multimodal information for fiber clustering to enhance its functional coherence. Furthermore, microstructural features such as fractional anisotropy (FA) can be computed from dMRI as additional information to ensure the anatomical coherence of the clusters. In this paper, we develop a novel deep learning fiber clustering framework, namely Deep Multi-view Fiber Clustering (DMVFC), which uses joint multi-modal dMRI and fMRI data to enable functionally consistent WM parcellation. DMVFC can effectively integrate the geometric and microstructural characteristics of the WM fibers with the fMRI BOLD signals along the fiber tracts. DMVFC includes two major components: (1) a multi-view pretraining module to compute embedding features from each source of information separately, including fiber geometry, microstructure measures, and functional signals, and (2) a collaborative fine-tuning module to simultaneously refine the differences of embeddings. In the experiments, we compare DMVFC with two state-of-the-art fiber clustering methods and demonstrate superior performance in achieving functionally meaningful and consistent WM parcellation results.
comment: 14 pages
♻ ☆ DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation
The promise of LLM watermarking rests on a core assumption that a specific watermark proves authorship by a specific model. We demonstrate that this assumption is dangerously flawed. We introduce the threat of watermark spoofing, a sophisticated attack that allows a malicious model to generate text containing the authentic-looking watermark of a trusted, victim model. This enables the seamless misattribution of harmful content, such as disinformation, to reputable sources. The key to our attack is repurposing watermark radioactivity, the unintended inheritance of data patterns during fine-tuning, from a discoverable trait into an attack vector. By distilling knowledge from a watermarked teacher model, our framework allows an attacker to steal and replicate the watermarking signal of the victim model. This work reveals a critical security gap in text authorship verification and calls for a paradigm shift towards technologies capable of distinguishing authentic watermarks from expertly imitated ones. Our code is available at https://github.com/hsannn/ditto.git.
comment: 14 pages, 4 figures, preprint
♻ ☆ Mapping Overlaps in Benchmarks through Perplexity in the Wild
We develop signatures of capacity familiarity to characterize large language model (LLM) benchmarks and their meaningful overlaps. Benchmark signatures probe the capacity required for benchmark performance. We formally define them as a set of salient tokens drawn from in-the-wild, naturally authored corpora, where LLM token perplexity, reflecting more or less pre-training exposure, becomes highly predictive of LLM benchmark performance. Through a large-scale meta-evaluation, we extract benchmark signatures via stepwise forward selection with linear regressions across 32 LLMs and 88 benchmarks spanning diverse knowledge, coding, logic, instruction following, math, language, reasoning, and world modeling. Our analysis situates signatures in relation to both the semantic similarity of benchmark questions and the correlation of model performance. While performance overlaps are universally high and semantic overlaps remain confined to a narrow mid-range, benchmark signatures prove highly informative in capturing variation, overlap, and divergence. We observe overlap in knowledge and reasoning subtasks, whereas multilingual and cultural benchmarks exhibit less similarity, even compared to cross-task overlap. Notably, performance-level results are strongly influenced by benchmark-orthogonal factors such as question format, highlighting limitations in LLM generalization, the conflation of performance with ability, and issues inherent in current mainstream benchmark agreement studies. Benchmark signatures, however, remain robust to such effects. Ultimately, we identify cross-functional overlaps across logic, math, language, instruction following, and world modeling, with coding emerging as the least overlapping domain. Together, these findings provide mechanistic insights into benchmark validity and LLM sensitivities, and sketch the underlying landscape of interconnected LLM capabilities.
♻ ☆ From Superficial Outputs to Superficial Learning: Risks of Large Language Models in Education
Large Language Models (LLMs) are transforming education by enabling personalization, feedback, and knowledge access, while also raising concerns about risks to students and learning systems. Yet empirical evidence on these risks remains fragmented. This paper presents a systematic review of 70 empirical studies across computer science, education, and psychology. Guided by four research questions, we examine: (i) which applications of LLMs in education have been most frequently explored; (ii) how researchers have measured their impact; (iii) which risks stem from such applications; and (iv) what mitigation strategies have been proposed. We find that research on LLMs clusters around three domains: operational effectiveness, personalized applications, and interactive learning tools. Across these, model-level risks include superficial understanding, bias, limited robustness, anthropomorphism, hallucinations, privacy concerns, and knowledge constraints. When learners interact with LLMs, these risks extend to cognitive and behavioural outcomes, including reduced neural activity, over-reliance, diminished independent learning skills, and a loss of student agency. To capture this progression, we propose an LLM-Risk Adapted Learning Model that illustrates how technical risks cascade through interaction and interpretation to shape educational outcomes. As the first synthesis of empirically assessed risks, this review provides a foundation for responsible, human-centred integration of LLMs in education.
♻ ☆ Neighboring State-based Exploration for Reinforcement Learning
Reinforcement Learning is a powerful tool to model decision-making processes. However, it relies on an exploration-exploitation trade-off that remains an open challenge for many tasks. In this work, we study neighboring state-based, model-free exploration led by the intuition that, for an early-stage agent, considering actions derived from a bounded region of nearby states may lead to better actions when exploring. We propose two algorithms that choose exploratory actions based on a survey of nearby states, and find that one of our methods, ${\rho}$-explore, consistently outperforms the Double DQN baseline in an discrete environment by 49% in terms of Eval Reward Return.
♻ ☆ Dynamic Topic Evolution with Temporal Decay and Attention in Large Language Models
This paper proposes a modeling framework for dynamic topic evolution based on temporal large language models. The method first uses a large language model to obtain contextual embeddings of text and then introduces a temporal decay function and an attention mechanism. These components allow the model to adjust the importance of semantic units according to time intervals and capture topic variations across different periods. The temporal representations are then mapped into a latent topic space, where a state transition matrix is applied to describe the dynamic evolution of topics. A joint optimization objective constrains both semantic modeling and temporal consistency, ensuring diversity and smoothness in topic generation. The design emphasizes the unified modeling of semantic representation and temporal evolution, which improves topic coherence and diversity while enhancing stability and interpretability over time. Experiments on real-world corpora show that the framework effectively captures the generation, expansion, and decline of topics and outperforms existing models across multiple metrics. Overall, the proposed method provides a systematic solution for understanding dynamic semantic patterns in large-scale text, enriches the research paradigm of topic modeling, and supports complex text analysis tasks in multiple domains.
♻ ☆ OneCast: Structured Decomposition and Modular Generation for Cross-Domain Time Series Forecasting
Cross-domain time series forecasting is a valuable task in various web applications. Despite its rapid advancement, achieving effective generalization across heterogeneous time series data remains a significant challenge. Existing methods have made progress by extending single-domain models, yet often fall short when facing domain-specific trend shifts and inconsistent periodic patterns. We argue that a key limitation lies in treating temporal series as undifferentiated sequence, without explicitly decoupling their inherent structural components. To address this, we propose OneCast, a structured and modular forecasting framework that decomposes time series into seasonal and trend components, each modeled through tailored generative pathways. Specifically, the seasonal component is captured by a lightweight projection module that reconstructs periodic patterns via interpretable basis functions. In parallel, the trend component is encoded into discrete tokens at segment level via a semantic-aware tokenizer, and subsequently inferred through a masked discrete diffusion mechanism. The outputs from both branches are combined to produce a final forecast that captures seasonal patterns while tracking domain-specific trends. Extensive experiments across eight domains demonstrate that OneCast mostly outperforms state-of-the-art baselines.
♻ ☆ MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation EMNLP 2025
Multilingual speech translation (ST) and machine translation (MT) in the medical domain enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present the first systematic study on medical ST, to our best knowledge, by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, and Simplified/Traditional Chinese, together with the models. With 290,000 samples, this is the largest medical MT dataset and the largest many-to-many multilingual ST among all domains. Secondly, we present the most comprehensive ST analysis in the field's history, to our best knowledge, including: empirical baselines, bilingual-multilingual comparative study, end-to-end vs. cascaded comparative study, task-specific vs. multi-task sequence-to-sequence comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: https://github.com/leduckhai/MultiMed-ST
comment: EMNLP 2025
♻ ☆ FTSmartAudit: A Knowledge Distillation-Enhanced Framework for Automated Smart Contract Auditing Using Fine-Tuned LLMs
The rapid growth of blockchain technology has driven the widespread adoption of smart contracts. However, their inherent vulnerabilities have led to significant financial losses. Traditional auditing methods, while essential, struggle to keep pace with the increasing complexity and scale of smart contracts. Large Language Models (LLMs) offer promising capabilities for automating vulnerability detection, but their adoption is often limited by high computational costs. Although prior work has explored leveraging large models through agents or workflows, relatively little attention has been given to improving the performance of smaller, fine-tuned models--a critical factor for achieving both efficiency and data privacy. In this paper, we introduce HKT-SmartAudit, a framework for developing lightweight models optimized for smart contract auditing. It features a multi-stage knowledge distillation pipeline that integrates classical distillation, external domain knowledge, and reward-guided learning to transfer high-quality insights from large teacher models. A single-task learning strategy is employed to train compact student models that maintain high accuracy and robustness while significantly reducing computational overhead. Experimental results show that our distilled models outperform both commercial tools and larger models in detecting complex vulnerabilities and logical flaws, offering a practical, secure, and scalable solution for smart contract auditing. The source code is available at Github repository.
comment: 18 pages, 7 figures
♻ ☆ A Generalized Bisimulation Metric of State Similarity between Markov Decision Processes: From Theoretical Propositions to Applications NeurIPS 2025
The bisimulation metric (BSM) is a powerful tool for computing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to multiple-MDP scenarios, such as policy transfer, remains challenging. Prior work has attempted to generalize BSM to pairs of MDPs, but a lack of rigorous analysis of its mathematical properties has limited further theoretical progress. In this work, we formally establish a generalized bisimulation metric (GBSM) between pairs of MDPs, which is rigorously proven with the three fundamental properties: GBSM symmetry, inter-MDP triangle inequality, and the distance bound on identical state spaces. Leveraging these properties, we theoretically analyse policy transfer, state aggregation, and sampling-based estimation in MDPs, obtaining explicit bounds that are strictly tighter than those derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.
comment: This paper is accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Trustworthy AI Must Account for Interactions ICLR 2025
Trustworthy AI encompasses many aspirational aspects for aligning AI systems with human values, including fairness, privacy, robustness, explainability, and uncertainty quantification. Ultimately the goal of Trustworthy AI research is to achieve all aspects simultaneously. However, efforts to enhance one aspect often introduce unintended trade-offs that negatively impact others. In this position paper, we review notable approaches to these five aspects and systematically consider every pair, detailing the negative interactions that can arise. For example, applying differential privacy to model training can amplify biases, undermining fairness. Drawing on these findings, we take the position that current research practices of improving one or two aspects in isolation are insufficient. Instead, research on Trustworthy AI must account for interactions between aspects and adopt a holistic view across all relevant axes at once. To illustrate our perspective, we provide guidance on how practitioners can work towards integrated trust, examples of how interactions affect the financial industry, and alternative views.
comment: Presented at the ICLR 2025 Workshop on Bidirectional Human-AI Alignment
♻ ☆ Dataset Distillation for Offline Reinforcement Learning ICML 2024
Offline reinforcement learning often requires a quality dataset that we can train a policy on. However, in many situations, it is not possible to get such a dataset, nor is it easy to train a policy to perform well in the actual environment given the offline data. We propose using data distillation to train and distill a better dataset which can then be used for training a better policy model. We show that our method is able to synthesize a dataset where a model trained on it achieves similar performance to a model trained on the full dataset or a model trained using percentile behavioral cloning. Our project site is available at https://datasetdistillation4rl.github.io . We also provide our implementation at https://github.com/ggflow123/DDRL .
comment: ICML 2024 DMLR Workshop Our project site is available at https://datasetdistillation4rl.github.io We also provide our implementation at https://github.com/ggflow123/DDRL
♻ ☆ Where and How to Perturb: On the Design of Perturbation Guidance in Diffusion and Flow Models NeurIPS 2025
Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
comment: Accepted at NeurIPS 2025. Project page: https://cvlab-kaist.github.io/HeadHunter/
♻ ☆ Uncovering Representation Bias for Investment Decisions in Open-Source Large Language Models
Large Language Models are increasingly adopted in financial applications to support investment workflows. However, prior studies have seldom examined how these models reflect biases related to firm size, sector, or financial characteristics, which can significantly impact decision-making. This paper addresses this gap by focusing on representation bias in open-source Qwen models. We propose a balanced round-robin prompting method over approximately 150 U.S. equities, applying constrained decoding and token-logit aggregation to derive firm-level confidence scores across financial contexts. Using statistical tests and variance analysis, we find that firm size and valuation consistently increase model confidence, while risk factors tend to decrease it. Confidence varies significantly across sectors, with the Technology sector showing the greatest variability. When models are prompted for specific financial categories, their confidence rankings best align with fundamental data, moderately with technical signals, and least with growth indicators. These results highlight representation bias in Qwen models and motivate sector-aware calibration and category-conditioned evaluation protocols for safe and fair financial LLM deployment.
Machine Learning 162
☆ Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects
The rise of disaggregated AI GPUs has exposed a critical bottleneck in large-scale attention workloads: non-uniform memory access (NUMA). As multi-chiplet designs become the norm for scaling compute capabilities, memory latency and bandwidth vary sharply across compute regions, undermining the performance of traditional GPU kernel scheduling strategies that assume uniform memory access. We identify how these NUMA effects distort locality in multi-head attention (MHA) and present Swizzled Head-first Mapping, a spatially-aware scheduling strategy that aligns attention heads with GPU NUMA domains to exploit intra-chiplet cache reuse. On AMD's MI300X architecture, our method achieves up to 50% higher performance over state-of-the-art attention algorithms using conventional scheduling techniques and sustains consistently high L2 cache hit rates of 80-97%. These results demonstrate that NUMA-aware scheduling is now fundamental to achieving full efficiency on next-generation disaggregated GPUs, offering a path forward for scalable AI training and inference.
comment: 11 pages, 14 figures
☆ Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning NeurIPS 2025
We propose Re-FORC, an adaptive reward prediction method that, given a context, enables prediction of the expected future rewards as a function of the number of future thinking tokens. Re-FORC trains a lightweight adapter on reasoning models, demonstrating improved prediction with longer reasoning and larger models. Re-FORC enables: 1) early stopping of unpromising reasoning chains, reducing compute by 26% while maintaining accuracy, 2) optimized model and thinking length selection that achieves 4% higher accuracy at equal compute and 55% less compute at equal accuracy compared to the largest model, 3) adaptive test-time scaling, which increases accuracy by 11% in high compute regime, and 7% in low compute regime. Re-FORC allows dynamic reasoning with length control via cost-per-token thresholds while estimating computation time upfront.
comment: Accepted at Efficient Reasoning Workshop at NeurIPS 2025
☆ Variance-Aware Feel-Good Thompson Sampling for Contextual Bandits NeurIPS 2025
Variance-dependent regret bounds have received increasing attention in recent studies on contextual bandits. However, most of these studies are focused on upper confidence bound (UCB)-based bandit algorithms, while sampling based bandit algorithms such as Thompson sampling are still understudied. The only exception is the LinVDTS algorithm (Xu et al., 2023), which is limited to linear reward function and its regret bound is not optimal with respect to the model dimension. In this paper, we present FGTSVA, a variance-aware Thompson Sampling algorithm for contextual bandits with general reward function with optimal regret bound. At the core of our analysis is an extension of the decoupling coefficient, a technique commonly used in the analysis of Feel-good Thompson sampling (FGTS) that reflects the complexity of the model space. With the new decoupling coefficient denoted by $\mathrm{dc}$, FGTS-VA achieves the regret of $\tilde{O}(\sqrt{\mathrm{dc}\cdot\log|\mathcal{F}|\sum_{t=1}^T\sigma_t^2}+\mathrm{dc})$, where $|\mathcal{F}|$ is the size of the model space, $T$ is the total number of rounds, and $\sigma_t^2$ is the subgaussian norm of the noise (e.g., variance when the noise is Gaussian) at round $t$. In the setting of contextual linear bandits, the regret bound of FGTSVA matches that of UCB-based algorithms using weighted linear regression (Zhou and Gu, 2022).
comment: 19 pages, 2 figures, 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ Matrix Sensing with Kernel Optimal Loss: Robustness and Optimization Landscape
In this paper we study how the choice of loss functions of non-convex optimization problems affects their robustness and optimization landscape, through the study of noisy matrix sensing. In traditional regression tasks, mean squared error (MSE) loss is a common choice, but it can be unreliable for non-Gaussian or heavy-tailed noise. To address this issue, we adopt a robust loss based on nonparametric regression, which uses a kernel-based estimate of the residual density and maximizes the estimated log-likelihood. This robust formulation coincides with the MSE loss under Gaussian errors but remains stable under more general settings. We further examine how this robust loss reshapes the optimization landscape by analyzing the upper-bound of restricted isometry property (RIP) constants for spurious local minima to disappear. Through theoretical and empirical analysis, we show that this new loss excels at handling large noise and remains robust across diverse noise distributions. This work offers initial insights into enhancing the robustness of machine learning tasks through simply changing the loss, guided by an intuitive and broadly applicable analytical framework.
☆ Measuring the Intrinsic Dimension of Earth Representations
Within the context of representation learning for Earth observation, geographic Implicit Neural Representations (INRs) embed low-dimensional location inputs (longitude, latitude) into high-dimensional embeddings, through models trained on geo-referenced satellite, image or text data. Despite the common aim of geographic INRs to distill Earth's data into compact, learning-friendly representations, we lack an understanding of how much information is contained in these Earth representations, and where that information is concentrated. The intrinsic dimension of a dataset measures the number of degrees of freedom required to capture its local variability, regardless of the ambient high-dimensional space in which it is embedded. This work provides the first study of the intrinsic dimensionality of geographic INRs. Analyzing INRs with ambient dimension between 256 and 512, we find that their intrinsic dimensions fall roughly between 2 and 10 and are sensitive to changing spatial resolution and input modalities during INR pre-training. Furthermore, we show that the intrinsic dimension of a geographic INR correlates with downstream task performance and can capture spatial artifacts, facilitating model evaluation and diagnostics. More broadly, our work offers an architecture-agnostic, label-free metric of information content that can enable unsupervised evaluation, model selection, and pre-training design across INRs.
comment: Pre-print. 27 pages, 11 figures, 6 tables
☆ Geometric Data Valuation via Leverage Scores NeurIPS 2025
Shapley data valuation provides a principled, axiomatic framework for assigning importance to individual datapoints, and has gained traction in dataset curation, pruning, and pricing. However, it is a combinatorial measure that requires evaluating marginal utility across all subsets of the data, making it computationally infeasible at scale. We propose a geometric alternative based on statistical leverage scores, which quantify each datapoint's structural influence in the representation space by measuring how much it extends the span of the dataset and contributes to the effective dimensionality of the training problem. We show that our scores satisfy the dummy, efficiency, and symmetry axioms of Shapley valuation and that extending them to \emph{ridge leverage scores} yields strictly positive marginal gains that connect naturally to classical A- and D-optimal design criteria. We further show that training on a leverage-sampled subset produces a model whose parameters and predictive risk are within $O(\varepsilon)$ of the full-data optimum, thereby providing a rigorous link between data valuation and downstream decision quality. Finally, we conduct an active learning experiment in which we empirically demonstrate that ridge-leverage sampling outperforms standard baselines without requiring access gradients or backward passes.
comment: MLxOR: Mathematical Foundations and Operational Integration of Machine Learning for Uncertainty-Aware Decision-Making (NeurIPS 2025)
☆ Uncertainty Guided Online Ensemble for Non-stationary Data Streams in Fusion Science
Machine Learning (ML) is poised to play a pivotal role in the development and operation of next-generation fusion devices. Fusion data shows non-stationary behavior with distribution drifts, resulted by both experimental evolution and machine wear-and-tear. ML models assume stationary distribution and fail to maintain performance when encountered with such non-stationary data streams. Online learning techniques have been leveraged in other domains, however it has been largely unexplored for fusion applications. In this paper, we present an application of online learning to continuously adapt to drifting data stream for prediction of Toroidal Field (TF) coils deflection at the DIII-D fusion facility. The results demonstrate that online learning is critical to maintain ML model performance and reduces error by 80% compared to a static model. Moreover, traditional online learning can suffer from short-term performance degradation as ground truth is not available before making the predictions. As such, we propose an uncertainty guided online ensemble method to further improve the performance. The Deep Gaussian Process Approximation (DGPA) technique is leveraged for calibrated uncertainty estimation and the uncertainty values are then used to guide a meta-algorithm that produces predictions based on an ensemble of learners trained on different horizon of historical data. The DGPA also provides uncertainty estimation along with the predictions for decision makers. The online ensemble and the proposed uncertainty guided online ensemble reduces predictions error by about 6%, and 10% respectively over standard single model based online learning.
comment: 24 pages including total of references, 2 appendices, 7 Figures (5 in main article, 2 in appendix A)
☆ Natural Building Blocks for Structured World Models: Theory, Evidence, and Scaling
The field of world modeling is fragmented, with researchers developing bespoke architectures that rarely build upon each other. We propose a framework that specifies the natural building blocks for structured world models based on the fundamental stochastic processes that any world model must capture: discrete processes (logic, symbols) and continuous processes (physics, dynamics); the world model is then defined by the hierarchical composition of these building blocks. We examine Hidden Markov Models (HMMs) and switching linear dynamical systems (sLDS) as natural building blocks for discrete and continuous modeling--which become partially-observable Markov decision processes (POMDPs) and controlled sLDS when augmented with actions. This modular approach supports both passive modeling (generation, forecasting) and active control (planning, decision-making) within the same architecture. We avoid the combinatorial explosion of traditional structure learning by largely fixing the causal architecture and searching over only four depth parameters. We review practical expressiveness through multimodal generative modeling (passive) and planning from pixels (active), with performance competitive to neural approaches while maintaining interpretability. The core outstanding challenge is scalable joint structure-parameter learning; current methods finesse this by cleverly growing structure and parameters incrementally, but are limited in their scalability. If solved, these natural building blocks could provide foundational infrastructure for world modeling, analogous to how standardized layers enabled progress in deep learning.
comment: 13 pages, 3 figures, under review for World Modeling Workshop 2026
☆ LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS NeurIPS 2025
Contrast-Consistent Search (CCS) is an unsupervised probing method able to test whether large language models represent binary features, such as sentence truth, in their internal activations. While CCS has shown promise, its two-term objective has been only partially understood. In this work, we revisit CCS with the aim of clarifying its mechanisms and extending its applicability. We argue that what should be optimized for, is relative contrast consistency. Building on this insight, we reformulate CCS as an eigenproblem, yielding closed-form solutions with interpretable eigenvalues and natural extensions to multiple variables. We evaluate these approaches across a range of datasets, finding that they recover similar performance to CCS, while avoiding problems around sensitivity to random initialization. Our results suggest that relativizing contrast consistency not only improves our understanding of CCS but also opens pathways for broader probing and mechanistic interpretability methods.
comment: Accepted to the Mechanistic Interpretability Workshop at NeurIPS 2025
☆ Energy Loss Functions for Physical Systems NeurIPS 2025
Effectively leveraging prior knowledge of a system's physics is crucial for applications of machine learning to scientific domains. Previous approaches mostly focused on incorporating physical insights at the architectural level. In this paper, we propose a framework to leverage physical information directly into the loss function for prediction and generative modeling tasks on systems like molecules and spins. We derive energy loss functions assuming that each data sample is in thermal equilibrium with respect to an approximate energy landscape. By using the reverse KL divergence with a Boltzmann distribution around the data, we obtain the loss as an energy difference between the data and the model predictions. This perspective also recasts traditional objectives like MSE as energy-based, but with a physically meaningless energy. In contrast, our formulation yields physically grounded loss functions with gradients that better align with valid configurations, while being architecture-agnostic and computationally efficient. The energy loss functions also inherently respect physical symmetries. We demonstrate our approach on molecular generation and spin ground-state prediction and report significant improvements over baselines.
comment: 10 pages, 4 figures, NeurIPS 2025
☆ Beyond Static Cutoffs: One-Shot Dynamic Thresholding for Diffusion Language Models NeurIPS 2025
Masked diffusion language models (MDLMs) are becoming competitive with their autoregressive counterparts but typically decode with fixed steps and sequential unmasking. To accelerate decoding, recent work such as Fast-dLLM enables parallel decoding via a static global confidence threshold, yet we observe strong block- and step-wise confidence fluctuations and, within a dataset, near-identical confidence trajectories across inputs as measured by cosine similarity. Motivated by these observations, we introduce One-Shot Dynamic Thresholding (OSDT), which calibrates thresholds on a single sequence and applies them to subsequent inputs with negligible overhead. On GPQA, GSM8K, and HumanEval, OSDT attains superior accuracy-throughput trade-offs (+24% tokens/s on GSM8K at the best accuracy, +45% on GPQA with comparable accuracy, and +50% on HumanEval with a modest accuracy gap). Beyond these results, our findings suggest broader opportunities to leverage reusable task-level confidence signatures for more general-purpose algorithmic and systems innovations in diffusion decoding.
comment: 7 pages, NeurIPS 2025 Efficient Reasoning Workshop
☆ Data-driven Learning of Interaction Laws in Multispecies Particle Systems with Gaussian Processes: Convergence Theory and Applications
We develop a Gaussian process framework for learning interaction kernels in multi-species interacting particle systems from trajectory data. Such systems provide a canonical setting for multiscale modeling, where simple microscopic interaction rules generate complex macroscopic behaviors. While our earlier work established a Gaussian process approach and convergence theory for single-species systems, and later extended to second-order models with alignment and energy-type interactions, the multi-species setting introduces new challenges: heterogeneous populations interact both within and across species, the number of unknown kernels grows, and asymmetric interactions such as predator-prey dynamics must be accommodated. We formulate the learning problem in a nonparametric Bayesian setting and establish rigorous statistical guarantees. Our analysis shows recoverability of the interaction kernels, provides quantitative error bounds, and proves statistical optimality of posterior estimators, thereby unifying and generalizing previous single-species theory. Numerical experiments confirm the theoretical predictions and demonstrate the effectiveness of the proposed approach, highlighting its advantages over existing kernel-based methods. This work contributes a complete statistical framework for data-driven inference of interaction laws in multi-species systems, advancing the broader multiscale modeling program of connecting microscopic particle dynamics with emergent macroscopic behavior.
comment: 40 pages, Appendix 17 pages
☆ Solving cold start in news recommendations: a RippleNet-based system for large scale media outlet
We present a scalable recommender system implementation based on RippleNet, tailored for the media domain with a production deployment in Onet.pl, one of Poland's largest online media platforms. Our solution addresses the cold-start problem for newly published content by integrating content-based item embeddings into the knowledge propagation mechanism of RippleNet, enabling effective scoring of previously unseen items. The system architecture leverages Amazon SageMaker for distributed training and inference, and Apache Airflow for orchestrating data pipelines and model retraining workflows. To ensure high-quality training data, we constructed a comprehensive golden dataset consisting of user and item features and a separate interaction table, all enabling flexible extensions and integration of new signals.
☆ Finding Probably Approximate Optimal Solutions by Training to Estimate the Optimal Values of Subproblems
The paper is about developing a solver for maximizing a real-valued function of binary variables. The solver relies on an algorithm that estimates the optimal objective-function value of instances from the underlying distribution of objectives and their respective sub-instances. The training of the estimator is based on an inequality that facilitates the use of the expected total deviation from optimality conditions as a loss function rather than the objective-function itself. Thus, it does not calculate values of policies, nor does it rely on solved instances.
☆ A Dual-Use Framework for Clinical Gait Analysis: Attention-Based Sensor Optimization and Automated Dataset Auditing
Objective gait analysis using wearable sensors and AI is critical for managing neurological and orthopedic conditions. However, models are vulnerable to hidden dataset biases, and task-specific sensor optimization remains a challenge. We propose a multi-stream attention-based deep learning framework that functions as both a sensor optimizer and an automated data auditor. Applied to the Voisard et al. (2025) multi-cohort gait dataset on four clinical tasks (PD, OA, CVA screening; PD vs CVA differential), the model's attention mechanism quantitatively discovered a severe dataset confound. For OA and CVA screening, tasks where bilateral assessment is clinically essential, the model assigned more than 70 percent attention to the Right Foot while statistically ignoring the Left Foot (less than 0.1 percent attention, 95 percent CI [0.0-0.1]). This was not a clinical finding but a direct reflection of a severe laterality bias (for example, 15 of 15 right-sided OA) in the public dataset. The primary contribution of this work is methodological, demonstrating that an interpretable framework can automatically audit dataset integrity. As a secondary finding, the model proposes novel, data-driven sensor synergies (for example, Head plus Foot for PD screening) as hypotheses for future optimized protocols.
☆ Regularization Through Reasoning: Systematic Improvements in Language Model Classification via Explanation-Enhanced Fine-Tuning
Fine-tuning LLMs for classification typically maps inputs directly to labels. We ask whether attaching brief explanations to each label during fine-tuning yields better models. We evaluate conversational response quality along three axes: naturalness, comprehensiveness, and on-topic adherence, each rated on 5-point scales. Using ensemble-generated data from multiple LLMs, we fine-tune a 7B-parameter model and test across six diverse conversational datasets. Across 18 dataset, task settings, label-plus-explanation training outperforms label-only baselines. A central and unexpected result concerns random tokens. We replace human-written explanations with text that is syntactically incoherent yet vocabulary-aligned with the originals (e.g., shuffled or bag-of-words variants). Despite lacking semantics, these pseudo-explanations still improve accuracy over label-only training and often narrow much of the gap to true explanations. The effect persists across datasets and training seeds, indicating that gains arise less from meaning than from structure: the extra token budget encourages richer intermediate computation and acts as a regularizer that reduces over-confident shortcuts. Internal analyses support this view: explanation-augmented models exhibit higher activation entropy in intermediate layers alongside sharper predictive mass at the output layer, consistent with increased deliberation before decision. Overall, explanation-augmented fine-tuning, whether with genuine rationales or carefully constructed random token sequences, improves accuracy and reliability for LLM classification while clarifying how token-level scaffolding shapes computation during inference.
☆ Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
Bad charactors when submitting to arXiv: Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch's compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.
comment: Submitted to MLSys 2026
☆ Quantum-Enhanced Generative Models for Rare Event Prediction
Rare events such as financial crashes, climate extremes, and biological anomalies are notoriously difficult to model due to their scarcity and heavy-tailed distributions. Classical deep generative models often struggle to capture these rare occurrences, either collapsing low-probability modes or producing poorly calibrated uncertainty estimates. In this work, we propose the Quantum-Enhanced Generative Model (QEGM), a hybrid classical-quantum framework that integrates deep latent-variable models with variational quantum circuits. The framework introduces two key innovations: (1) a hybrid loss function that jointly optimizes reconstruction fidelity and tail-aware likelihood, and (2) quantum randomness-driven noise injection to enhance sample diversity and mitigate mode collapse. Training proceeds via a hybrid loop where classical parameters are updated through backpropagation while quantum parameters are optimized using parameter-shift gradients. We evaluate QEGM on synthetic Gaussian mixtures and real-world datasets spanning finance, climate, and protein structure. Results demonstrate that QEGM reduces tail KL divergence by up to 50 percent compared to state-of-the-art baselines (GAN, VAE, Diffusion), while improving rare-event recall and coverage calibration. These findings highlight the potential of QEGM as a principled approach for rare-event prediction, offering robustness beyond what is achievable with purely classical methods.
comment: IEEE Conference COMCOMAP 2025
☆ Predicting Microbial Interactions Using Graph Neural Networks NeurIPS 2025
Predicting interspecies interactions is a key challenge in microbial ecology, as these interactions are critical to determining the structure and activity of microbial communities. In this work, we used data on monoculture growth capabilities, interactions with other species, and phylogeny to predict a negative or positive effect of interactions. More precisely, we used one of the largest available pairwise interaction datasets to train our models, comprising over 7,500 interactions be- tween 20 species from two taxonomic groups co-cultured under 40 distinct carbon conditions, with a primary focus on the work of Nestor et al.[28 ]. In this work, we propose Graph Neural Networks (GNNs) as a powerful classifier to predict the direction of the effect. We construct edge-graphs of pairwise microbial interactions in order to leverage shared information across individual co-culture experiments, and use GNNs to predict modes of interaction. Our model can not only predict binary interactions (positive/negative) but also classify more complex interaction types such as mutualism, competition, and parasitism. Our initial results were encouraging, achieving an F1-score of 80.44%. This significantly outperforms comparable methods in the literature, including conventional Extreme Gradient Boosting (XGBoost) models, which reported an F1-score of 72.76%.
comment: 9 pages, 3 figures, NeurIPS 2025 Workshop New Perspectives in Graph Machine Learning
☆ RobustFSM: Submodular Maximization in Federated Setting with Malicious Clients
Submodular maximization is an optimization problem benefiting many machine learning applications, where we seek a small subset best representing an extremely large dataset. We focus on the federated setting where the data are locally owned by decentralized clients who have their own definitions for the quality of representability. This setting requires repetitive aggregation of local information computed by the clients. While the main motivation is to respect the privacy and autonomy of the clients, the federated setting is vulnerable to client misbehaviors: malicious clients might share fake information. An analogy is backdoor attack in conventional federated learning, but our challenge differs freshly due to the unique characteristics of submodular maximization. We propose RobustFSM, a federated submodular maximization solution that is robust to various practical client attacks. Its performance is substantiated with an empirical evaluation study using real-world datasets. Numerical results show that the solution quality of RobustFSM substantially exceeds that of the conventional federated algorithm when attacks are severe. The degree of this improvement depends on the dataset and attack scenarios, which can be as high as 200%
comment: 9 pages
☆ Path-Coordinated Continual Learning with Neural Tangent Kernel-Justified Plasticity: A Theoretical Framework with Near State-of-the-Art Performance
Catastrophic forgetting is one of the fundamental issues of continual learning because neural networks forget the tasks learned previously when trained on new tasks. The proposed framework is a new path-coordinated framework of continual learning that unites the Neural Tangent Kernel (NTK) theory of principled plasticity bounds, statistical validation by Wilson confidence intervals, and evaluation of path quality by the use of multiple metrics. Experimental evaluation shows an average accuracy of 66.7% at the cost of 23.4% catastrophic forgetting on Split-CIFAR10, a huge improvement over the baseline and competitive performance achieved, which is very close to state-of-the-art results. Further, it is found out that NTK condition numbers are predictive indicators of learning capacity limits, showing the existence of a critical threshold at condition number $>10^{11}$. It is interesting to note that the proposed strategy shows a tendency of lowering forgetting as the sequence of tasks progresses (27% to 18%), which is a system stabilization. The framework validates 80% of discovered paths with a rigorous statistical guarantee and maintains 90-97% retention on intermediate tasks. The core capacity limits of the continual learning environment are determined in the analysis, and actionable insights to enhance the adaptive regularization are offered.
comment: Under review, IEEE Letters
☆ Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior
Recent work has discovered that large language models can develop broadly misaligned behaviors after being fine-tuned on narrowly harmful datasets, a phenomenon known as emergent misalignment (EM). However, the fundamental mechanisms enabling such harmful generalization across disparate domains remain poorly understood. In this work, we adopt a geometric perspective to study EM and demonstrate that it exhibits a fundamental cross-task linear structure in how harmful behavior is encoded across different datasets. Specifically, we find a strong convergence in EM parameters across tasks, with the fine-tuned weight updates showing relatively high cosine similarities, as well as shared lower-dimensional subspaces as measured by their principal angles and projection overlaps. Furthermore, we also show functional equivalence via linear mode connectivity, wherein interpolated models across narrow misalignment tasks maintain coherent, broadly misaligned behavior. Our results indicate that EM arises from different narrow tasks discovering the same set of shared parameter directions, suggesting that harmful behaviors may be organized into specific, predictable regions of the weight landscape. By revealing this fundamental connection between parametric geometry and behavioral outcomes, we hope our work catalyzes further research on parameter space interpretability and weight-based interventions.
☆ TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding
Speculative decoding accelerates LLMs by using a lightweight draft model to generate tokens autoregressively before verifying them in parallel with a larger target model. However, determining the optimal number of tokens to draft remains a key challenge limiting the approach's effectiveness. Dynamic speculative decoding aims to intelligently decide how many tokens to draft to achieve maximum speedups. Existing methods often rely on hand-tuned, sensitive thresholds (e.g., token entropy), which are costly to set and generalize poorly across models and domains. We propose TapOut, an online, training-free, plug-and-play algorithm for dynamic speculation policy selection using multi-armed bandits. Our approach employs a meta-algorithm that selects among multiple parameter-free dynamic speculation strategies based on past reward and exploration. We conduct extensive experiments across diverse model pairs and datasets, showing that TapOut achieves competitive or superior speedups compared to well-established dynamic speculation baselines without any hyperparameter tuning.
comment: 9 pages, 6 figures, 5 tables
☆ Bulk-boundary decomposition of neural networks
We present the bulk-boundary decomposition as a new framework for understanding the training dynamics of deep neural networks. Starting from the stochastic gradient descent formulation, we show that the Lagrangian can be reorganized into a data-independent bulk term and a data-dependent boundary term. The bulk captures the intrinsic dynamics set by network architecture and activation functions, while the boundary reflects stochastic interactions from training samples at the input and output layers. This decomposition exposes the local and homogeneous structure underlying deep networks. As a natural extension, we develop a field-theoretic formulation of neural dynamics based on this decomposition.
comment: 6 pages, 2 figures
☆ SEAL - A Symmetry EncourAging Loss for High Energy Physics
Physical symmetries provide a strong inductive bias for constructing functions to analyze data. In particular, this bias may improve robustness, data efficiency, and interpretability of machine learning models. However, building machine learning models that explicitly respect symmetries can be difficult due to the dedicated components required. Moreover, real-world experiments may not exactly respect fundamental symmetries at the level of finite granularities and energy thresholds. In this work, we explore an alternative approach to create symmetry-aware machine learning models. We introduce soft constraints that allow the model to decide the importance of added symmetries during the learning process instead of enforcing exact symmetries. We investigate two complementary approaches, one that penalizes the model based on specific transformations of the inputs and one inspired by group theory and infinitesimal transformations of the inputs. Using top quark jet tagging and Lorentz equivariance as examples, we observe that the addition of the soft constraints leads to more robust performance while requiring negligible changes to current state-of-the-art models.
☆ Stability of mixed-state phases under weak decoherence
We prove that the Gibbs states of classical, and commuting-Pauli, Hamiltonians are stable under weak local decoherence: i.e., we show that the effect of the decoherence can be locally reversed. In particular, our conclusions apply to finite-temperature equilibrium critical points and ordered low-temperature phases. In these systems the unconditional spatio-temporal correlations are long-range, and local (e.g., Metropolis) dynamics exhibits critical slowing down. Nevertheless, our results imply the existence of local "decoders" that undo the decoherence, when the decoherence strength is below a critical value. An implication of these results is that thermally stable quantum memories have a threshold against decoherence that remains nonzero as one approaches the critical temperature. Analogously, in diffusion models, stability of data distributions implies the existence of computationally-efficent local denoisers in the late-time generation dynamics.
comment: 25 pages, 3 figures
☆ Coordinate ascent neural Kalman-MLE for state estimation
This paper presents a coordinate ascent algorithm to learn dynamic and measurement models in dynamic state estimation using maximum likelihood estimation in a supervised manner. In particular, the dynamic and measurement models are assumed to be Gaussian and the algorithm learns the neural network parameters that model the dynamic and measurement functions, and also the noise covariance matrices. The trained dynamic and measurement models are then used with a non-linear Kalman filter algorithm to estimate the state during the testing phase.
☆ Proximal Regret and Proximal Correlated Equilibria: A New Tractable Solution Concept for Online Learning and Games NeurIPS
Learning and computation of equilibria are central problems in algorithmic game theory. In this work, we introduce proximal regret, a new notion of regret based on proximal operators that lies strictly between external and swap regret. When every player employs a no-proximal-regret algorithm in a general convex game, the empirical distribution of play converges to proximal correlated equilibria (PCE), a refinement of coarse correlated equilibria. Our framework unifies several emerging notions in online learning and game theory -- such as gradient equilibrium and semicoarse correlated equilibrium -- and introduces new ones. Our main result shows that the classic Online Gradient Descent (GD) algorithm achieves an optimal $O(\sqrt{T})$ bound on proximal regret, revealing that GD, without modification, minimizes a stronger regret notion than external regret. This provides a new explanation for the empirically superior performance of gradient descent in online learning and games. We further extend our analysis to Mirror Descent in the Bregman setting and to Optimistic Gradient Descent, which yields faster convergence in smooth convex games.
comment: This paper presents proximal regret and proximal correlated equilibria results that do not appear in the NeurIPS version of arXiv:2403.08171
☆ Bridging Lifelong and Multi-Task Representation Learning via Algorithm and Complexity Measure
In lifelong learning, a learner faces a sequence of tasks with shared structure and aims to identify and leverage it to accelerate learning. We study the setting where such structure is captured by a common representation of data. Unlike multi-task learning or learning-to-learn, where tasks are available upfront to learn the representation, lifelong learning requires the learner to make use of its existing knowledge while continually gathering partial information in an online fashion. In this paper, we consider a generalized framework of lifelong representation learning. We propose a simple algorithm that uses multi-task empirical risk minimization as a subroutine and establish a sample complexity bound based on a new notion we introduce--the task-eluder dimension. Our result applies to a wide range of learning problems involving general function classes. As concrete examples, we instantiate our result on classification and regression tasks under noise.
☆ Interpretable Machine Learning for Reservoir Water Temperatures in the U.S. Red River Basin of the South
Accurate prediction of Reservoir Water Temperature (RWT) is vital for sustainable water management, ecosystem health, and climate resilience. Yet, prediction alone offers limited insight into the governing physical processes. To bridge this gap, we integrated explainable machine learning (ML) with symbolic modeling to uncover the drivers of RWT dynamics across ten reservoirs in the Red River Basin, USA, using over 10,000 depth-resolved temperature profiles. We first employed ensemble and neural models, including Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Multilayer Perceptron (MLP), achieving high predictive skill (best RMSE = 1.20 degree Celsius, R^2 = 0.97). Using SHAP (SHapley Additive exPlanations), we quantified the contribution of physical drivers such as air temperature, depth, wind, and lake volume, revealing consistent patterns across reservoirs. To translate these data-driven insights into compact analytical expressions, we developed Kolmogorov Arnold Networks (KANs) to symbolically approximate RWT. Ten progressively complex KAN equations were derived, improving from R^2 = 0.84 using a single predictor (7-day antecedent air temperature) to R^2 = 0.92 with ten predictors, though gains diminished beyond five, highlighting a balance between simplicity and accuracy. The resulting equations, dominated by linear and rational forms, incrementally captured nonlinear behavior while preserving interpretability. Depth consistently emerged as a secondary but critical predictor, whereas precipitation had limited effect. By coupling predictive accuracy with explanatory power, this framework demonstrates how KANs and explainable ML can transform black-box models into transparent surrogates that advance both prediction and understanding of reservoir thermal dynamics.
☆ Priors in Time: Missing Inductive Biases for Language Model Interpretability
Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions, it is unclear if this assumption can capture the rich temporal structure of language. Specifically, via a Bayesian lens, we demonstrate that Sparse Autoencoders (SAEs) impose priors that assume independence of concepts across time, implying stationarity. Meanwhile, language model representations exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with the priors of SAEs. Taking inspiration from computational neuroscience, we introduce a new interpretability objective -- Temporal Feature Analysis -- which possesses a temporal inductive bias to decompose representations at a given time into two parts: a predictable component, which can be inferred from the context, and a residual component, which captures novel information unexplained by the context. Temporal Feature Analyzers correctly parse garden path sentences, identify event boundaries, and more broadly delineate abstract, slow-moving information from novel, fast-moving information, while existing SAEs show significant pitfalls in all the above tasks. Overall, our results underscore the need for inductive biases that match the data in designing robust interpretability tools.
comment: Preprint
☆ Towards Multi-Fidelity Scaling Laws of Neural Surrogates in CFD
Scaling laws describe how model performance grows with data, parameters and compute. While large datasets can usually be collected at relatively low cost in domains such as language or vision, scientific machine learning is often limited by the high expense of generating training data through numerical simulations. However, by adjusting modeling assumptions and approximations, simulation fidelity can be traded for computational cost, an aspect absent in other domains. We investigate this trade-off between data fidelity and cost in neural surrogates using low- and high-fidelity Reynolds-Averaged Navier-Stokes (RANS) simulations. Reformulating classical scaling laws, we decompose the dataset axis into compute budget and dataset composition. Our experiments reveal compute-performance scaling behavior and exhibit budget-dependent optimal fidelity mixes for the given dataset configuration. These findings provide the first study of empirical scaling laws for multi-fidelity neural surrogate datasets and offer practical considerations for compute-efficient dataset generation in scientific machine learning.
☆ Simulating Environments with Reasoning Models for Agent Training
LLM agents excel in compact environments requiring deep reasoning but remain brittle when operating in broader, more complex contexts that demand robustness across diverse tools and schemas. Building bespoke environments for training is heavy, brittle, and limits progress. In this paper, we demonstrate that LLMs can simulate realistic environment feedback without access to actual testbed data or APIs. Inspired by this capability, we propose two frameworks: Simia-SFT, a pipeline that synthesizes SFT data by amplifying small seed sets into diverse trajectories in an environment-agnostic manner, and Simia-RL, a framework that enables RL training without real environment implementations through LLM-simulated feedback. Fine-tuning open models yields consistent improvements across multiple benchmarks, surpassing GPT-4o and approaching o4-mini on $\tau^2$-Bench. Together, Simia-SFT and Simia-RL enable scalable agent training without environment engineering, replacing heavy and brittle implementations with flexible LLM-based simulation.
☆ Machine and Deep Learning for Indoor UWB Jammer Localization
Ultra-wideband (UWB) localization delivers centimeter-scale accuracy but is vulnerable to jamming attacks, creating security risks for asset tracking and intrusion detection in smart buildings. Although machine learning (ML) and deep learning (DL) methods have improved tag localization, localizing malicious jammers within a single room and across changing indoor layouts remains largely unexplored. Two novel UWB datasets, collected under original and modified room configurations, are introduced to establish comprehensive ML/DL baselines. Performance is rigorously evaluated using a variety of classification and regression metrics. On the source dataset with the collected UWB features, Random Forest achieves the highest F1-macro score of 0.95 and XGBoost achieves the lowest mean Euclidean error of 20.16 cm. However, deploying these source-trained models in the modified room layout led to severe performance degradation, with XGBoost's mean Euclidean error increasing tenfold to 207.99 cm, demonstrating significant domain shift. To mitigate this degradation, a domain-adversarial ConvNeXt autoencoder (A-CNT) is proposed that leverages a gradient-reversal layer to align CIR-derived features across domains. The A-CNT framework restores localization performance by reducing the mean Euclidean error to 34.67 cm. This represents a 77 percent improvement over non-adversarial transfer learning and an 83 percent improvement over the best baseline, restoring the fraction of samples within 30 cm to 0.56. Overall, the results demonstrate that adversarial feature alignment enables robust and transferable indoor jammer localization despite environmental changes. Code and dataset available at https://github.com/afbf4c8996f/Jammer-Loc
comment: Accepted at the 20th International Conference on Risks and Security of Internet and Systems (CRiSIS 2025, Gatineau-Canada, https://crisis2025.uqo.ca/). The paper will soon be published as post-proceedings in Springer's LNCS
☆ No-rank Tensor Decomposition Using Metric Learning
Tensor decomposition faces fundamental challenges in analyzing high-dimensional data, where traditional methods based on reconstruction and fixed-rank constraints often fail to capture semantically meaningful structures. This paper introduces a no-rank tensor decomposition framework grounded in metric learning, which replaces reconstruction objectives with a discriminative, similarity-based optimization. The proposed approach learns data-driven embeddings by optimizing a triplet loss with diversity and uniformity regularization, creating a feature space where distance directly reflects semantic similarity. We provide theoretical guarantees for the framework's convergence and establish bounds on its metric properties. Evaluations across diverse domains -- including face recognition (LFW, Olivetti), brain connectivity analysis (ABIDE), and simulated data (galaxy morphology, crystal structures) -- demonstrate that our method outperforms baseline techniques, including PCA, t-SNE, UMAP, and tensor decomposition baselines (CP and Tucker). Results show substantial improvements in clustering metrics (Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index, Separation Ratio, Adjusted Rand Index, Normalized Mutual Information) and reveal a fundamental trade-off: while metric learning optimizes global class separation, it deliberately transforms local geometry to align with semantic relationships. Crucially, our approach achieves superior performance with smaller training datasets compared to transformer-based methods, offering an efficient alternative for domains with limited labeled data. This work establishes metric learning as a paradigm for tensor-based analysis, prioritizing semantic relevance over pixel-level fidelity while providing computational advantages in data-scarce scenarios.
☆ KV Cache Transform Coding for Compact Storage in LLM Inference
Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20$\times$ compression while maintaining reasoning and long-context accuracy, and 40$\times$ or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, LiveCodeBench, GSM8K, MMLU, Qasper, RULER, and MATH-500. It consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, while achieving higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.
☆ Disciplined Biconvex Programming
We introduce disciplined biconvex programming (DBCP), a modeling framework for specifying and solving biconvex optimization problems. Biconvex optimization problems arise in various applications, including machine learning, signal processing, computational science, and control. Solving a biconvex optimization problem in practice usually resolves to heuristic methods based on alternate convex search (ACS), which iteratively optimizes over one block of variables while keeping the other fixed, so that the resulting subproblems are convex and can be efficiently solved. However, designing and implementing an ACS solver for a specific biconvex optimization problem usually requires significant effort from the user, which can be tedious and error-prone. DBCP extends the principles of disciplined convex programming to biconvex problems, allowing users to specify biconvex optimization problems in a natural way based on a small number of syntax rules. The resulting problem can then be automatically split and transformed into convex subproblems, for which a customized ACS solver is then generated and applied. DBCP allows users to quickly experiment with different biconvex problem formulations, without expertise in convex optimization. We implement DBCP into the open source Python package dbcp, as an extension to the famous domain specific language CVXPY for convex optimization.
☆ Dynamic Reconstruction of Ultrasound-Derived Flow Fields With Physics-Informed Neural Fields
Blood flow is sensitive to disease and provides insight into cardiac function, making flow field analysis valuable for diagnosis. However, while safer than radiation-based imaging and more suitable for patients with medical implants, ultrasound suffers from attenuation with depth, limiting the quality of the image. Despite advances in echocardiographic particle image velocimetry (EchoPIV), accurately measuring blood velocity remains challenging due to the technique's limitations and the complexity of blood flow dynamics. Physics-informed machine learning can enhance accuracy and robustness, particularly in scenarios where noisy or incomplete data challenge purely data-driven approaches. We present a physics-informed neural field model with multi-scale Fourier Feature encoding for estimating blood flow from sparse and noisy ultrasound data without requiring ground truth supervision. We demonstrate that this model achieves consistently low mean squared error in denoising and inpainting both synthetic and real datasets, verified against reference flow fields and ground truth flow rate measurements. While physics-informed neural fields have been widely used to reconstruct medical images, applications to medical flow reconstruction are mostly prominent in Flow MRI. In this work, we adapt methods that have proven effective in other imaging modalities to address the specific challenge of ultrasound-based flow reconstruction.
comment: 29 pages, 18 figures
☆ Bayesian Coreset Optimization for Personalized Federated Learning ICLR 2024
In a distributed machine learning setting like Federated Learning where there are multiple clients involved which update their individual weights to a single central server, often training on the entire individual client's dataset for each client becomes cumbersome. To address this issue we propose $\methodprop$: a personalized coreset weighted federated learning setup where the training updates for each individual clients are forwarded to the central server based on only individual client coreset based representative data points instead of the entire client data. Through theoretical analysis we present how the average generalization error is minimax optimal up to logarithm bounds (upper bounded by $\mathcal{O}(n_k^{-\frac{2 \beta}{2 \beta+\boldsymbol{\Lambda}}} \log ^{2 \delta^{\prime}}(n_k))$) and lower bounds of $\mathcal{O}(n_k^{-\frac{2 \beta}{2 \beta+\boldsymbol{\Lambda}}})$, and how the overall generalization error on the data likelihood differs from a vanilla Federated Learning setup as a closed form function ${\boldsymbol{\Im}}(\boldsymbol{w}, n_k)$ of the coreset weights $\boldsymbol{w}$ and coreset sample size $n_k$. Our experiments on different benchmark datasets based on a variety of recent personalized federated learning architectures show significant gains as compared to random sampling on the training data followed by federated learning, thereby indicating how intelligently selecting such training samples can help in performance. Additionally, through experiments on medical datasets our proposed method showcases some gains as compared to other submodular optimization based approaches used for subset selection on client's data.
comment: 9 pages, 5 figures, ICLR 2024
☆ Hybrid Neural Network-Based Indoor Localisation System for Mobile Robots Using CSI Data in a Robotics Simulator
We present a hybrid neural network model for inferring the position of mobile robots using Channel State Information (CSI) data from a Massive MIMO system. By leveraging an existing CSI dataset, our approach integrates a Convolutional Neural Network (CNN) with a Multilayer Perceptron (MLP) to form a Hybrid Neural Network (HyNN) that estimates 2D robot positions. CSI readings are converted into synthetic images using the TINTO tool. The localisation solution is integrated with a robotics simulator, and the Robot Operating System (ROS), which facilitates its evaluation through heterogeneous test cases, and the adoption of state estimators like Kalman filters. Our contributions illustrate the potential of our HyNN model in achieving precise indoor localisation and navigation for mobile robots in complex environments. The study follows, and proposes, a generalisable procedure applicable beyond the specific use case studied, making it adaptable to different scenarios and datasets.
comment: 13 pages, 7 figures. Conference paper (ROBOVIS 2025)
☆ Addressing prior dependence in hierarchical Bayesian modeling for PTA data analysis II: Noise and SGWB inference through parameter decorrelation
Pulsar Timing Arrays provide a powerful framework to measure low-frequency gravitational waves, but accuracy and robustness of the results are challenged by complex noise processes that must be accurately modeled. Standard PTA analyses assign fixed uniform noise priors to each pulsar, an approach that can introduce systematic biases when combining the array. To overcome this limitation, we adopt a hierarchical Bayesian modeling strategy in which noise priors are parametrized by higher-level hyperparameters. We further address the challenge posed by the correlations between hyperparameters and physical noise parameters, focusing on those describing red noise and dispersion measure variations. To decorrelate these quantities, we introduce an orthogonal reparametrization of the hierarchical model implemented with Normalizing Flows. We also employ i-nessai, a flow-guided nested sampler, to efficiently explore the resulting higher-dimensional parameter space. We apply our method to a minimal 3-pulsar case study, performing a simultaneous inference of noise and SGWB parameters. Despite the limited dataset, the results consistently show that the hierarchical treatment constrains the noise parameters more tightly and partially alleviates the red-noise-SGWB degeneracy, while the orthogonal reparametrization further enhances parameter independence without affecting the correlations intrinsic to the power-law modeling of the physical processes involved.
comment: 18 pages, 7 figures. Submitted to the Astronomy and Computing special issue HPC in Cosmology and Astrophysics
☆ Fractional Diffusion Bridge Models NeurIPS 2025
We present Fractional Diffusion Bridge Models (FDBM), a novel generative diffusion bridge framework driven by an approximation of the rich and non-Markovian fractional Brownian motion (fBM). Real stochastic processes exhibit a degree of memory effects (correlations in time), long-range dependencies, roughness and anomalous diffusion phenomena that are not captured in standard diffusion or bridge modeling due to the use of Brownian motion (BM). As a remedy, leveraging a recent Markovian approximation of fBM (MA-fBM), we construct FDBM that enable tractable inference while preserving the non-Markovian nature of fBM. We prove the existence of a coupling-preserving generative diffusion bridge and leverage it for future state prediction from paired training data. We then extend our formulation to the Schr\"{o}dinger bridge problem and derive a principled loss function to learn the unpaired data translation. We evaluate FDBM on both tasks: predicting future protein conformations from aligned data, and unpaired image translation. In both settings, FDBM achieves superior performance compared to the Brownian baselines, yielding lower root mean squared deviation (RMSD) of C$_\alpha$ atomic positions in protein structure prediction and lower Fr\'echet Inception Distance (FID) in unpaired image translation.
comment: To appear in NeurIPS 2025 proceedings. This version includes post-camera-ready revisions
☆ Random Initialization of Gated Sparse Adapters ICML 2025
When fine-tuning language models on new tasks, catastrophic forgetting -- performance degradation on previously-learned tasks -- is a ubiquitous problem. While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA address this through low-rank adapters, sparse adaptation offers an alternative that doesn't impose rank constraints. We introduce Random Initialization of Gated Sparse Adapters (RIGSA), which starts from randomly-initialized full-rank adapters, gates them with a ReZero analog, and sparsifies them with iterative magnitude pruning. We evaluate RIGSA on SmolLM2-1.7B-Instruct using a novel vision-in-text task (Textual MNIST) and measure forgetting on PIQA, HellaSwag, and GSM8k. SmolLM2-1.7B-Instruct initially performs around chance level on Textual MNIST, and is capable of learning the task through RIGSA, 4-bit QLoRA and random masking. In spite of having more trainable parameters than QLoRA, the RIGSA configurations that we studied displayed less forgetting than QLoRA, particularly on GSM8k, though it performs comparably to random masking.
comment: 13 pages (8 main), 6 figures (4 main). Accepted by NewInML workshop @ ICML 2025 on June 27, 2025
☆ Improving Bayesian inference in PTA data analysis: importance nested sampling with Normalizing Flows
We present a detailed study of Bayesian inference workflows for pulsar timing array data with a focus on enhancing efficiency, robustness and speed through the use of normalizing flow-based nested sampling. Building on the Enterprise framework, we integrate the i-nessai sampler and benchmark its performance on realistic, simulated datasets. We analyze its computational scaling and stability, and show that it achieves accurate posteriors and reliable evidence estimates with substantially reduced runtime, by up to three orders of magnitude depending on the dataset configuration, with respect to conventional single-core parallel-tempering MCMC analyses. These results highlight the potential of flow-based nested sampling to accelerate PTA analyses while preserving the quality of the inference.
comment: 37 pages, 7 figures, 3 tables. Submitted to the Astronomy and Computing special issue HPC in Cosmology and Astrophysics
☆ ADNAC: Audio Denoiser using Neural Audio Codec
Audio denoising is critical in signal processing, enhancing intelligibility and fidelity for applications like restoring musical recordings. This paper presents a proof-of-concept for adapting a state-of-the-art neural audio codec, the Descript Audio Codec (DAC), for music denoising. This work overcomes the limitations of traditional architectures like U-Nets by training the model on a large-scale, custom-synthesized dataset built from diverse sources. Training is guided by a multi objective loss function that combines time-domain, spectral, and signal-level fidelity metrics. Ultimately, this paper aims to present a PoC for high-fidelity, generative audio restoration.
comment: Accepted and presented at the 13th International Conference on Speech Technology and Human-Computer Dialogue (SpeD), Cluj-Napoca, Romania, October 19-22, 2025. 4 pages, 1 figure. IEEE Catalog Number: CFP2555H-USB, ISBN: 979-8-3315-7485-7
♻ ☆ GTAlign: Game-Theoretic Alignment of LLM Assistants for Social Welfare
Large Language Models (LLMs) have achieved remarkable progress in reasoning, yet sometimes produce responses that are suboptimal for users in tasks such as writing, information seeking, or providing practical guidance. Conventional alignment practices typically assume that maximizing model reward also maximizes user welfare, but this assumption frequently fails in practice: models may over-clarify or generate overly verbose reasoning when users prefer concise answers. Such behaviors resemble the prisoner's dilemma, where individually rational choices lead to socially suboptimal outcomes. The fundamental challenge is the lack of a principled decision making mechanism that mutually benefits both the LLM and the user. We propose Game-Theoretic Alignment (GTAlign), an alignment framework that integrates game-theoretic decision making into both reasoning and training. During reasoning, the model explicitly treats user-LLM interaction as a strategic game: it constructs payoff matrices within its reasoning chain to estimate welfare for both itself and the user, and then selects actions that are mutually beneficial. During training, we introduce a social welfare reward that reinforces cooperative responses, aligning model behavior with socially efficient outcomes. In addition, we introduce an inference technique that leverages game-theoretic reasoning to dynamically adapt LLM's response when pricing policies of LLM service change. Extensive experiments demonstrate that GTAlign substantially improves reasoning efficiency, answer quality, and social welfare compared to baselines across diverse tasks. The code is available at https://github.com/ulab-uiuc/GTAlign .
comment: 31 pages, 6 figures
♻ ☆ TabArena: A Living Benchmark for Machine Learning on Tabular Data NeurIPS 2025
With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.
comment: Accepted (spotlight) at NeurIPS 2025 Datasets and Benchmarks Track. v4: fixed links in comments. v3: NeurIPS camera-ready version. v2: fixed author list. 51 pages. Code available at https://tabarena.ai/code and examples at https://tabarena.ai/code-examples and dataset curation at https://tabarena.ai/data-tabular-ml-iid-study and https://tabarena.ai/dataset-curation
♻ ☆ Cold-Start Active Preference Learning in Socio-Economic Domains
Active preference learning offers an efficient approach to modeling preferences, but it is hindered by the cold-start problem, which leads to a marked decline in performance when no initial labeled data are available. While cold-start solutions have been proposed for domains such as vision and text, the cold-start problem in active preference learning remains largely unexplored, underscoring the need for practical, effective methods. Drawing inspiration from established practices in social and economic research, the proposed method initiates learning with a self-supervised phase that employs Principal Component Analysis (PCA) to generate initial pseudo-labels. This process produces a \say{warmed-up} model based solely on the data's intrinsic structure, without requiring expert input. The model is then refined through an active learning loop that strategically queries a simulated noisy oracle for labels. Experiments conducted on various socio-economic datasets, including those related to financial credibility, career success rate, and socio-economic status, consistently show that the PCA-driven approach outperforms standard active learning strategies that start without prior information. This work thus provides a computationally efficient and straightforward solution that effectively addresses the cold-start problem.
♻ ☆ RELATE: A Schema-Agnostic Perceiver Encoder for Multimodal Relational Graphs
Relational multi-table data is common in domains such as e-commerce, healthcare, and scientific research, and can be naturally represented as heterogeneous temporal graphs with multi-modal node attributes. Existing graph neural networks (GNNs) rely on schema-specific feature encoders, requiring separate modules for each node type and feature column, which hinders scalability and parameter sharing. We introduce RELATE (Relational Encoder for Latent Aggregation of Typed Entities), a schema-agnostic, plug-and-play feature encoder that can be used with any general purpose GNN. RELATE employs shared modality-specific encoders for categorical, numerical, textual, and temporal attributes, followed by a Perceiver-style cross-attention module that aggregates features into a fixed-size, permutation-invariant node representation. We evaluate RELATE on ReLGNN and HGT in the RelBench benchmark, where it achieves performance within 3% of schema-specific encoders while reducing parameter counts by up to 5x. This design supports varying schemas and enables multi-dataset pretraining for general-purpose GNNs, paving the way toward foundation models for relational graph data.
comment: 6 pages
♻ ☆ SimKey: A Semantically Aware Key Module for Watermarking Language Models
The rapid spread of text generated by large language models (LLMs) makes it increasingly difficult to distinguish authentic human writing from machine output. Watermarking offers a promising solution: model owners can embed an imperceptible signal into generated text, marking its origin. Most leading approaches seed an LLM's next-token sampling with a pseudo-random key that can later be recovered to identify the text as machine-generated, while only minimally altering the model's output distribution. However, these methods suffer from two related issues: (i) watermarks are brittle to simple surface-level edits such as paraphrasing or reordering; and (ii) adversaries can append unrelated, potentially harmful text that inherits the watermark, risking reputational damage to model owners. To address these issues, we introduce SimKey, a semantic key module that strengthens watermark robustness by tying key generation to the meaning of prior context. SimKey uses locality-sensitive hashing over semantic embeddings to ensure that paraphrased text yields the same watermark key, while unrelated or semantically shifted text produces a different one. Integrated with state-of-the-art watermarking schemes, SimKey improves watermark robustness to paraphrasing and translation while preventing harmful content from false attribution, establishing semantic-aware keying as a practical and extensible watermarking direction.
♻ ☆ Automotive Crash Dynamics Modeling Accelerated with Machine Learning
Crashworthiness assessment is a critical aspect of automotive design, traditionally relying on high-fidelity finite element (FE) simulations that are computationally expensive and time-consuming. This work presents an exploratory comparative study on developing machine learning-based surrogate models for efficient prediction of structural deformation in crash scenarios using the NVIDIA PhysicsNeMo framework. Given the limited prior work applying machine learning to structural crash dynamics, the primary contribution lies in demonstrating the feasibility and engineering utility of the various modeling approaches explored in this work. We investigate two state-of-the-art neural network architectures for modeling crash dynamics: MeshGraphNet, and Transolver. Additionally, we examine three strategies for modeling transient dynamics: time-conditional, the standard Autoregressive approach, and a stability-enhanced Autoregressive scheme incorporating rollout-based training. The models are evaluated on a comprehensive Body-in-White (BIW) crash dataset comprising 150 detailed FE simulations using LS-DYNA. The dataset represents a structurally rich vehicle assembly with over 200 components, including 38 key components featuring variable thickness distributions to capture realistic manufacturing variability. Each model utilizes the undeformed mesh geometry and component characteristics as inputs to predict the spatiotemporal evolution of the deformed mesh during the crash sequence. Evaluation results show that the models capture the overall deformation trends with reasonable fidelity, demonstrating the feasibility of applying machine learning to structural crash dynamics. Although not yet matching full FE accuracy, the models achieve orders-of-magnitude reductions in computational cost, enabling rapid design exploration and early-stage optimization in crashworthiness evaluation.
♻ ☆ CosmoBench: A Multiscale, Multiview, Multitask Cosmology Benchmark for Geometric Deep Learning NeurIPS 2025
Cosmological simulations provide a wealth of data in the form of point clouds and directed trees. A crucial goal is to extract insights from this data that shed light on the nature and composition of the Universe. In this paper we introduce CosmoBench, a benchmark dataset curated from state-of-the-art cosmological simulations whose runs required more than 41 million core-hours and generated over two petabytes of data. CosmoBench is the largest dataset of its kind: it contains 34 thousand point clouds from simulations of dark matter halos and galaxies at three different length scales, as well as 25 thousand directed trees that record the formation history of halos on two different time scales. The data in CosmoBench can be used for multiple tasks -- to predict cosmological parameters from point clouds and merger trees, to predict the velocities of individual halos and galaxies from their collective positions, and to reconstruct merger trees on finer time scales from those on coarser time scales. We provide several baselines on these tasks, some based on established approaches from cosmological modeling and others rooted in machine learning. For the latter, we study different approaches -- from simple linear models that are minimally constrained by symmetries to much larger and more computationally-demanding models in deep learning, such as graph neural networks. We find that least-squares fits with a handful of invariant features sometimes outperform deep architectures with many more parameters and far longer training time. Still there remains tremendous potential to improve these baselines by combining machine learning and cosmology to fully exploit the data. CosmoBench sets the stage for bridging cosmology and geometric deep learning at scale. We invite the community to push the frontier of scientific discovery by engaging with this dataset, available at https://cosmobench.streamlit.app
comment: Accepted at NeurIPS 2025
♻ ☆ PO-CKAN:Physics Informed Deep Operator Kolmogorov Arnold Networks with Chunk Rational Structure
We propose PO-CKAN, a physics-informed deep operator framework based on Chunkwise Rational Kolmogorov--Arnold Networks (KANs), for approximating the solution operators of partial differential equations. This framework leverages a Deep Operator Network (DeepONet) architecture that incorporates Chunkwise Rational Kolmogorov-Arnold Network (CKAN) sub-networks for enhanced function approximation. The principles of Physics-Informed Neural Networks (PINNs) are integrated into the operator learning framework to enforce physical consistency. This design enables the efficient learning of physically consistent spatio-temporal solution operators and allows for rapid prediction for parametric time-dependent PDEs with varying inputs (e.g., parameters, initial/boundary conditions) after training. Validated on challenging benchmark problems, PO-CKAN demonstrates accurate operator learning with results closely matching high-fidelity solutions. PO-CKAN adopts a DeepONet-style branch--trunk architecture with its sub-networks instantiated as rational KAN modules, and enforces physical consistency via a PDE residual (PINN-style) loss. On Burgers' equation with $\nu=0.01$, PO-CKAN reduces the mean relative $L^2$ error by approximately 48\% compared to PI-DeepONet, and achieves competitive accuracy on the Eikonal and diffusion--reaction benchmarks.
♻ ☆ MarsLGPR: Mars Rover Localization with Ground Penetrating Radar
In this work, we propose the use of Ground Penetrating Radar (GPR) for rover localization on Mars. Precise pose estimation is an important task for mobile robots exploring planetary surfaces, as they operate in GPS-denied environments. Although visual odometry provides accurate localization, it is computationally expensive and can fail in dim or high-contrast lighting. Wheel encoders can also provide odometry estimation, but are prone to slipping on the sandy terrain encountered on Mars. Although traditionally a scientific surveying sensor, GPR has been used on Earth for terrain classification and localization through subsurface feature matching. The Perseverance rover and the upcoming ExoMars rover have GPR sensors already equipped to aid in the search of water and mineral resources. We propose to leverage GPR to aid in Mars rover localization. Specifically, we develop a novel GPR-based deep learning model that predicts 1D relative pose translation. We fuse our GPR pose prediction method with inertial and wheel encoder data in a filtering framework to output rover localization. We perform experiments in a Mars analog environment and demonstrate that our GPR-based displacement predictions both outperform wheel encoders and improve multi-modal filtering estimates in high-slip environments. Lastly, we present the first dataset aimed at GPR-based localization in Mars analog environments, which will be made publicly available at https://umfieldrobotics.github.io/marslgpr.
♻ ☆ AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement
We introduce AnyEnhance, a unified generative model for voice enhancement that processes both speech and singing voices. Based on a masked generative model, AnyEnhance is capable of handling both speech and singing voices, supporting a wide range of enhancement tasks including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all simultaneously and without fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning, which allows the model to natively accept a reference speaker's timbre. In this way, it could boost enhancement performance when a reference audio is available and enable the target speaker extraction task without altering the underlying architecture. Moreover, we also introduce a self-critic mechanism into the generative process for masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments on various enhancement tasks demonstrate AnyEnhance outperforms existing methods in terms of both objective metrics and subjective listening tests. Demo audios are publicly available at https://amphionspace.github.io/anyenhance. An open-source implementation is provided at https://github.com/viewfinder-annn/anyenhance-v1-ccf-aatc.
comment: Accepted by IEEE TASLP 2025. Demopage: https://amphionspace.github.io/anyenhance. Open-source implementation: https://github.com/viewfinder-annn/anyenhance-v1-ccf-aatc
♻ ☆ Benchmarking LLMs in Web API Integration Tasks
API integration is a cornerstone of our digital infrastructure, enabling software systems to connect and interact. However, as shown by many studies, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. In order to address this, we present WAPIIBench, a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models was able to solve more than 40% of the tasks.
comment: To be published in Proceedings of 2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware), Data & Benchmark Track; switched to IEEE conference template
♻ ☆ Electrical Load Forecasting over Multihop Smart Metering Networks with Federated Learning
Electric load forecasting is essential for power management and stability in smart grids. This is mainly achieved via advanced metering infrastructure, where smart meters (SMs) record household energy data. Traditional machine learning (ML) methods are often employed for load forecasting, but require data sharing, which raises data privacy concerns. Federated learning (FL) can address this issue by running distributed ML models at local SMs without data exchange. However, current FL-based approaches struggle to achieve efficient load forecasting due to imbalanced data distribution across heterogeneous SMs. This paper presents a novel personalized federated learning (PFL) method for high-quality load forecasting in metering networks. A meta-learning-based strategy is developed to address data heterogeneity at local SMs in the collaborative training of local load forecasting models. Moreover, to minimize the load forecasting delays in our PFL model, we study a new latency optimization problem based on optimal resource allocation at SMs. A theoretical convergence analysis is also conducted to provide insights into FL design for federated load forecasting. Extensive simulations from real-world datasets show that our method outperforms existing approaches regarding better load forecasting and reduced operational latency costs.
comment: Accepted at IEEE Internet of Things Journal, DOI: 10.1109/JIOT.2025.3586115
♻ ☆ Double Descent Meets Out-of-Distribution Detection: Theoretical Insights and Empirical Analysis on the role of model complexity NeurIPS 2025
Out-of-distribution (OOD) detection is essential for ensuring the reliability and safety of machine learning systems. In recent years, it has received increasing attention, particularly through post-hoc detection and training-based methods. In this paper, we focus on post-hoc OOD detection, which enables identifying OOD samples without altering the model's training procedure or objective. Our primary goal is to investigate the relationship between model capacity and its OOD detection performance. Specifically, we aim to answer the following question: Does the Double Descent phenomenon manifest in post-hoc OOD detection? This question is crucial, as it can reveal whether overparameterization, which is already known to benefit generalization, can also enhance OOD detection. Despite the growing interest in these topics by the classic supervised machine learning community, this intersection remains unexplored for OOD detection. We empirically demonstrate that the Double Descent effect does indeed appear in post-hoc OOD detection. Furthermore, we provide theoretical insights to explain why this phenomenon emerges in such setting. Finally, we show that the overparameterized regime does not yield superior results consistently, and we propose a method to identify the optimal regime for OOD detection based on our observations.
comment: Accepted at NeurIPS 2025 (Conference on Neural Information Processing Systems)
♻ ☆ Combinatorial Creativity: A New Frontier in Generalization Abilities
Artificial intelligence (AI) systems, and Large Language Models (LLMs) in particular, are increasingly employed for creative tasks like scientific idea generation, constituting a form of generalization from training data unaddressed by existing conceptual frameworks. Despite its similarities to compositional generalization (CG), combinatorial creativity (CC) is an open-ended ability. Instead of evaluating for accuracy or correctness against fixed targets, which would contradict the open-ended nature of CC, we propose a theoretical framework and algorithmic task for evaluating outputs by their degrees of novelty and utility. From here, we make several important empirical contributions: (1) We obtain the first insights into the scaling behavior of creativity for LLMs. (2) We discover that, for fixed compute budgets, there exist optimal model depths and widths for creative ability. (3) We find that the ideation-execution gap, whereby LLMs excel at generating novel scientific ideas but struggle to ensure their practical feasibility, may be explained by a more fundamental novelty-utility tradeoff characteristic of creativity algorithms in general. Importantly, this tradeoff remains persistent even at scale, casting doubt on the long-term creative potential of LLMs in their current form. Together, our conceptual framework and empirical findings provide a foundation for understanding and improving creativity in modern AI models, bridging the gap between human and machine intelligence.
comment: Preprint. The first two authors contributed equally
♻ ☆ Towards Personalized Treatment Plan: Geometrical Model-Agnostic Approach to Counterfactual Explanations
In our article, we describe a method for generating counterfactual explanations in high-dimensional spaces using four steps that involve fitting our dataset to a model, finding the decision boundary, determining constraints on the problem, and computing the closest point (counterfactual explanation) from that boundary. We propose a discretized approach where we find many discrete points on the boundary and then identify the closest feasible counterfactual explanation. This method, which we later call $\textit{Optimal Point for Boundary Approximation}$ (OPBA), applies binary search to find decision boundary points and then searches for the closest boundary point. Across four datasets of varying dimensionality, we show that our method can outperform current methods for counterfactual generation with reductions in distance between $5\%$ to $50\%$ in terms of the $L_2$ norm. Our method can also handle real-world constraints by restricting changes to immutable and categorical features, such as age, gender, sex, height, and other related characteristics such as the case for a health-based dataset. In terms of runtime, the OPBA algorithm generates decision boundary points on multiple orders of magnitude in the same given time when we compare to a grid-based approach. In general, our method provides a simple and effective model-agnostic method that can compute nearest feasible (i.e. realistic with constraints) counterfactual explanations. All of our results and code are available at: https://github.com/dsin85691/OPBA_For_Counterfactuals
comment: This paper is 15 pages long consisting of multiple sections including an abstract, introduction, related works, methodology, results, ablation studies, conclusion, future works, and an appendix section. There are 10 figures and 5 tables in total
♻ ☆ OrbitChain: Orchestrating In-orbit Real-time Analytics of Earth Observation Data
Earth observation analytics have the potential to serve many time-sensitive applications. However, due to limited bandwidth and duration of ground-satellite connections, it takes hours or even days to download and analyze data from existing Earth observation satellites, making real-time demands like timely disaster response impossible. Toward real-time analytics, we introduce OrbitChain, a collaborative analytics framework that orchestrates computational resources across multiple satellites in an Earth observation constellation. OrbitChain decomposes analytics applications into microservices and allocates computational resources for time-constrained analysis. A traffic routing algorithm is devised to minimize the inter-satellite communication overhead. OrbitChain adopts a pipeline workflow that completes Earth observation tasks in real-time, facilitates time-sensitive applications and inter-constellation collaborations such as tip-and-cue. To evaluate OrbitChain, we implement a hardware-in-the-loop orbital computing testbed. Experiments show that our system can complete up to 60% analytics workload than existing Earth observation analytics framework while reducing the communication overhead by up to 72%.
comment: currently under review; corrected typo in author na,e
♻ ☆ A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design
Generative models, particularly Large Language Models (LLMs), produce fluent outputs yet lack verifiable guarantees. We adapt Design by Contract (DbC) and type-theoretic principles to introduce a contract layer that mediates every LLM call. Contracts stipulate semantic and type requirements on inputs and outputs, coupled with probabilistic remediation to steer generation toward compliance. The layer exposes the dual view of LLMs as semantic parsers and probabilistic black-box components. Contract satisfaction is probabilistic and semantic validation is operationally defined through programmer-specified conditions on well-typed data structures. More broadly, this work postulates that any two agents satisfying the same contracts are \emph{functionally equivalent} with respect to those contracts.
comment: 4 pages, 1 figure
♻ ☆ Estimation of aboveground biomass in a tropical dry forest: An intercomparison of airborne, unmanned, and space laser scanning
According to the Paris Climate Change Agreement, all nations are required to submit reports on their greenhouse gas emissions and absorption every two years by 2024. Consequently, forests play a crucial role in reducing carbon emissions, which is essential for meeting these obligations. Recognizing the significance of forest conservation in the global battle against climate change, Article 5 of the Paris Agreement emphasizes the need for high-quality forest data. This study focuses on enhancing methods for mapping aboveground biomass in tropical dry forests. Tropical dry forests are considered one of the least understood tropical forest environments; therefore, there is a need for accurate approaches to estimate carbon pools. We employ a comparative analysis of AGB estimates, utilizing different discrete and full-waveform laser scanning datasets in conjunction with Ordinary Least Squares and Bayesian approaches SVM. Airborne Laser Scanning, Unmanned Laser Scanning, and Space Laser Scanning were used as independent variables for extracting forest metrics. Variable selection, SVM regression tuning, and cross-validation via a machine-learning approach were applied to account for overfitting and underfitting. The results indicate that six key variables primarily related to tree height: Elev\.minimum, Elev\.L3, lev\.MAD.mode, Elev\.mode, Elev\.MAD\.median, and Elev\.skewness, are important for AGB estimation using ALSD and ULSD, while Leaf Area Index, canopy coverage and height, terrain elevation, and full-waveform signal energy emerged as the most vital variables. AGB values estimated from ten permanent tropical dry forest plots in Costa Rica Guanacaste province ranged from 26.02 Mg/ha to 175.43 Mg/ha. The SVM regressions demonstrated a 17.89 error across all laser scanning systems, with SLSF W exhibiting the lowest error 17.07 in estimating total biomass per plot.
comment: 32 pages, 17 figures, research paper
♻ ☆ Khiops: An End-to-End, Frugal AutoML and XAI Machine Learning Solution for Large, Multi-Table Databases
Khiops is an open source machine learning tool designed for mining large multi-table databases. Khiops is based on a unique Bayesian approach that has attracted academic interest with more than 20 publications on topics such as variable selection, classification, decision trees and co-clustering. It provides a predictive measure of variable importance using discretisation models for numerical data and value clustering for categorical data. The proposed classification/regression model is a naive Bayesian classifier incorporating variable selection and weight learning. In the case of multi-table databases, it provides propositionalisation by automatically constructing aggregates. Khiops is adapted to the analysis of large databases with millions of individuals, tens of thousands of variables and hundreds of millions of records in secondary tables. It is available on many environments, both from a Python library and via a user interface.
♻ ☆ Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds
In-Context Reinforcement Learning (ICRL) enables agents to learn automatically and on-the-fly from their interactive experiences. However, a major challenge in scaling up ICRL is the lack of scalable task collections. To address this, we propose the procedurally generated tabular Markov Decision Processes, named AnyMDP. Through a carefully designed randomization process, AnyMDP is capable of generating high-quality tasks on a large scale while maintaining relatively low structural biases. To facilitate efficient meta-training at scale, we further introduce decoupled policy distillation and induce prior information in the ICRL framework. Our results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize to tasks that were not considered in the training set through versatile in-context learning paradigms. The scalable task set provided by AnyMDP also enables a more thorough empirical investigation of the relationship between data distribution and ICRL performance. We further show that the generalization of ICRL potentially comes at the cost of increased task diversity and longer adaptation periods. This finding carries critical implications for scaling robust ICRL capabilities, highlighting the necessity of diverse and extensive task design, and prioritizing asymptotic performance over few-shot adaptation.
comment: NeruIPS 2025
♻ ☆ Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies
Reinforcement learning (RL) systems have countless applications, from energy-grid management to protein design. However, such real-world scenarios are often extremely difficult, combinatorial in nature, and require complex coordination between multiple agents. This level of complexity can cause even state-of-the-art RL systems, trained until convergence, to hit a performance ceiling which they are unable to break out of with zero-shot inference. Meanwhile, many digital or simulation-based applications allow for an inference phase that utilises a specific time and compute budget to explore multiple attempts before outputting a final solution. In this work, we show that such an inference phase employed at execution time, and the choice of a corresponding inference strategy, are key to breaking the performance ceiling observed in complex multi-agent RL problems. Our main result is striking: we can obtain up to a 126% and, on average, a 45% improvement over the previous state-of-the-art across 17 tasks, using only a couple seconds of extra wall-clock time during execution. We also demonstrate promising compute scaling properties, supported by over 60k experiments, making it the largest study on inference strategies for complex RL to date. Our experimental data and code are available at https://sites.google.com/view/inference-strategies-rl.
comment: Neurips '25 version
♻ ☆ RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning
Real-world robotic manipulation in homes and factories demands reliability, efficiency, and robustness that approach or surpass skilled human operators. We present RL-100, a real-world reinforcement learning training framework built on diffusion visuomotor policies trained by supervised learning. RL-100 introduces a three-stage pipeline. First, imitation learning leverages human priors. Second, iterative offline reinforcement learning uses an Offline Policy Evaluation procedure, abbreviated OPE, to gate PPO-style updates that are applied in the denoising process for conservative and reliable improvement. Third, online reinforcement learning eliminates residual failure modes. An additional lightweight consistency distillation head compresses the multi-step sampling process in diffusion into a single-step policy, enabling high-frequency control with an order-of-magnitude reduction in latency while preserving task performance. The framework is task-, embodiment-, and representation-agnostic and supports both 3D point clouds and 2D RGB inputs, a variety of robot platforms, and both single-step and action-chunk policies. We evaluate RL-100 on seven real-robot tasks spanning dynamic rigid-body control, such as Push-T and Agile Bowling, fluids and granular pouring, deformable cloth folding, precise dexterous unscrewing, and multi-stage orange juicing. RL-100 attains 100\% success across evaluated trials for a total of 900 out of 900 episodes, including up to 250 out of 250 consecutive trials on one task. The method achieves near-human teleoperation or better time efficiency and demonstrates multi-hour robustness with uninterrupted operation lasting up to two hours.
comment: https://lei-kun.github.io/RL-100/
♻ ☆ Augmenting learning in neuro-embodied systems through neurobiological first principles
Recent progress in artificial intelligence (AI) has been driven by insights from physics and neuroscience, particularly through the development of artificial neural networks (ANNs) capable of complex cognitive tasks such as vision and language processing. Despite these advances, they struggle with continual learning, adaptable knowledge transfer, robustness, and resource efficiency -- capabilities that biological systems handle seamlessly. Specifically, neuromorphic systems and artificial neural networks often overlook two key biophysical properties of neural circuits: neuronal diversity and cell-specific neuromodulation. These mechanisms, essential for regulating dynamic learning across brain scales, allow neuromodulators to introduce degeneracy in biological neural networks, ensuring stability and adaptability under changing conditions. In this article, we summarize recent bioinspired models, learning rules, and architectures, and propose a framework for augmenting ANNs, which has the potential to bridge the gap between neuroscience and AI through neurobiological first principles. Our proposed dual-framework approach leverages spiking neural networks to emulate diverse spiking behaviors and dendritic compartmental dynamics, thereby simulating the morphological and functional diversity of neuronal computations. Finally, we outline how integrating these biophysical principles into task-driven spiking neural networks and neuromorphic systems provides scalable solutions for continual learning, adaptability, robustness, and resource-efficiency. Additionally, this approach will not only provide insights into how emergent behaviors arise in neural networks but also catalyze the development of more efficient, reliable, and intelligent neuromorphic systems and robotic agents.
comment: 26 pages, 4 figures, 3 boxes, 1 table
♻ ☆ Neuro-Symbolic Imitation Learning: Discovering Symbolic Abstractions for Skill Learning ICRA
Imitation learning is a popular method for teaching robots new behaviors. However, most existing methods focus on teaching short, isolated skills rather than long, multi-step tasks. To bridge this gap, imitation learning algorithms must not only learn individual skills but also an abstract understanding of how to sequence these skills to perform extended tasks effectively. This paper addresses this challenge by proposing a neuro-symbolic imitation learning framework. Using task demonstrations, the system first learns a symbolic representation that abstracts the low-level state-action space. The learned representation decomposes a task into easier subtasks and allows the system to leverage symbolic planning to generate abstract plans. Subsequently, the system utilizes this task decomposition to learn a set of neural skills capable of refining abstract plans into actionable robot commands. Experimental results in three simulated robotic environments demonstrate that, compared to baselines, our neuro-symbolic approach increases data efficiency, improves generalization capabilities, and facilitates interpretability.
comment: IEEE International Conference on Robotics and Automation (ICRA) 2025
♻ ☆ What Makes Looped Transformers Perform Better Than Non-Recursive Ones (Provably)
While looped transformers (termed as Looped-Attn) often outperform standard transformers (termed as Single-Attn) on complex reasoning tasks, the theoretical basis for this advantage remains underexplored. In this paper, we explain this phenomenon through the lens of loss landscape geometry, inspired by empirical observations of their distinct dynamics at both sample and Hessian levels. To formalize this, we extend the River-Valley landscape model by distinguishing between U-shaped valleys (flat) and V-shaped valleys (steep). Based on empirical observations, we conjecture that the recursive architecture of Looped-Attn induces a landscape-level inductive bias towards River-V-Valley. Theoretical derivations based on this inductive bias guarantee a better loss convergence along the river due to valley hopping, and further encourage learning about complex patterns compared to the River-U-Valley induced by Single-Attn. Building on this insight, we propose SHIFT (Staged HIerarchical Framework for Progressive Training), a staged training framework that accelerates the training process of Looped-Attn while achieving comparable performances.
♻ ☆ DuSEGO: Dual Second-order Equivariant Graph Ordinary Differential Equation
Graph Neural Networks (GNNs) with equivariant properties have achieved significant success in modeling complex dynamic systems and molecular properties. However, their expressiveness ability is limited by: (1) Existing methods often overlook the over-smoothing issue caused by traditional GNN models, as well as the gradient explosion or vanishing problems in deep GNNs. (2) Most models operate on first-order information, neglecting that the real world often consists of second-order systems, which further limits the model's representation capabilities. To address these issues, we propose the \textbf{Du}al \textbf{S}econd-order \textbf{E}quivariant \textbf{G}raph \textbf{O}rdinary Differential Equation (\method{}) for equivariant representation. Specifically, \method{} apply the dual second-order equivariant graph ordinary differential equations (Graph ODEs) on graph embeddings and node coordinates, simultaneously. Theoretically, we first prove that \method{} maintains the equivariant property. Furthermore, we provide theoretical insights showing that \method{} effectively alleviates the over-smoothing problem in both feature representation and coordinate update. Additionally, we demonstrate that the proposed \method{} mitigates the exploding and vanishing gradients problem, facilitating the training of deep multi-layer GNNs. Extensive experiments on benchmark datasets validate the superiority of the proposed \method{} compared to baselines.
♻ ☆ Task-Oriented Multimodal Token Transmission in Resource-Constrained Multiuser Networks
With the emergence of large model-based agents, widely adopted transformer-based architectures inevitably produce excessively long token embeddings for transmission, which may result in high bandwidth overhead, increased power consumption and latency. In this letter, we propose a task-oriented multimodal token transmission scheme for efficient multimodal information fusion and utilization. To improve the efficiency of token transmission, we design a two-stage training algotithm, including cross-modal alignment and task-oriented fine-tuning, for large model-based token communication. Meanwhile, token compression is performed using a sliding window pooling operation to save communication resources. To balance the trade-off between latency and model performance caused by compression, we formulate a weighted-sum optimization problem over latency and validation loss. We jointly optimizes bandwidth, power allocation, and token length across users by using an alternating optimization method. Simulation results demonstrate that the proposed algorithm outperforms the baseline under different bandwidth and power budgets. Moreover, the two-stage training algorithm achieves higher accuracy across various signal-to-noise ratios than the method without cross-modal alignment.
♻ ☆ Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules NeurIPS 2025
Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel intrinsic-time viewpoint, which captures the training progress more faithfully than iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule's influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs -- constant, exponential decay, and warmup-stable-decay (WSD) -- and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.
comment: 60 pages, accepted by NeurIPS 2025 as a spotlight paper
♻ ☆ Flat Channels to Infinity in Neural Loss Landscapes NeurIPS'25
The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm$infinity, and their input weight vectors, $\mathbf{w_i}$ and $\mathbf{w_j}$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w_i} \cdot \mathbf{x}) + a_j\sigma(\mathbf{w_j} \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) \sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
comment: Accepted to NeurIPS'25
♻ ☆ Scalable Multi-Task Learning for Particle Collision Event Reconstruction with Heterogeneous Graph Neural Networks
The growing luminosity frontier at the Large Hadron Collider is challenging the reconstruction and analysis of particle collision events. Increased particle multiplicities are straining latency and storage requirements at the data acquisition stage, while new complications are emerging, including higher background levels and more frequent particle vertex misassociations. This in turn necessitates the development of more holistic and scalable reconstruction methods that take advantage of recent advances in machine learning. We propose a novel Heterogeneous Graph Neural Network (HGNN) architecture featuring unique representations for diverse particle collision relationships and integrated graph pruning layers for scalability. Trained with a multi-task paradigm in an environment mimicking the LHCb experiment, this HGNN significantly improves beauty hadron reconstruction performance. Notably, it concurrently performs particle vertex association and graph pruning within a single framework. We quantify reconstruction and pruning performance, demonstrate enhanced inference time scaling with event complexity, and mitigate potential performance loss using a weighted message passing scheme.
comment: 23 pages, 9 figures, 4 tables (revised for Machine Learning Science and Technology)
♻ ☆ Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models
We reinterpret Kant's Critique of Pure Reason as a theory of feedback stability, viewing reason as a regulator that keeps inference within the bounds of possible experience. We formalize this intuition via a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. In linear-Gaussian simulations, higher H-Risk predicts overconfident errors even under formal stability, revealing a gap between nominal and epistemic stability. Extending to large language models (LLMs), we observe preliminary correlations between internal fragility and miscalibration or hallucination (confabulation), and find that lightweight critique prompts may modestly improve or worsen calibration in small-scale tests. These results suggest a structural bridge between Kantian self-limitation and feedback control, offering a principled lens to diagnose and potentially mitigate overconfidence in reasoning systems.
comment: 21 pages, 2 figures, preliminary version
♻ ☆ Sharp Lower Bounds for Linearized ReLU^k Approximation on the Sphere
We prove a saturation theorem for linearized shallow ReLU$^k$ neural networks on the unit sphere $\mathbb S^d$. For any antipodally quasi-uniform set of centers, if the target function has smoothness $r>\tfrac{d+2k+1}{2}$, then the best $\mathcal{L}^2(\mathbb S^d)$ approximation cannot converge faster than order $n^{-\frac{d+2k+1}{2d}}$. This lower bound matches existing upper bounds, thereby establishing the exact saturation order $\tfrac{d+2k+1}{2d}$ for such networks. Our results place linearized neural-network approximation firmly within the classical saturation framework and show that, although ReLU$^k$ networks outperform finite elements under equal degrees $k$, this advantage is intrinsically limited.
♻ ☆ New Encoders for German Trained from Scratch: Comparing ModernGBERT with Converted LLM2Vec Models LREC
Encoders remain essential for efficient German NLP and NLU scenarios despite the rise of decoder-only LLMs. This work studies two routes to high-quality German encoders under identical data and training constraints: 1) training from scratch and 2) converting decoders via LLM2Vec. We introduce two resources: ModernGBERT (134M, 1B), fully transparent German encoders in the ModernBERT style, and LL\"aMmleinVec (120M, 1B, 7B), decoder-to-encoder conversions trained with masked next-token prediction, both undergoing a context extension to 8.192 tokens. Across SuperGLEBer, ModernGBERT 1B sets a new state of the art (avg 0.808), surpassing GBERT Large (+4%) and the seven-times larger converted 7B model (0.787). On German MTEB after supervised fine-tuning, ModernGBERT 1B (0.551) approaches the converted 7B model (0.557). We release all models, checkpoints, datasets, and full training records, and introduce an encoder-adapted QA-NIAH evaluation. All in all, our results provide actionable guidance: when parameter efficiency and latency matter, from-scratch encoders dominate. When a pre-trained decoder exists and compute is a limited, conversion offers an effective alternative. ModernGBERT and LL\"aMmleinVec, including all code, data and intermediary checkpoints are published under a research-only RAIL license.
comment: under review @LREC
♻ ☆ Diversity-Aware Policy Optimization for Large Language Model Reasoning
The reasoning capabilities of large language models (LLMs) have advanced rapidly, particularly following the release of DeepSeek R1, which has inspired a surge of research into data quality and reinforcement learning (RL) algorithms. Despite the pivotal role diversity plays in RL, its influence on LLM reasoning remains largely underexplored. To bridge this gap, this work presents a systematic investigation into the impact of diversity in RL-based training for LLM reasoning, and proposes a novel diversity-aware policy optimization method. Across evaluations on 12 LLMs, we observe a strong positive correlation between the solution diversity and Potential at k (a novel metric quantifying an LLM's reasoning potential) in high-performing models. This finding motivates our method to explicitly promote diversity during RL training. Specifically, we design a token-level diversity and reformulate it into a practical objective, then we selectively apply it to positive samples. Integrated into the R1-zero training framework, our method achieves a 3.5 percent average improvement across four mathematical reasoning benchmarks, while generating more diverse and robust solutions.
♻ ☆ Graph Neural Networks for Electricity Load Forecasting
Forecasting electricity demand is increasingly challenging as energy systems become more decentralized and intertwined with renewable sources. Graph Neural Networks (GNNs) have recently emerged as a powerful paradigm to model spatial dependencies in load data while accommodating complex non-stationarities. This paper introduces a comprehensive framework that integrates graph-based forecasting with attention mechanisms and ensemble aggregation strategies to enhance both predictive accuracy and interpretability. Several GNN architectures -- including Graph Convolutional Networks, GraphSAGE, APPNP, and Graph Attention Networks -- are systematically evaluated on synthetic, regional (France), and fine-grained (UK) datasets. Empirical results demonstrate that graph-aware models consistently outperform conventional baselines such as Feed Forward Neural Networks and foundation models like TiREX. Furthermore, attention layers provide valuable insights into evolving spatial interactions driven by meteorological and seasonal dynamics. Ensemble aggregation, particularly through bottom-up expert combination, further improves robustness under heterogeneous data conditions. Overall, the study highlights the complementarity between structural modeling, interpretability, and robustness, and discusses the trade-offs between accuracy, model complexity, and transparency in graph-based electricity load forecasting.
comment: 22 pages
♻ ☆ Rough Path Signatures: Learning Neural RDEs for Portfolio Optimization
We tackle high-dimensional, path-dependent valuation and control and introduce a deep BSDE/2BSDE solver that couples truncated log-signatures with a neural rough differential equation (RDE) backbone. The architecture aligns stochastic analysis with sequence-to-path learning: a CVaR-tilted terminal objective targets left-tail risk, while an optional second-order (2BSDE) head supplies curvature estimates for risk-sensitive control. Under matched compute and parameter budgets, the method improves accuracy, tail fidelity, and training stability across Asian and barrier option pricing and portfolio control: at d=200 it achieves CVaR(0.99)=9.80% versus 12.00-13.10% for strong baselines, attains the lowest HJB residual (0.011), and yields the lowest RMSEs for Z and Gamma. Ablations over truncation depth, local windows, and tilt parameters confirm complementary gains from the sequence-to-path representation and the 2BSDE head. Taken together, the results highlight a bidirectional dialogue between stochastic analysis and modern deep learning: stochastic tools inform representations and objectives, while sequence-to-path models expand the class of solvable financial models at scale.
comment: Code available at: https://github.com/AliAtiah/SigRDE
♻ ☆ Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning NeurIPS 2025
Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to $\textbf{15.97%}$ in accuracy. The code is available at https://github.com/yaxinhou/CPG.
comment: The paper is accepted by NeurIPS 2025
♻ ☆ Neural Entropy NeurIPS 2025
We explore the connection between deep learning and information theory through the paradigm of diffusion models. A diffusion model converts noise into structured data by reinstating, imperfectly, information that is erased when data was diffused to noise. This information is stored in a neural network during training. We quantify this information by introducing a measure called neural entropy, which is related to the total entropy produced by diffusion. Neural entropy is a function of not just the data distribution, but also the diffusive process itself. Measurements of neural entropy on a few simple image diffusion models reveal that they are extremely efficient at compressing large ensembles of structured data.
comment: 29 pages + references, 18 figures. Camera-ready version from NeurIPS 2025
♻ ☆ Multi-Agent Regime-Conditioned Diffusion (MARCD) for CVaR-Constrained Portfolio Decisions
We examine whether regime-conditioned generative scenarios combined with a convex CVaR allocator improve portfolio decisions under regime shifts. We present MARCD, a generative-to-decision framework with: (i) a Gaussian HMM to infer latent regimes; (ii) a diffusion generator that produces regime-conditioned scenarios; (iii) signal extraction via blended, shrunk moments; and (iv) a governed CVaR epigraph quadratic program. Contributions: Within the Scenario stage we introduce a tail-weighted diffusion objective that up-weights low-quantile outcomes relevant for drawdowns and a regime-expert (MoE) denoiser whose gate increases with crisis posteriors; both are evaluated end-to-end through the allocator. Under strict walk-forward on liquid multi-asset ETFs (2005-2025), MARCD exhibits stronger scenario calibration and materially smaller drawdowns: MaxDD 9.3% versus 14.1% for BL (a 34% reduction) over 2020-2025 out-of-sample. The framework provides an auditable pipeline with explicit budget, box, and turnover constraints, demonstrating the value of decision-aware generative modeling in finance.
comment: Code available at: https://github.com/AliAtiah/MARCD
♻ ☆ NeuroDeX: Unlocking Diverse Support in Decompiling Deep Neural Network Executables
On-device deep learning models have extensive real world demands. Deep learning compilers efficiently compile models into executables for deployment on edge devices, but these executables may face the threat of reverse engineering. Previous studies have attempted to decompile DNN executables, but they face challenges in handling compilation optimizations and analyzing quantized compiled models. In this paper, we present NeuroDeX to unlock diverse support in decompiling DNN executables. NeuroDeX leverages the semantic understanding capabilities of LLMs along with dynamic analysis to accurately and efficiently perform operator type recognition, operator attribute recovery and model reconstruction. NeuroDeX can recover DNN executables into high-level models towards compilation optimizations, different architectures and quantized compiled models. We conduct experiments on 96 DNN executables across 12 common DNN models. Extensive experimental results demonstrate that NeuroDeX can decompile non-quantized executables into nearly identical high-level models. NeuroDeX can recover functionally similar high-level models for quantized executables, achieving an average top-1 accuracy of 72%. NeuroDeX offers a more comprehensive and effective solution compared to previous DNN executables decompilers.
♻ ☆ AI-Guided Molecular Simulations in VR: Exploring Strategies for Imitation Learning in Hyperdimensional Molecular Systems ECAI24
Molecular dynamics (MD) simulations are a crucial computational tool for researchers to understand and engineer molecular structure and function in areas such as drug discovery, protein engineering, and material design. Despite their utility, MD simulations are expensive, owing to the high dimensionality of molecular systems. Interactive molecular dynamics in virtual reality (iMD-VR) has recently emerged as a "human-in-the-loop" strategy for efficiently navigating hyper-dimensional molecular systems. By providing an immersive 3D environment that enables visualization and manipulation of real-time molecular simulations running on high-performance computing architectures, iMD-VR enables researchers to reach out and guide molecular conformational dynamics, in order to efficiently explore complex, high-dimensional molecular systems. Moreover, iMD-VR simulations generate rich datasets that capture human experts' spatial insight regarding molecular structure and function. This paper explores the use of researcher-generated iMD-VR datasets to train AI agents via imitation learning (IL). IL enables agents to mimic complex behaviours from expert demonstrations, circumventing the need for explicit programming or intricate reward design. In this article, we review IL across robotics and Multi-agents systems domains which are comparable to iMD-VR, and discuss how iMD-VR recordings could be used to train IL models to interact with MD simulations. We then illustrate the applications of these ideas through a proof-of-principle study where iMD-VR data was used to train a CNN network on a simple molecular manipulation task; namely, threading a small molecule through a nanotube pore. Finally, we outline future research directions and potential challenges of using AI agents to augment human expertise in navigating vast molecular conformational spaces.
comment: (First presented at the First Workshop on "eXtended Reality \& Intelligent Agents" (XRIA24) @ ECAI24, Santiago De Compostela (Spain), 20 October 2024)
♻ ☆ UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of pretraining compute and 1/10 of downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.
comment: Accepted to RSS 2025. Code is available at https://github.com/OpenDriveLab/UniVLA
♻ ☆ Differentiable Generalized Sliced Wasserstein Plans
Optimal Transport (OT) has attracted significant interest in the machine learning community, not only for its ability to define meaningful distances between probability distributions -- such as the Wasserstein distance -- but also for its formulation of OT plans. Its computational complexity remains a bottleneck, though, and slicing techniques have been developed to scale OT to large datasets. Recently, a novel slicing scheme, dubbed min-SWGG, lifts a single one-dimensional plan back to the original multidimensional space, finally selecting the slice that yields the lowest Wasserstein distance as an approximation of the full OT plan. Despite its computational and theoretical advantages, min-SWGG inherits typical limitations of slicing methods: (i) the number of required slices grows exponentially with the data dimension, and (ii) it is constrained to linear projections. Here, we reformulate min-SWGG as a bilevel optimization problem and propose a differentiable approximation scheme to efficiently identify the optimal slice, even in high-dimensional settings. We furthermore define its generalized extension for accommodating to data living on manifolds. Finally, we demonstrate the practical value of our approach in various applications, including gradient flows on manifolds and high-dimensional spaces, as well as a novel sliced OT-based conditional flow matching for image generation -- where fast computation of transport plans is essential.
comment: LC and RT have equal contribution
♻ ☆ ConTextTab: A Semantics-Aware Tabular In-Context Learner NeurIPS 2025
Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, recent advances such as TabPFN and TabICL have extended its use to larger datasets. Although current table-native ICL architectures are architecturally efficient and well-adapted to tabular data structures, their exclusive training on synthetic data limits their ability to fully leverage the rich semantics and world knowledge contained in real-world tabular data. At the other end of the spectrum, tabular ICL models based on pretrained large language models such as TabuLa-8B integrate deep semantic understanding and world knowledge but are only able to make use of a small amount of context due to inherent architectural limitations. With the aim to combine the best of both these worlds, we introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark. Code and model checkpoints are available at: https://github.com/SAP-samples/sap-rpt-1-oss.
comment: Accepted as spotlight at NeurIPS 2025
♻ ☆ Representation Consistency for Accurate and Coherent LLM Answer Aggregation NeurIPS 2025
Test-time scaling improves large language models' (LLMs) performance by allocating more compute budget during inference. To achieve this, existing methods often require intricate modifications to prompting and sampling strategies. In this work, we introduce representation consistency (RC), a test-time scaling method for aggregating answers drawn from multiple candidate responses of an LLM regardless of how they were generated, including variations in prompt phrasing and sampling strategy. RC enhances answer aggregation by not only considering the number of occurrences of each answer in the candidate response set, but also the consistency of the model's internal activations while generating the set of responses leading to each answer. These activations can be either dense (raw model activations) or sparse (encoded via pretrained sparse autoencoders). Our rationale is that if the model's representations of multiple responses converging on the same answer are highly variable, this answer is more likely to be the result of incoherent reasoning and should be down-weighted during aggregation. Importantly, our method only uses cached activations and lightweight similarity computations and requires no additional model queries. Through experiments with four open-source LLMs and four reasoning datasets, we validate the effectiveness of RC for improving task performance during inference, with consistent accuracy improvements (up to 4%) over strong test-time scaling baselines. We also show that consistency in the sparse activation signals aligns well with the common notion of coherent reasoning.
comment: Accepted at NeurIPS 2025. Camera-ready version
♻ ☆ A probabilistic view on Riemannian machine learning models for SPD matrices
The goal of this paper is to show how different machine learning tools on the Riemannian manifold $\mathcal{P}_d$ of Symmetric Positive Definite (SPD) matrices can be united under a probabilistic framework. For this, we will need several Gaussian distributions defined on $\mathcal{P}_d$. We will show how popular classifiers on $\mathcal{P}_d$ can be reinterpreted as Bayes Classifiers using these Gaussian distributions. These distributions will also be used for outlier detection and dimension reduction. By showing that those distributions are pervasive in the tools used on $\mathcal{P}_d$, we allow for other machine learning tools to be extended to $\mathcal{P}_d$.
♻ ☆ Geospatial Foundation Models to Enable Progress on Sustainable Development Goals
Foundation Models (FMs) are large-scale, pre-trained artificial intelligence (AI) systems that have revolutionized natural language processing and computer vision, and are now advancing geospatial analysis and Earth Observation (EO). They promise improved generalization across tasks, scalability, and efficient adaptation with minimal labeled data. However, despite the rapid proliferation of geospatial FMs, their real-world utility and alignment with global sustainability goals remain underexplored. We introduce SustainFM, a comprehensive benchmarking framework grounded in the 17 Sustainable Development Goals with extremely diverse tasks ranging from asset wealth prediction to environmental hazard detection. This study provides a rigorous, interdisciplinary assessment of geospatial FMs and offers critical insights into their role in attaining sustainability goals. Our findings show: (1) While not universally superior, FMs often outperform traditional approaches across diverse tasks and datasets. (2) Evaluating FMs should go beyond accuracy to include transferability, generalization, and energy efficiency as key criteria for their responsible use. (3) FMs enable scalable, SDG-grounded solutions, offering broad utility for tackling complex sustainability challenges. Critically, we advocate for a paradigm shift from model-centric development to impact-driven deployment, and emphasize metrics such as energy efficiency, robustness to domain shifts, and ethical considerations.
♻ ☆ Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it particularly well-for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
♻ ☆ Contextual Tokenization for Graph Inverted Indices
Retrieving graphs from a large corpus, that contain a subgraph isomorphic to a given query graph, is a core operation in many real-world applications. While recent multi-vector graph representations and scores based on set alignment and containment can provide accurate subgraph isomorphism tests, their use in retrieval remains limited by their need to score corpus graphs exhaustively. We introduce CORGII (Contextual Representation of Graphs for Inverted Indexing), a graph indexing framework in which, starting with a contextual dense graph representation, a differentiable discretization module computes sparse binary codes over a learned latent vocabulary. This text document-like representation allows us to leverage classic, highly optimized inverted indices, while supporting soft (vector) set containment scores. Pushing this paradigm further, we replace the classical, fixed impact weight of a `token' on a graph (such as TFIDF or BM25) with a data-driven, trainable impact weight. Finally, we explore token expansion to support multi-probing the index for smoother accuracy-efficiency tradeoffs. To our knowledge, CORGII is the first indexer of dense graph representations using discrete tokens mapping to efficient inverted lists. Extensive experiments show that CORGII provides better trade-offs between accuracy and efficiency, compared to several baselines.
♻ ☆ Deep Modularity Networks with Diversity-Preserving Regularization NeurIPS 2025
Graph clustering plays a crucial role in graph representation learning but often faces challenges in achieving feature-space diversity. While Deep Modularity Networks (DMoN) leverage modularity maximization and collapse regularization to ensure structural separation, they lack explicit mechanisms for feature-space separation, assignment dispersion, and assignment-confidence control. We address this limitation by proposing Deep Modularity Networks with Diversity-Preserving Regularization (DMoN-DPR), which introduces three novel regularization terms: distance-based for inter-cluster separation, variance-based for per-cluster assignment dispersion, and an assignment-entropy penalty with a small positive weight, encouraging more confident assignments gradually. Our method significantly enhances label-based clustering metrics on feature-rich benchmark datasets (paired two-tailed t-test, $p\leq0.05$), demonstrating the effectiveness of incorporating diversity-preserving regularizations in creating meaningful and interpretable clusters.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: New Perspectives in Graph Machine Learning (NPGML)
♻ ☆ VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation
In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point clouds feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6% on par with DP3 64.0% and far higher than DP 34.8%, while in real-world tasks, it reaches 87.9%, outperforming both DP3 67.5% and DP 11.2% by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.
♻ ☆ Transforming Hyperspectral Images Into Chemical Maps: A Novel End-to-End Deep Learning Approach
Current approaches to chemical map generation from hyperspectral images are based on models such as partial least squares (PLS) regression, generating pixel-wise predictions that do not consider spatial context and suffer from a high degree of noise. This study proposes an end-to-end deep learning approach using a modified version of U-Net and a custom loss function to directly obtain chemical maps from hyperspectral images, skipping all intermediate steps required for traditional pixel-wise analysis. This study compares the U-Net with the traditional PLS regression on a real dataset of pork belly samples with associated mean fat reference values. The U-Net obtains a test set root mean squared error that is 7% lower than that of PLS regression on the task of mean fat prediction. At the same time, U-Net generates fine detail chemical maps where 99.91% of the variance is spatially correlated. Conversely, only 2.37% of the variance in the PLS-generated chemical maps is spatially correlated, indicating that each pixel-wise prediction is largely independent of neighboring pixels. Additionally, while the PLS-generated chemical maps contain predictions far beyond the physically possible range of 0-100%, U-Net learns to stay inside this range. Thus, the findings of this study indicate that U-Net is superior to PLS for chemical map generation.
♻ ☆ Follow the Energy, Find the Path: Riemannian Metrics from Energy-Based Models
What is the shortest path between two data points lying in a high-dimensional space? While the answer is trivial in Euclidean geometry, it becomes significantly more complex when the data lies on a curved manifold -- requiring a Riemannian metric to describe the space's local curvature. Estimating such a metric, however, remains a major challenge in high dimensions. In this work, we propose a method for deriving Riemannian metrics directly from pretrained Energy-Based Models (EBMs) -- a class of generative models that assign low energy to high-density regions. These metrics define spatially varying distances, enabling the computation of geodesics -- shortest paths that follow the data manifold's intrinsic geometry. We introduce two novel metrics derived from EBMs and show that they produce geodesics that remain closer to the data manifold and exhibit lower curvature distortion, as measured by alignment with ground-truth trajectories. We evaluate our approach on increasingly complex datasets: synthetic datasets with known data density, rotated character images with interpretable geometry, and high-resolution natural images embedded in a pretrained VAE latent space. Our results show that EBM-derived metrics consistently outperform established baselines, especially in high-dimensional settings. Our work is the first to derive Riemannian metrics from EBMs, enabling data-aware geodesics and unlocking scalable, geometry-driven learning for generative modeling and simulation.
♻ ☆ Tight analyses of first-order methods with error feedback
Communication between agents often constitutes a major computational bottleneck in distributed learning. One of the most common mitigation strategies is to compress the information exchanged, thereby reducing communication overhead. To counteract the degradation in convergence associated with compressed communication, error feedback schemes -- most notably $\mathrm{EF}$ and $\mathrm{EF}^{21}$ -- were introduced. In this work, we provide a tight analysis of both of these methods. Specifically, we find the Lyapunov function that yields the best possible convergence rate for each method -- with matching lower bounds. This principled approach yields sharp performance guarantees and enables a rigorous, apples-to-apples comparison between $\mathrm{EF}$, $\mathrm{EF}^{21}$, and compressed gradient descent. Our analysis is carried out in the simplified single-agent setting, which allows for clean theoretical insights and fair comparison of the underlying mechanisms.
♻ ☆ Low-Rank Adaptation for Foundation Models: A Comprehensive Review
The rapid advancement of foundation modelslarge-scale neural networks trained on diverse, extensive datasetshas revolutionized artificial intelligence, enabling unprecedented advancements across domains such as natural language processing, computer vision, and scientific discovery. However, the substantial parameter count of these models, often reaching billions or trillions, poses significant challenges in adapting them to specific downstream tasks. Low-Rank Adaptation (LoRA) has emerged as a highly promising approach for mitigating these challenges, offering a parameter-efficient mechanism to fine-tune foundation models with minimal computational overhead. This survey provides the first comprehensive review of LoRA techniques beyond large Language Models to general foundation models, including recent techniques foundations, emerging frontiers and applications of low-rank adaptation across multiple domains. Finally, this survey discusses key challenges and future research directions in theoretical understanding, scalability, and robustness. This survey serves as a valuable resource for researchers and practitioners working with efficient foundation model adaptation.
♻ ☆ ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models
Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to 665x over naive sequential application, allowing training nonlinear RNNs at unprecedented scales. To showcase this, we apply ParaRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformers and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the ParaRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.
♻ ☆ Inducing Riesz and orthonormal bases in $L^2$ via composition operators
Let $C_h$ be a composition operator mapping $L^2(\Omega_1)$ into $L^2(\Omega_2)$ for some open sets $\Omega_1, \Omega_2 \subseteq \mathbb{R}^n$. We characterize the mappings $h$ that transform Riesz bases of $L^2(\Omega_1)$ into Riesz bases of $L^2(\Omega_2)$. Restricting our analysis to differentiable mappings, we demonstrate that mappings $h$ that preserve Riesz bases have Jacobian determinants that are bounded away from zero and infinity. We discuss implications of these results for approximation theory, highlighting the potential of using bijective neural networks to construct Riesz bases with favorable approximation properties.
♻ ☆ What Can Be Recovered Under Sparse Adversarial Corruption? Assumption-Free Theory for Linear Measurements
Let $A \in \mathbb{R}^{m \times n}$ be an arbitrary, known matrix and $e$ a $q$-sparse adversarial vector. Given $y = A x^\star + e$ and $q$, we seek the smallest set containing $x^\star$--hence the one conveying maximal information about $x^\star$--that is uniformly recoverable from $y$ without knowing $e$. While exact recovery of $x^\star$ via strong (and often impractical) structural assumptions on $A$ or $x^\star$ (e.g., restricted isometry, sparsity) is well studied, recoverability for arbitrary $A$ and $x^\star$ remains open. Our main result shows that the best that one can hope to recover is $x^\star + \ker(U)$, where $U$ is the unique projection matrix onto the intersection of rowspaces of all possible submatrices of $A$ obtained by deleting $2q$ rows. Moreover, we prove that every $x$ that minimizes the $\ell_0$-norm of $y - A x$ lies in $x^\star + \ker(U)$, which then gives a constructive approach to recover this set.
comment: \copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
♻ ☆ A Self-Evolving AI Agent System for Climate Science
Scientific progress in Earth science depends on integrating data across the planet's interconnected spheres. However, the accelerating volume and fragmentation of multi-sphere knowledge and data have surpassed human analytical capacity. This creates a major bottleneck for discovery, especially in climate science. To address this challenge, we introduce EarthLink, the first self-evolving AI agent system designed as an interactive "copilot" for Earth scientists. Through natural language interaction, EarthLink automates the entire research workflow by integrating planning, code execution, data analysis, and physical reasoning into a unified process that directly addresses this limitation. Beyond efficiency, it exhibits human-like cross-disciplinary analytical ability and achieves proficiency comparable to a junior researcher in expert evaluations on core large-scale climate tasks, including model-observation comparison and climate change understanding. When tasked with an open scientific problem, specifically the discovery of precursors of the Atlantic Ni\~no, EarthLink autonomously developed a research strategy, identified sources of predictability, verified its hypotheses with available data, and proposed a physically consistent mechanism. These emerging capabilities enable a new human-AI research paradigm. Scientists can focus on value and result judgments, while AI systems handle complex data analysis and knowledge integration. This accelerates the pace and breadth of discovery in Earth sciences. The system is accessible at our website https://earthlink.intern-ai.org.cn.
♻ ☆ Localist LLMs -- A Mathematical Framework for Dynamic Locality Control
We present a novel framework for training large language models with continuously adjustable internal representations that span the full spectrum from localist (interpretable, rule-based) to distributed (generalizable, efficient) encodings. The key innovation is a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining. This is achieved through group sparsity penalties on attention mechanisms, information-theoretic anchor design, and dynamic rule injection. We provide rigorous mathematical proofs establishing explicit threshold conditions under which attention provably concentrates on semantically relevant blocks, with exponential bounds on attention entropy and pointer fidelity. Specifically, we prove that when group sparsity penalties exceed certain threshold values, the model's attention mechanisms concentrate on semantically relevant blocks, achieving low entropy and high fidelity with negligible error. This framework enables practitioners to continuously interpolate between interpretable and high-performance modes, supporting applications in regulated domains requiring both transparency and capability.
♻ ☆ SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning NeurIPS 2025
Deciphering how visual stimuli are transformed into cortical responses is a fundamental challenge in computational neuroscience. This visual-to-neural mapping is inherently a one-to-many relationship, as identical visual inputs reliably evoke variable hemodynamic responses across trials, contexts, and subjects. However, existing deterministic methods struggle to simultaneously model this biological variability while capturing the underlying functional consistency that encodes stimulus information. To address these limitations, we propose SynBrain, a generative framework that simulates the transformation from visual semantics to neural responses in a probabilistic and biologically interpretable manner. SynBrain introduces two key components: (i) BrainVAE models neural representations as continuous probability distributions via probabilistic learning while maintaining functional consistency through visual semantic constraints; (ii) A Semantic-to-Neural Mapper acts as a semantic transmission pathway, projecting visual semantics into the neural response manifold to facilitate high-fidelity fMRI synthesis. Experimental results demonstrate that SynBrain surpasses state-of-the-art methods in subject-specific visual-to-fMRI encoding performance. Furthermore, SynBrain adapts efficiently to new subjects with few-shot data and synthesizes high-quality fMRI signals that are effective in improving data-limited fMRI-to-image decoding performance. Beyond that, SynBrain reveals functional consistency across trials and subjects, with synthesized signals capturing interpretable patterns shaped by biological neural variability. Our code is available at https://github.com/MichaelMaiii/SynBrain.
comment: Accepted by NeurIPS 2025
♻ ☆ Bellman Diffusion Models
Diffusion models have seen tremendous success as generative architectures. Recently, they have been shown to be effective at modelling policies for offline reinforcement learning and imitation learning. We explore using diffusion as a model class for the successor state measure (SSM) of a policy. We find that enforcing the Bellman flow constraints leads to a simple Bellman update on the diffusion step distribution.
♻ ☆ MARFT: Multi-Agent Reinforcement Fine-Tuning
LLM-based Multi-Agent Systems have demonstrated remarkable capabilities in addressing complex, agentic tasks, from generating high-quality presentation slides to even conducting sophisticated scientific research. Meanwhile, RL has been widely recognized for its effectiveness in enhancing agent intelligence, but limited research has investigated the fine-tuning of LaMAS using foundational RL techniques. Moreover, the direct application of MARL methods to LaMAS introduces significant challenges, stemming from the unique characteristics and mechanisms inherent to LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes a novel paradigm termed Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce a brand-new MG called Flex-MG, which aligns with the LaMAS optimization in real-world applications and a universal algorithmic framework tailored specifically for LaMAS, outlining the conceptual foundations, key distinctions, and practical implementation strategies. We review the evolution from RL to RFT, setting the stage for a parallel analysis in the multi-agent domain. In the context of LaMAS, we elucidate critical differences between MARL and MARFT. These differences motivate a transition toward a LaMAS-oriented formulation of RFT. Central to this work is a robust and scalable MARFT framework. We detail the core algorithm and provide a complete, open-source implementation to facilitate adoption and further research. The latter sections of the paper explore real-world application perspectives and opening challenges in MARFT. By bridging theoretical underpinnings with practical methodologies, this work serves as a roadmap for researchers seeking to advance MARFT toward resilient and adaptive solutions in agentic systems. Our implementation of the proposed framework is publicly available at: https://github.com/jwliao-ai/MARFT.
comment: 42 pages
♻ ☆ Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning NeurIPS 2025
Motivated by practical applications where stable long-term performance is critical-such as robotics, operations research, and healthcare-we study the problem of distributionally robust (DR) average-reward reinforcement learning. We propose two algorithms that achieve near-optimal sample complexity. The first reduces the problem to a DR discounted Markov decision process (MDP), while the second, Anchored DR Average-Reward MDP, introduces an anchoring state to stabilize the controlled transition kernels within the uncertainty set. Assuming the nominal MDP is uniformly ergodic, we prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}| t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$ for estimating the optimal policy as well as the robust average reward under KL and $f_k$-divergence-based uncertainty sets, provided the uncertainty radius is sufficiently small. Here, $\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denote the sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixing time of the nominal MDP. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning. We further validate the convergence rates of our algorithms through numerical experiments.
comment: Accepted at NeurIPS 2025. Updated with minor corrections and additional experiments
♻ ☆ Language-Driven Coordination and Learning in Multi-Agent Simulation Environments
This paper introduces LLM-MARL, a unified framework that incorporates large language models (LLMs) into multi-agent reinforcement learning (MARL) to enhance coordination, communication, and generalization in simulated game environments. The framework features three modular components of Coordinator, Communicator, and Memory, which dynamically generate subgoals, facilitate symbolic inter-agent messaging, and support episodic recall. Training combines PPO with a language-conditioned loss and LLM query gating. LLM-MARL is evaluated in Google Research Football, MAgent Battle, and StarCraft II. Results show consistent improvements over MAPPO and QMIX in win rate, coordination score, and zero-shot generalization. Ablation studies demonstrate that subgoal generation and language-based messaging each contribute significantly to performance gains. Qualitative analysis reveals emergent behaviors such as role specialization and communication-driven tactics. By bridging language modeling and policy learning, this work contributes to the design of intelligent, cooperative agents in interactive simulations. It offers a path forward for leveraging LLMs in multi-agent systems used for training, games, and human-AI collaboration.
♻ ☆ DeepHQ: Learned Hierarchical Quantizer for Progressive Deep Image Coding
Unlike fixed- or variable-rate image coding, progressive image coding (PIC) aims to compress various qualities of images into a single bitstream, increasing the versatility of bitstream utilization and providing high compression efficiency compared to simulcast compression. Research on neural network (NN)-based PIC is in its early stages, mainly focusing on applying varying quantization step sizes to the transformed latent representations in a hierarchical manner. These approaches are designed to compress only the progressively added information as the quality improves, considering that a wider quantization interval for lower-quality compression includes multiple narrower sub-intervals for higher-quality compression. However, the existing methods are based on handcrafted quantization hierarchies, resulting in sub-optimal compression efficiency. In this paper, we propose an NN-based progressive coding method that firstly utilizes learned quantization step sizes via learning for each quantization layer. We also incorporate selective compression with which only the essential representation components are compressed for each quantization layer. We demonstrate that our method achieves significantly higher coding efficiency than the existing approaches with decreased decoding time and reduced model size. The source code is publicly available at https://github.com/JooyoungLeeETRI/DeepHQ
comment: Accepted to ACM TOMM (2025)
♻ ☆ Flight Delay Prediction via Cross-Modality Adaptation of Large Language Models and Aircraft Trajectory Representation
Flight delay prediction has become a key focus in air traffic management, as delays highlight inefficiencies that impact overall network performance. This paper presents a lightweight large language model-based multimodal flight delay prediction, formulated from the perspective of air traffic controllers monitoring aircraft delay after entering the terminal area. The approach integrates trajectory representations with textual aeronautical information, including flight information, weather reports, and aerodrome notices, by adapting trajectory data into the language modality to capture airspace conditions. The experiments show that the model consistently achieves sub-minute prediction error by effectively leveraging contextual information related to the sources of delay, fulfilling the operational standard for minute-level precision. The framework demonstrates that linguistic understanding, when combined with cross-modality adaptation of trajectory data, enhances delay prediction. Moreover, the approach shows practicality and potential scalability for real-world operations, supporting real-time updates that refine predictions upon receiving new operational information.
comment: Preprint submitted to Aerospace Science and Technology (Elsevier) for possible publication
♻ ☆ FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design
Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization reduces these costs, they often degrade accuracy or lack optimal efficiency. INT6 quantization offers a superior trade-off between model accuracy and inference efficiency, but lacks hardware support in modern GPUs, forcing emulation via higher-precision arithmetic units that limit acceleration. In this paper, we propose FlexQ, a novel post-training INT6 quantization framework combining algorithmic innovation with system-level optimizations. FlexQ employs uniform 6-bit weight quantization across all layers, with adaptive retention of 8-bit activations in layers identified through layer-wise sensitivity analysis. To maximize hardware efficiency, we develop a specialized high-performance GPU kernel supporting matrix multiplication for W6A6 and W6A8 representations via Binary Tensor Core (BTC) equivalents, effectively bypassing the lack of native INT6 tensor cores. Evaluations on LLaMA family models show FlexQ maintains near-FP16 accuracy, with perplexity increases of no more than 0.1 on WikiText2. The proposed kernel achieves an average 1.39$\times$ speedup over ABQ-LLM on LLaMA-2-70B linear layers. End-to-end, FlexQ delivers 1.33$\times$ inference acceleration and 1.21$\times$ memory savings over SmoothQuant. Code is released at https://github.com/FlyFoxPlayer/FlexQ.
♻ ☆ Amortized Active Generation of Pareto Sets NeurIPS 2025
We introduce active generation of Pareto sets (A-GPS), a new framework for online discrete black-box multi-objective optimization (MOO). A-GPS learns a generative model of the Pareto set that supports a-posteriori conditioning on user preferences. The method employs a class probability estimator (CPE) to predict non-dominance relations and to condition the generative model toward high-performing regions of the search space. We also show that this non-dominance CPE implicitly estimates the probability of hypervolume improvement (PHVI). To incorporate subjective trade-offs, A-GPS introduces preference direction vectors that encode user-specified preferences in objective space. At each iteration, the model is updated using both Pareto membership and alignment with these preference directions, producing an amortized generative model capable of sampling across the Pareto front without retraining. The result is a simple yet powerful approach that achieves high-quality Pareto set approximations, avoids explicit hypervolume computation, and flexibly captures user preferences. Empirical results on synthetic benchmarks and protein design tasks demonstrate strong sample efficiency and effective preference incorporation.
comment: Appears in the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Complex QA and language models hybrid architectures, Survey
This paper reviews the state-of-the-art of large language models (LLM) architectures and strategies for "complex" question-answering with a focus on hybrid architectures. LLM based chatbot services have allowed anyone to grasp the potential of LLM to solve many common problems, but soon discovered their limitations for complex questions. Addressing more specific, complex questions (e.g., "What is the best mix of power-generation methods to reduce climate change ?") often requires specialized architectures, domain knowledge, new skills, decomposition and multi-step resolution, deep reasoning, sensitive data protection, explainability, and human-in-the-loop processes. Therefore, we review: (1) necessary skills and tasks for handling complex questions and common LLM limits to overcome; (2) dataset, cost functions and evaluation metrics for measuring and improving (e.g. accuracy, explainability, fairness, robustness, groundedness, faithfulness, toxicity...); (3) family of solutions to overcome LLM limitations by (a) training and reinforcement (b) hybridization, (c) prompting, (d) agentic-architectures (agents, tools) and extended reasoning.
♻ ☆ MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification
Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and fine-tunes with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares interactions between learnable prompts with individual image patches and groups of them. This approach improves the model's ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; thereby, we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP.
comment: Published in Transactions on Machine Learning Research (09/2025)
♻ ☆ Riemannian Consistency Model NeurIPS 2025
Consistency models are a class of generative models that enable few-step generation for diffusion and flow matching models. While consistency models have achieved promising results on Euclidean domains like images, their applications to Riemannian manifolds remain challenging due to the curved geometry. In this work, we propose the Riemannian Consistency Model (RCM), which, for the first time, enables few-step consistency modeling while respecting the intrinsic manifold constraint imposed by the Riemannian geometry. Leveraging the covariant derivative and exponential-map-based parameterization, we derive the closed-form solutions for both discrete- and continuous-time training objectives for RCM. We then demonstrate theoretical equivalence between the two variants of RCM: Riemannian consistency distillation (RCD) that relies on a teacher model to approximate the marginal vector field, and Riemannian consistency training (RCT) that utilizes the conditional vector field for training. We further propose a simplified training objective that eliminates the need for the complicated differential calculation. Finally, we provide a unique kinematics perspective for interpreting the RCM objective, offering new theoretical angles. Through extensive experiments, we manifest the superior generative quality of RCM in few-step generation on various non-Euclidean manifolds, including flat-tori, spheres, and the 3D rotation group SO(3).
comment: Accepted to NeurIPS 2025
♻ ☆ Scientific Machine Learning with Kolmogorov-Arnold Networks
The field of scientific machine learning, which originally utilized multilayer perceptrons (MLPs), is increasingly adopting Kolmogorov-Arnold Networks (KANs) for data encoding. This shift is driven by the limitations of MLPs, including poor interpretability, fixed activation functions, and difficulty capturing localized or high-frequency features. KANs address these issues with enhanced interpretability and flexibility, enabling more efficient modeling of complex nonlinear interactions and effectively overcoming the constraints associated with conventional MLP architectures. This review categorizes recent progress in KAN-based models across three distinct perspectives: (i) data-driven learning, (ii) physics-informed modeling, and (iii) deep-operator learning. Each perspective is examined through the lens of architectural design, training strategies, application efficacy, and comparative evaluation against MLP-based counterparts. By benchmarking KANs against MLPs, we highlight consistent improvements in accuracy, convergence, and spectral representation, clarifying KANs' advantages in capturing complex dynamics while learning more effectively. In addition to reviewing recent literature, this work also presents several comparative evaluations that clarify central characteristics of KAN modeling and hint at their potential implications for real-world applications. Finally, this review identifies critical challenges and open research questions in KAN development, particularly regarding computational efficiency, theoretical guarantees, hyperparameter tuning, and algorithm complexity. We also outline future research directions aimed at improving the robustness, scalability, and physical consistency of KAN-based frameworks.
♻ ☆ Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
♻ ☆ Bridging Symmetry and Robustness: On the Role of Equivariance in Enhancing Adversarial Robustness NeurIPS 2025
Adversarial examples reveal critical vulnerabilities in deep neural networks by exploiting their sensitivity to imperceptible input perturbations. While adversarial training remains the predominant defense strategy, it often incurs significant computational cost and may compromise clean-data accuracy. In this work, we investigate an architectural approach to adversarial robustness by embedding group-equivariant convolutions-specifically, rotation- and scale-equivariant layers-into standard convolutional neural networks (CNNs). These layers encode symmetry priors that align model behavior with structured transformations in the input space, promoting smoother decision boundaries and greater resilience to adversarial attacks. We propose and evaluate two symmetry-aware architectures: a parallel design that processes standard and equivariant features independently before fusion, and a cascaded design that applies equivariant operations sequentially. Theoretically, we demonstrate that such models reduce hypothesis space complexity, regularize gradients, and yield tighter certified robustness bounds under the CLEVER (Cross Lipschitz Extreme Value for nEtwork Robustness) framework. Empirically, our models consistently improve adversarial robustness and generalization across CIFAR-10, CIFAR-100, and CIFAR-10C under both FGSM and PGD attacks, without requiring adversarial training. These findings underscore the potential of symmetry-enforcing architectures as efficient and principled alternatives to data augmentation-based defenses.
comment: Accepted for the proceedings of 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Learning Nonholonomic Dynamics with Constraint Discovery
We consider learning nonholonomic dynamical systems while discovering the constraints, and describe in detail the case of the rolling disk. A nonholonomic system is a system subject to nonholonomic constraints. Unlike holonomic constraints, nonholonomic constraints do not define a sub-manifold on the configuration space. Therefore, the inverse problem of finding the constraints has to involve the tangent bundle. This paper discusses a general procedure to learn the dynamics of a nonholonomic system through Hamel's formalism, while discovering the system constraint by parameterizing it, given the data set of discrete trajectories on the tangent bundle $TQ$. We prove that there is a local minimum for convergence of the network. We also preserve symmetry of the system by reducing the Lagrangian to the Lie algebra of the selected group.
♻ ☆ Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the proposed SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
comment: v2 note: Corrected numerical values in Table 2 and Figure 4 due to a minor calculation error in v1. The overall conclusions remain unchanged. Code and models available at: https://github.com/tianciB/VFM-VAE
♻ ☆ Computational Basis of LLM's Decision Making in Social Simulation
Large language models (LLMs) increasingly serve as human-like decision-making agents in social science and applied settings. These LLM-agents are typically assigned human-like characters and placed in real-life contexts. However, how these characters and contexts shape an LLM's behavior remains underexplored. This study proposes and tests methods for probing, quantifying, and modifying an LLM's internal representations in a Dictator Game -- a classic behavioral experiment on fairness and prosocial behavior. We extract "vectors of variable variations" (e.g., "male" to "female") from the LLM's internal state. Manipulating these vectors during the model's inference can substantially alter how those variables relate to the model's decision-making. This approach offers a principled way to study and regulate how social concepts can be encoded and engineered within transformer-based models, with implications for alignment, debiasing, and designing AI agents for social simulations in both academic and commercial applications, strengthening sociological theory and measurement.
♻ ☆ Damper-B-PINN: Damper Characteristics-Based Bayesian Physics-Informed Neural Network for Vehicle State Estimation
Accurate state estimation is fundamental to intelligent vehicles. Wheel load, one of the most important chassis states, serves as an essential input for advanced driver assistance systems (ADAS) and exerts a direct influence on vehicle stability and safety. However, wheel load estimation remains challenging due to the complexity of chassis modeling and the susceptibility of nonlinear systems to noise. To address these issues, this paper first introduces a refined suspension linkage-level modeling approach that constructs a nonlinear instantaneous dynamic model by explicitly considering the complex geometric structure of the suspension. Building upon this, we propose a damper characteristics-based Bayesian physics-informed neural network (Damper-B-PINN) framework to estimate dynamic wheel load, which leverages the suspension dynamics as physical guidance of PINN while employing Bayesian inference to mitigate the effects of system noise and uncertainty. Moreover, a damper-characteristic physics conditioning (DPC) module is designed for embedding physical prior. The proposed Damper-B-PINN is evaluated using both high-fidelity simulation datasets generated by CarSim software and real-world datasets collected from a Formula Student race car. Experimental results demonstrate that our Damper-B-PINN consistently outperforms existing methods across various test conditions, particularly extreme ones. These findings highlight the potential of the proposed Damper-B-PINN framework to enhance the accuracy and robustness of dynamic wheel load estimation, thereby improving the reliability and safety of ADAS applications.
♻ ☆ Provable Generalization Bounds for Deep Neural Networks with Momentum-Adaptive Gradient Dropout
Deep neural networks (DNNs) achieve remarkable performance but often suffer from overfitting due to their high capacity. We introduce Momentum-Adaptive Gradient Dropout (MAGDrop), a novel regularization method that dynamically adjusts dropout rates on activations based on current gradients and accumulated momentum, enhancing stability in non-convex optimization landscapes. To theoretically justify MAGDrop's effectiveness, we derive a non-asymptotic, computable PAC-Bayes generalization bound that accounts for its adaptive nature, achieving up to 29.2\% tighter bounds compared to standard approaches by leveraging momentum-driven perturbation control. Empirically, the activation-based MAGDrop achieves competitive performance on MNIST (99.52\%) and CIFAR-10 (92.03\%), with generalization gaps of 0.48\% and 6.52\%, respectively. We provide fully reproducible code and numerical computation of our bounds to validate our theoretical claims. Our work bridges theoretical insights and practical advancements, offering a robust framework for enhancing DNN generalization, making it suitable for high-stakes applications.
comment: 8 pages
♻ ☆ Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound
Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully-supervised learning, weakly-supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. In this study, we introduce a novel multi-agent reinforcement learning-based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel-based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents' erasing process precisely, thereby avoiding both under- and over-segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in-house BUS and ABUS datasets, our Flip Learning method outperforms state-of-the-art WSS methods and foundation models, and achieves comparable performance as fully-supervised learning algorithms.
comment: Accepted by Medical Image Analysis. 24 pages, 13 figures, 20 tabels
♻ ☆ One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra NeurIPS-2025
A common approach to the de novo molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt MIST (Goldman et. al., 2023) as the encoder and MolForge (Ucak et. al., 2023) as the decoder, leveraging additional training data to enhance performance. We also threshold the probabilities of each fingerprint bit to focus on the presence of substructures. This results in a tenfold improvement over previous state-of-the-art methods, generating top-1 31% / top-10 40% of molecular structures correctly from mass spectra in MassSpecGym (Bushuiev et. al., 2024). We position this as a strong baseline for future research in de novo molecule elucidation from mass spectra.
comment: Accepted at AI4Mat-NeurIPS-2025 Workshop
♻ ☆ Neighboring State-based Exploration for Reinforcement Learning
Reinforcement Learning is a powerful tool to model decision-making processes. However, it relies on an exploration-exploitation trade-off that remains an open challenge for many tasks. In this work, we study neighboring state-based, model-free exploration led by the intuition that, for an early-stage agent, considering actions derived from a bounded region of nearby states may lead to better actions when exploring. We propose two algorithms that choose exploratory actions based on a survey of nearby states, and find that one of our methods, ${\rho}$-explore, consistently outperforms the Double DQN baseline in an discrete environment by 49% in terms of Eval Reward Return.
♻ ☆ MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation EMNLP 2025
Multilingual speech translation (ST) and machine translation (MT) in the medical domain enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present the first systematic study on medical ST, to our best knowledge, by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, and Simplified/Traditional Chinese, together with the models. With 290,000 samples, this is the largest medical MT dataset and the largest many-to-many multilingual ST among all domains. Secondly, we present the most comprehensive ST analysis in the field's history, to our best knowledge, including: empirical baselines, bilingual-multilingual comparative study, end-to-end vs. cascaded comparative study, task-specific vs. multi-task sequence-to-sequence comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: https://github.com/leduckhai/MultiMed-ST
comment: EMNLP 2025
♻ ☆ A Generalized Bisimulation Metric of State Similarity between Markov Decision Processes: From Theoretical Propositions to Applications NeurIPS 2025
The bisimulation metric (BSM) is a powerful tool for computing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to multiple-MDP scenarios, such as policy transfer, remains challenging. Prior work has attempted to generalize BSM to pairs of MDPs, but a lack of rigorous analysis of its mathematical properties has limited further theoretical progress. In this work, we formally establish a generalized bisimulation metric (GBSM) between pairs of MDPs, which is rigorously proven with the three fundamental properties: GBSM symmetry, inter-MDP triangle inequality, and the distance bound on identical state spaces. Leveraging these properties, we theoretically analyse policy transfer, state aggregation, and sampling-based estimation in MDPs, obtaining explicit bounds that are strictly tighter than those derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.
comment: This paper is accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Trustworthy AI Must Account for Interactions ICLR 2025
Trustworthy AI encompasses many aspirational aspects for aligning AI systems with human values, including fairness, privacy, robustness, explainability, and uncertainty quantification. Ultimately the goal of Trustworthy AI research is to achieve all aspects simultaneously. However, efforts to enhance one aspect often introduce unintended trade-offs that negatively impact others. In this position paper, we review notable approaches to these five aspects and systematically consider every pair, detailing the negative interactions that can arise. For example, applying differential privacy to model training can amplify biases, undermining fairness. Drawing on these findings, we take the position that current research practices of improving one or two aspects in isolation are insufficient. Instead, research on Trustworthy AI must account for interactions between aspects and adopt a holistic view across all relevant axes at once. To illustrate our perspective, we provide guidance on how practitioners can work towards integrated trust, examples of how interactions affect the financial industry, and alternative views.
comment: Presented at the ICLR 2025 Workshop on Bidirectional Human-AI Alignment
♻ ☆ Dataset Distillation for Offline Reinforcement Learning ICML 2024
Offline reinforcement learning often requires a quality dataset that we can train a policy on. However, in many situations, it is not possible to get such a dataset, nor is it easy to train a policy to perform well in the actual environment given the offline data. We propose using data distillation to train and distill a better dataset which can then be used for training a better policy model. We show that our method is able to synthesize a dataset where a model trained on it achieves similar performance to a model trained on the full dataset or a model trained using percentile behavioral cloning. Our project site is available at https://datasetdistillation4rl.github.io . We also provide our implementation at https://github.com/ggflow123/DDRL .
comment: ICML 2024 DMLR Workshop Our project site is available at https://datasetdistillation4rl.github.io We also provide our implementation at https://github.com/ggflow123/DDRL
♻ ☆ Where and How to Perturb: On the Design of Perturbation Guidance in Diffusion and Flow Models NeurIPS 2025
Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
comment: Accepted at NeurIPS 2025. Project page: https://cvlab-kaist.github.io/HeadHunter/
♻ ☆ A Basic Evaluation of Neural Networks Trained with the Error Diffusion Learning Algorithm
This paper presents a comprehensive formulation of Kaneko's Error Diffusion Learning Algorithm (EDLA) and evaluates its effectiveness across parity check, regression, and image classification tasks. EDLA is a biologically inspired learning algorithm that provides an alternative to conventional backpropagation for training artificial neural networks. EDLA employs a single global error signal that diffuses across networks composed of paired positive and negative sublayers, eliminating traditional layer-wise error backpropagation. This study evaluates EDLA's effectiveness using benchmark tasks, such as parity check, regression, and image classification, by systematically varying the neuron count, network depth, and learning rates to assess its performance comprehensively. The experimental results demonstrate that EDLA achieves consistently high accuracy across multiple benchmarks, highlighting its effectiveness as a learning algorithm for neural networks. The choice of learning rate, neuron count, and network depth significantly influences EDLA's efficiency and convergence speed. Analysis of internal network representations reveals meaningful feature extraction capabilities, and the network's overall performance is found to be competitive with networks trained via conventional backpropagation, especially in shallow architectures. This study introduces EDLA, a biologically plausible alternative to traditional backpropagation previously underrecognized due to language barriers. By reformulating EDLA, systematically evaluating its performance, and presenting empirical evidence of its effectiveness, this study increases the visibility and accessibility of EDLA and contributes to biologically inspired training methodologies.
♻ ☆ Large Stepsizes Accelerate Gradient Descent for Regularized Logistic Regression NeurIPS 2025
We study gradient descent (GD) with a constant stepsize for $\ell_2$-regularized logistic regression with linearly separable data. Classical theory suggests small stepsizes to ensure monotonic reduction of the optimization objective, achieving exponential convergence in $\widetilde{\mathcal{O}}(\kappa)$ steps with $\kappa$ being the condition number. Surprisingly, we show that this can be accelerated to $\widetilde{\mathcal{O}}(\sqrt{\kappa})$ by simply using a large stepsize -- for which the objective evolves nonmonotonically. The acceleration brought by large stepsizes extends to minimizing the population risk for separable distributions, improving on the best-known upper bounds on the number of steps to reach a near-optimum. Finally, we characterize the largest stepsize for the local convergence of GD, which also determines the global convergence in special scenarios. Our results extend the analysis of Wu et al. (2024) from convex settings with minimizers at infinity to strongly convex cases with finite minimizers.
comment: NeurIPS 2025 camera ready version
♻ ☆ Localized Kernel Projection Outlyingness: A Two-Stage Approach for Multi-Modal Outlier Detection
This paper presents Two-Stage LKPLO, a novel multi-stage outlier detection framework that overcomes the coexisting limitations of conventional projection-based methods: their reliance on a fixed statistical metric and their assumption of a single data structure. Our framework uniquely synthesizes three key concepts: (1) a generalized loss-based outlyingness measure (PLO) that replaces the fixed metric with flexible, adaptive loss functions like our proposed SVM-like loss; (2) a global kernel PCA stage to linearize non-linear data structures; and (3) a subsequent local clustering stage to handle multi-modal distributions. Comprehensive 5-fold cross-validation experiments on 10 benchmark datasets, with automated hyperparameter optimization, demonstrate that Two-Stage LKPLO achieves state-of-the-art performance. It significantly outperforms strong baselines on datasets with challenging structures where existing methods fail, most notably on multi-cluster data (Optdigits) and complex, high-dimensional data (Arrhythmia). Furthermore, an ablation study empirically confirms that the synergistic combination of both the kernelization and localization stages is indispensable for its superior performance. This work contributes a powerful new tool for a significant class of outlier detection problems and underscores the importance of hybrid, multi-stage architectures.
comment: 10 pages, 4 figures; submitted to The IEICE Transactions on Information and Systems
♻ ☆ Generating Auxiliary Tasks with Reinforcement Learning
Auxiliary Learning (AL) is a form of multi-task learning in which a model trains on auxiliary tasks to boost performance on a primary objective. While AL has improved generalization across domains such as navigation, image classification, and NLP, it often depends on human-labeled auxiliary tasks that are costly to design and require domain expertise. Meta-learning approaches mitigate this by learning to generate auxiliary tasks, but typically rely on gradient based bi-level optimization, adding substantial computational and implementation overhead. We propose RL-AUX, a reinforcement-learning (RL) framework that dynamically creates auxiliary tasks by assigning auxiliary labels to each training example, rewarding the agent whenever its selections improve the performance on the primary task. We also explore learning per-example weights for the auxiliary loss. On CIFAR-100 grouped into 20 superclasses, our RL method outperforms human-labeled auxiliary tasks and matches the performance of a prominent bi-level optimization baseline. We present similarly strong results on other classification datasets. These results suggest RL is a viable path to generating effective auxiliary tasks.
♻ ☆ Feature compression is the root cause of adversarial fragility in neural network classifiers
In this paper, we uniquely study the adversarial robustness of deep neural networks (NN) for classification tasks against that of optimal classifiers. We look at the smallest magnitude of possible additive perturbations that can change a classifier's output. We provide a matrix-theoretic explanation of the adversarial fragility of deep neural networks for classification. In particular, our theoretical results show that a neural network's adversarial robustness can degrade as the input dimension $d$ increases. Analytically, we show that neural networks' adversarial robustness can be only $1/\sqrt{d}$ of the best possible adversarial robustness of optimal classifiers. Our theories match remarkably well with numerical experiments of practically trained NN, including NN for ImageNet images. The matrix-theoretic explanation is consistent with an earlier information-theoretic feature-compression-based explanation for the adversarial fragility of neural networks.
comment: new experiments
♻ ☆ MicroLad: 2D-to-3D Microstructure Reconstruction and Generation via Latent Diffusion and Score Distillation
A major obstacle to establishing reliable structure-property (SP) linkages in materials engineering is the scarcity of diverse 3D microstructure datasets. Limited dataset availability and insufficient control over the analysis and design space restrict the variety of achievable microstructure morphologies, hindering progress in solving the inverse (property-to-structure) design problem. To address these challenges, we introduce MicroLad, a latent diffusion framework specifically designed for reconstructing 3D microstructures from 2D data. Trained on 2D images and employing multi-plane denoising diffusion sampling in the latent space, the framework reliably generates stable and coherent 3D volumes that remain statistically consistent with the original data. While this reconstruction capability enables dimensionality expansion (2D-to-3D) for generating statistically equivalent 3D samples from 2D data, effective exploration of microstructure design requires methods to guide the generation process toward specific objectives. To achieve this, MicroLad integrates score distillation sampling (SDS), which combines a differentiable score loss with microstructural descriptor-matching and property-alignment terms. This approach updates encoded 2D slices of the 3D volume in the latent space, enabling robust inverse-controlled 2D-to-3D microstructure generation. Consequently, the method facilitates exploration of an expanded 3D microstructure analysis and design space in terms of both microstructural descriptors and material properties.
♻ ☆ A Compositional Kernel Model for Feature Learning
We study a compositional variant of kernel ridge regression in which the predictor is applied to a coordinate-wise reweighting of the inputs. Formulated as a variational problem, this model provides a simple testbed for feature learning in compositional architectures. From the perspective of variable selection, we show how relevant variables are recovered while noise variables are eliminated. We establish guarantees showing that both global minimizers and stationary points discard noise coordinates when the noise variables are Gaussian distributed. A central finding is that $\ell_1$-type kernels, such as the Laplace kernel, succeed in recovering features contributing to nonlinear effects at stationary points, whereas Gaussian kernels recover only linear ones.
comment: Fix Typos
♻ ☆ Action Chunking and Exploratory Data Collection Yield Exponential Improvements in Behavior Cloning for Continuous Control
This paper presents a theoretical analysis of two of the most impactful interventions in modern learning from demonstration in robotics and continuous control: the practice of action-chunking (predicting sequences of actions in open-loop) and exploratory augmentation of expert demonstrations. Though recent results show that learning from demonstration, also known as imitation learning (IL), can suffer errors that compound exponentially with task horizon in continuous settings, we demonstrate that action chunking and exploratory data collection circumvent exponential compounding errors in different regimes. Our results identify control-theoretic stability as the key mechanism underlying the benefits of these interventions. On the empirical side, we validate our predictions and the role of control-theoretic stability through experimentation on popular robot learning benchmarks. On the theoretical side, we demonstrate that the control-theoretic lens provides fine-grained insights into how compounding error arises, leading to tighter statistical guarantees on imitation learning error when these interventions are applied than previous techniques based on information-theoretic considerations alone.
comment: Updated manuscript. Added new experiments, figures, and exposition
♻ ☆ Long-Term Mapping of the Douro River Plume with Multi-Agent Reinforcement Learning
We study the problem of long-term (multiple days) mapping of a river plume using multiple autonomous underwater vehicles (AUVs), focusing on the Douro river representative use-case. We propose an energy - and communication - efficient multi-agent reinforcement learning approach in which a central coordinator intermittently communicates with the AUVs, collecting measurements and issuing commands. Our approach integrates spatiotemporal Gaussian process regression (GPR) with a multi-head Q-network controller that regulates direction and speed for each AUV. Simulations using the Delft3D ocean model demonstrate that our method consistently outperforms both single- and multi-agent benchmarks, with scaling the number of agents both improving mean squared error (MSE) and operational endurance. In some instances, our algorithm demonstrates that doubling the number of AUVs can more than double endurance while maintaining or improving accuracy, underscoring the benefits of multi-agent coordination. Our learned policies generalize across unseen seasonal regimes over different months and years, demonstrating promise for future developments of data-driven long-term monitoring of dynamic plume environments.
♻ ☆ Detection Augmented Bandit Procedures for Piecewise Stationary MABs: A Modular Approach
Conventional Multi-Armed Bandit (MAB) algorithms are designed for stationary environments, where the reward distributions associated with the arms do not change with time. In many applications, however, the environment is more accurately modeled as being non-stationary. In this work, piecewise stationary MAB (PS-MAB) environments are investigated, in which the reward distributions associated with a subset of the arms change at some change-points and remain stationary between change-points. Our focus is on the asymptotic analysis of PS-MABs, for which practical algorithms based on change detection have been previously proposed. Our goal is to modularize the design and analysis of such Detection Augmented Bandit (DAB) procedures. To this end, we first provide novel, improved performance lower bounds for PS-MABs. Then, we identify the requirements for stationary bandit algorithms and change detectors in a DAB procedure that are needed for the modularization. We assume that the rewards are sub-Gaussian. Under this assumption and a condition on the separation of the change-points, we show that the analysis of DAB procedures can indeed be modularized, so that the regret bounds can be obtained in a unified manner for various combinations of change detectors and bandit algorithms. Through this analysis, we develop new modular DAB procedures that are order-optimal. Finally, we showcase the practical effectiveness of our modular DAB approach in our experiments, studying its regret performance compared to other methods and investigating its detection capabilities.
comment: 30 pages, 4 figures, 1 table, submitted to TIT
♻ ☆ Learning Low Rank Neural Representations of Hyperbolic Wave Dynamics from Data
We present a data-driven dimensionality reduction method that is well-suited for physics-based data representing hyperbolic wave propagation. The method utilizes a specialized neural network architecture called low rank neural representation (LRNR) inside a hypernetwork framework. The architecture is motivated by theoretical results that rigorously prove the existence of efficient representations for this wave class. We illustrate through archetypal examples that such an efficient low-dimensional representation of propagating waves can be learned directly from data through a combination of deep learning techniques. We observe that a low rank tensor representation arises naturally in the trained LRNRs, and that this reveals a new decomposition of wave propagation where each decomposed mode corresponds to interpretable physical features. Furthermore, we demonstrate that the LRNR architecture enables efficient inference via a compression scheme, which is a potentially important feature when deploying LRNRs in demanding performance regimes.
comment: 41 pages, 18 figures
♻ ☆ Revisiting Multivariate Time Series Forecasting with Missing Values
Missing values are common in real-world time series, and multivariate time series forecasting with missing values (MTSF-M) has become a crucial area of research for ensuring reliable predictions. To address the challenge of missing data, current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. However, this framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. In this paper, we conduct a systematic empirical study and reveal that imputation without direct supervision can corrupt the underlying data distribution and actively degrade prediction accuracy. To address this, we propose a paradigm shift that moves away from imputation and directly predicts from the partially observed time series. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle. CRIB combines a unified-variate attention mechanism with a consistency regularization scheme to learn robust representations that filter out noise introduced by missing values while preserving essential predictive signals. Comprehensive experiments on four real-world datasets demonstrate the effectiveness of CRIB, which predicts accurately even under high missing rates. Our code is available in https://github.com/Muyiiiii/CRIB.
♻ ☆ Closing the Intent-to-Behavior Gap via Fulfillment Priority Logic
Practitioners designing reinforcement learning policies face a fundamental challenge: translating intended behavioral objectives into representative reward functions. This challenge stems from behavioral intent requiring simultaneous achievement of multiple competing objectives, typically addressed through labor-intensive linear reward composition that yields brittle results. Consider the ubiquitous robotics scenario where performance maximization directly conflicts with energy conservation. Such competitive dynamics are resistant to simple linear reward combinations. In this paper, we present the concept of objective fulfillment upon which we build Fulfillment Priority Logic (FPL). FPL allows practitioners to define logical formula representing their intentions and priorities within multi-objective reinforcement learning. Our novel Balanced Policy Gradient algorithm leverages FPL specifications to achieve up to 500\% better sample efficiency compared to Soft Actor Critic. Notably, this work constitutes the first implementation of non-linear utility scalarization design, specifically for continuous control problems.
♻ ☆ AWARE, Beyond Sentence Boundaries: A Contextual Transformer Framework for Identifying Cultural Capital in STEM Narratives
Identifying cultural capital (CC) themes in student reflections can offer valuable insights that help foster equitable learning environments in classrooms. However, themes such as aspirational goals or family support are often woven into narratives, rather than appearing as direct keywords. This makes them difficult to detect for standard NLP models that process sentences in isolation. The core challenge stems from a lack of awareness, as standard models are pre-trained on general corpora, leaving them blind to the domain-specific language and narrative context inherent to the data. To address this, we introduce AWARE, a framework that systematically attempts to improve a transformer model's awareness for this nuanced task. AWARE has three core components: 1) Domain Awareness, adapting the model's vocabulary to the linguistic style of student reflections; 2) Context Awareness, generating sentence embeddings that are aware of the full essay context; and 3) Class Overlap Awareness, employing a multi-label strategy to recognize the coexistence of themes in a single sentence. Our results show that by making the model explicitly aware of the properties of the input, AWARE outperforms a strong baseline by 2.1 percentage points in Macro-F1 and shows considerable improvements across all themes. This work provides a robust and generalizable methodology for any text classification task in which meaning depends on the context of the narrative.
comment: The authors are withdrawing this version to correct issues identified in the experimental design and analysis. A revised and validated version will be submitted after further review
♻ ☆ AutoPDL: Automatic Prompt Optimization for LLM Agents
The performance of large language models (LLMs) depends on how they are prompted, with choices spanning both the high-level prompting pattern (e.g., Zero-Shot, CoT, ReAct, ReWOO) and the specific prompt content (instructions and few-shot demonstrations). Manually tuning this combination is tedious, error-prone, and specific to a given LLM and task. Therefore, this paper proposes AutoPDL, an automated approach to discovering good LLM agent configurations. Our approach frames this as a structured AutoML problem over a combinatorial space of agentic and non-agentic prompting patterns and demonstrations, using successive halving to efficiently navigate this space. We introduce a library implementing common prompting patterns using the PDL prompt programming language. AutoPDL solutions are human-readable, editable, and executable PDL programs that use this library. This approach also enables source-to-source optimization, allowing human-in-the-loop refinement and reuse. Evaluations across three tasks and seven LLMs (ranging from 3B to 70B parameters) show consistent accuracy gains ($9.21\pm15.46$ percentage points), up to 67.5pp, and reveal that selected prompting strategies vary across models and tasks.
comment: An earlier version of this paper was published in AutoML 2025 Methods Track. This version adds missing standard deviations in Table 1
♻ ☆ MediQ-GAN: Quantum-Inspired GAN for High Resolution Medical Image Generation
Machine learning-assisted diagnosis shows promise, yet medical imaging datasets are often scarce, imbalanced, and constrained by privacy, making data augmentation essential. Classical generative models typically demand extensive computational and sample resources. Quantum computing offers a promising alternative, but existing quantum-based image generation methods remain limited in scale and often face barren plateaus. We present MediQ-GAN, a quantum-inspired GAN with prototype-guided skip connections and a dual-stream generator that fuses classical and quantum-inspired branches. Its variational quantum circuits inherently preserve full-rank mappings, avoid rank collapse, and are theory-guided to balance expressivity with trainability. Beyond generation quality, we provide the first latent-geometry and rank-based analysis of quantum-inspired GANs, offering theoretical insight into their performance. Across three medical imaging datasets, MediQ-GAN outperforms state-of-the-art GANs and diffusion models. While validated on IBM hardware for robustness, our contribution is hardware-agnostic, offering a scalable and data-efficient framework for medical image generation and augmentation.
♻ ☆ Training Language Models to Reason Efficiently NeurIPS 2025
Scaling model size and training data has led to great advances in the performance of Large Language Models (LLMs). However, the diminishing returns of this approach necessitate alternative methods to improve model capabilities, particularly in tasks requiring advanced reasoning. Large reasoning models, which leverage long chain-of-thoughts, bring unprecedented breakthroughs in problem-solving capabilities but at a substantial deployment cost associated to longer generations. Reducing inference costs is crucial for the economic feasibility, user experience, and environmental sustainability of these models. In this work, we propose to train large reasoning models to reason efficiently. More precisely, we use reinforcement learning (RL) to train reasoning models to dynamically allocate inference-time compute based on task complexity. Our method incentivizes models to minimize unnecessary computational overhead while maintaining accuracy, thereby achieving substantial efficiency gains. It enables the derivation of a family of reasoning models with varying efficiency levels, controlled via a single hyperparameter. Experiments on two open-weight large reasoning models demonstrate significant reductions in inference cost while preserving most of the accuracy.
comment: NeurIPS 2025
♻ ☆ Expertise and confidence explain how social influence evolves along intellective tasks
Discovering the antecedents of individuals' influence in collaborative environments is an important, practical, and challenging problem. In this paper, we study interpersonal influence in small groups of individuals who collectively execute a sequence of intellective tasks. We observe that along an issue sequence with feedback, individuals with higher expertise and social confidence are accorded higher interpersonal influence. We also observe that low-performing individuals tend to underestimate their high-performing teammate's expertise. Based on these observations, we introduce three hypotheses and present empirical and theoretical support for their validity. We report empirical evidence on longstanding theories of transactive memory systems, social comparison, and confidence heuristics on the origins of social influence. We propose a cognitive dynamical model inspired by these theories to describe the process by which individuals adjust interpersonal influences over time. We demonstrate the model's accuracy in predicting individuals' influence and provide analytical results on its asymptotic behavior for the case with identically performing individuals. Lastly, we propose a novel approach using deep neural networks on a pre-trained text embedding model for predicting the influence of individuals. Using message contents, message times, and individual correctness collected during tasks, we are able to accurately predict individuals' self-reported influence over time. Extensive experiments verify the accuracy of the proposed models compared to baselines such as structural balance and reflected appraisal model. While the neural networks model is the most accurate, the dynamical model is the most interpretable for influence prediction.
♻ ☆ Relational Causal Discovery with Latent Confounders UAI 2025
Estimating causal effects from real-world relational data can be challenging when the underlying causal model and potential confounders are unknown. While several causal discovery algorithms exist for learning causal models with latent confounders from data, they assume that the data is independent and identically distributed (i.i.d.) and are not well-suited for learning from relational data. Similarly, existing relational causal discovery algorithms assume causal sufficiency, which is unrealistic for many real-world datasets. To address this gap, we propose RelFCI, a sound and complete causal discovery algorithm for relational data with latent confounders. Our work builds upon the Fast Causal Inference (FCI) and Relational Causal Discovery (RCD) algorithms and it defines new graphical models, necessary to support causal discovery in relational domains. We also establish soundness and completeness guarantees for relational d-separation with latent confounders. We present experimental results demonstrating the effectiveness of RelFCI in identifying the correct causal structure in relational causal models with latent confounders.
comment: 30 pages, 19 figures. Accepted for publication at the 41st Conference on Uncertainty in Artificial Intelligence (UAI 2025). Andrea Piras and Matteo Negro contributed equally to this work
♻ ☆ A Systematic Literature Review of Spatio-Temporal Graph Neural Network Models for Time Series Forecasting and Classification
In recent years, spatio-temporal graph neural networks (GNNs) have attracted considerable interest in the field of time series analysis, due to their ability to capture, at once, dependencies among variables and across time points. The objective of this systematic literature review is hence to provide a comprehensive overview of the various modeling approaches and application domains of GNNs for time series classification and forecasting. A database search was conducted, and 366 papers were selected for a detailed examination of the current state-of-the-art in the field. This examination is intended to offer to the reader a comprehensive review of proposed models, links to related source code, available datasets, benchmark models, and fitting results. All this information is hoped to assist researchers in their studies. To the best of our knowledge, this is the first and broadest systematic literature review presenting a detailed comparison of results from current spatio-temporal GNN models applied to different domains. In its final part, this review discusses current limitations and challenges in the application of spatio-temporal GNNs, such as comparability, reproducibility, explainability, poor information capacity, and scalability. This paper is complemented by a GitHub repository at https://github.com/FlaGer99/SLR-Spatio-Temporal-GNN.git providing additional interactive tools to further explore the presented findings.
♻ ☆ Variance-Bounded Evaluation of Entity-Centric AI Systems Without Ground Truth: Theory and Measurement
Reliable evaluation of AI systems remains a fundamental challenge when ground truth labels are unavailable, particularly for systems generating natural language outputs like AI chat and agent systems. Many of these AI agents and systems focus on entity-centric tasks. In enterprise contexts, organizations deploy AI systems for entity linking, data integration, and information retrieval where verification against gold standards is often infeasible due to proprietary data constraints. Academic deployments face similar challenges when evaluating AI systems on specialized datasets with ambiguous criteria. Conventional evaluation frameworks, rooted in supervised learning paradigms, fail in such scenarios where single correct answers cannot be defined. We introduce VB-Score, a variance-bounded evaluation framework for entity-centric AI systems that operates without ground truth by jointly measuring effectiveness and robustness. Given system inputs, VB-Score enumerates plausible interpretations through constraint relaxation and Monte Carlo sampling, assigning probabilities that reflect their likelihood. It then evaluates system outputs by their expected success across interpretations, penalized by variance to assess robustness of the system. We provide formal theoretical analysis establishing key properties including range, monotonicity, and stability along with concentration bounds for Monte Carlo estimation. Through case studies on AI systems with ambiguous inputs, we demonstrate that VB-Score reveals robustness differences hidden by conventional evaluation frameworks, offering a principled measurement framework for assessing AI system reliability in label-scarce domains.
♻ ☆ Two-Player Zero-Sum Games with Bandit Feedback
We study a two-player zero-sum game in which the row player aims to maximize their payoff against an adversarial column player, under an unknown payoff matrix estimated through bandit feedback. We propose three algorithms based on the Explore-Then-Commit framework. The first adapts it to zero-sum games, the second incorporates adaptive elimination that leverages the $\varepsilon$-Nash Equilibrium property to efficiently select the optimal action pair, and the third extends the elimination algorithm by employing non-uniform exploration. Our objective is to demonstrate the applicability of ETC in a zero-sum game setting by focusing on learning pure strategy Nash Equilibria. A key contribution of our work is a derivation of instance-dependent upper bounds on the expected regret of our proposed algorithms, which has received limited attention in the literature on zero-sum games. Particularly, after $T$ rounds, we achieve an instance-dependent regret upper bounds of $O(\Delta + \sqrt{T})$ for ETC in zero-sum game setting and $O(\log (T \Delta^2) / \Delta)$ for the adaptive elimination algorithm and its variant with non-uniform exploration, where $\Delta$ denotes the suboptimality gap. Therefore, our results indicate that ETC-based algorithms perform effectively in adversarial game settings, achieving regret bounds comparable to existing methods while providing insight through instance-dependent analysis.
comment: 21 pages
♻ ☆ Aggregation of Published Non-Uniform Axial Power Data for Phase II of the OECD/NEA AI/ML Critical Heat Flux Benchmark
Critical heat flux (CHF) marks the onset of boiling crisis in light-water reactors, defining safe thermal-hydraulic operating limits. To support Phase II of the OECD/NEA AI/ML CHF benchmark, which introduces spatially varying power profiles, this work compiles and digitizes a broad CHF dataset covering both uniform and non-uniform axial heating conditions. Heating profiles were extracted from technical reports, interpolated onto a consistent axial mesh, validated via energy-balance checks, and encoded in machine-readable formats for benchmark compatibility. Classical CHF correlations exhibit substantial errors under uniform heating and degrade markedly when applied to non-uniform profiles, while modern tabular methods offer improved but still imperfect predictions. A neural network trained solely on uniform data performs well in that regime but fails to generalize to spatially varying scenarios, underscoring the need for models that explicitly incorporate axial power distributions. By providing these curated datasets and baseline modeling results, this study lays the groundwork for advanced transfer-learning strategies, rigorous uncertainty quantification, and design-optimization efforts in the next phase of the CHF benchmark.
♻ ☆ HADSF: Aspect Aware Semantic Control for Explainable Recommendation WSDM 2026
Recent advances in large language models (LLMs) promise more effective information extraction for review-based recommender systems, yet current methods still (i) mine free-form reviews without scope control, producing redundant and noisy representations, (ii) lack principled metrics that link LLM hallucination to downstream effectiveness, and (iii) leave the cost-quality trade-off across model scales largely unexplored. We address these gaps with the Hyper-Adaptive Dual-Stage Semantic Framework (HADSF), a two-stage approach that first induces a compact, corpus-level aspect vocabulary via adaptive selection and then performs vocabulary-guided, explicitly constrained extraction of structured aspect-opinion triples. To assess the fidelity of the resulting representations, we introduce Aspect Drift Rate (ADR) and Opinion Fidelity Rate (OFR) and empirically uncover a nonmonotonic relationship between hallucination severity and rating prediction error. Experiments on approximately 3 million reviews across LLMs spanning 1.5B-70B parameters show that, when integrated into standard rating predictors, HADSF yields consistent reductions in prediction error and enables smaller models to achieve competitive performance in representative deployment scenarios. We release code, data pipelines, and metric implementations to support reproducible research on hallucination-aware, LLM-enhanced explainable recommendation. Code is available at https://github.com/niez233/HADSF
comment: Accepted by WSDM 2026
♻ ☆ Reflections from Research Roundtables at the Conference on Health, Inference, and Learning (CHIL) 2025
The 6th Annual Conference on Health, Inference, and Learning (CHIL 2025), hosted by the Association for Health Learning and Inference (AHLI), was held in person on June 25-27, 2025, at the University of California, Berkeley, in Berkeley, California, USA. As part of this year's program, we hosted Research Roundtables to catalyze collaborative, small-group dialogue around critical, timely topics at the intersection of machine learning and healthcare. Each roundtable was moderated by a team of senior and junior chairs who fostered open exchange, intellectual curiosity, and inclusive engagement. The sessions emphasized rigorous discussion of key challenges, exploration of emerging opportunities, and collective ideation toward actionable directions in the field. In total, eight roundtables were held by 19 roundtable chairs on topics of "Explainability, Interpretability, and Transparency," "Uncertainty, Bias, and Fairness," "Causality," "Domain Adaptation," "Foundation Models," "Learning from Small Medical Data," "Multimodal Methods," and "Scalable, Translational Healthcare Solutions."
♻ ☆ Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking NeurIPS 2025
Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57 %. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
comment: Accepted at NeurIPS 2025 Datasets and Benchmarks Track
♻ ☆ Implicit Bias in Matrix Factorization and its Explicit Realization in a New Architecture
Gradient descent for matrix factorization exhibits an implicit bias toward approximately low-rank solutions. While existing theories often assume the boundedness of iterates, empirically the bias persists even with unbounded sequences. This reflects a dynamic where factors develop low-rank structure while their magnitudes increase, tending to align with certain directions. To capture this behavior in a stable way, we introduce a new factorization model: $X\approx UDV^\top$, where $U$ and $V$ are constrained within norm balls, while $D$ is a diagonal factor allowing the model to span the entire search space. Experiments show that this model consistently exhibits a strong implicit bias, yielding truly (rather than approximately) low-rank solutions. Extending the idea to neural networks, we introduce a new model featuring constrained layers and diagonal components that achieves competitive performance on various regression and classification tasks while producing lightweight, low-rank representations.
♻ ☆ E2Former: An Efficient and Equivariant Transformer with Linear-Scaling Tensor Products
Equivariant Graph Neural Networks (EGNNs) have demonstrated significant success in modeling microscale systems, including those in chemistry, biology and materials science. However, EGNNs face substantial computational challenges due to the high cost of constructing edge features via spherical tensor products, making them impractical for large-scale systems. To address this limitation, we introduce E2Former, an equivariant and efficient transformer architecture that incorporates the Wigner $6j$ convolution (Wigner $6j$ Conv). By shifting the computational burden from edges to nodes, the Wigner $6j$ Conv reduces the complexity from $O(|\mathcal{E}|)$ to $ O(| \mathcal{V}|)$ while preserving both the model's expressive power and rotational equivariance. We show that this approach achieves a 7x-30x speedup compared to conventional $\mathrm{SO}(3)$ convolutions. Furthermore, our empirical results demonstrate that the derived E2Former mitigates the computational challenges of existing approaches without compromising the ability to capture detailed geometric information. This development could suggest a promising direction for scalable and efficient molecular modeling.
♻ ☆ Behavior of prediction performance metrics with rare events
Objective: Area under the receiving operator characteristic curve (AUC) is commonly reported alongside prediction models for binary outcomes. Recent articles have raised concerns that AUC might be a misleading measure of prediction performance in the rare event setting. This setting is common since many events of clinical importance are rare. We aimed to determine whether the bias and variance of AUC are driven by the number of events or the event rate. We also investigated the behavior of other commonly used measures of prediction performance, including positive predictive value, accuracy, sensitivity, and specificity. Study Design and Setting: We conducted a simulation study to determine when or whether AUC is unstable in the rare event setting by varying the size of datasets used to train and evaluate prediction models. This plasmode simulation study was based on data from the Mental Health Research Network; the data contained 149 predictors and the outcome of interest, suicide attempt, which had event rate 0.92\% in the original dataset. Results: Our results indicate that poor AUC behavior -- as measured by empirical bias, variability of cross-validated AUC estimates, and empirical coverage of confidence intervals -- is driven by the number of events in a rare-event setting, not event rate. Performance of sensitivity is driven by the number of events, while that of specificity is driven by the number of non-events. Other measures, including positive predictive value and accuracy, depend on the event rate even in large samples. Conclusion: AUC is reliable in the rare event setting provided that the total number of events is moderately large; in our simulations, we observed near zero bias with 1000 events.
comment: Accepted for publication in the Journal of Clinical Epidemiology. 51 pages (16 main, 35 supplementary), 26 tables (3 main, 23 supplementary), 6 figures (4 main, 2 supplementary)
Graphics 7
☆ Example-Based Feature Painting on Textures
In this work, we propose a system that covers the complete workflow for achieving controlled authoring and editing of textures that present distinctive local characteristics. These include various effects that change the surface appearance of materials, such as stains, tears, holes, abrasions, discoloration, and more. Such alterations are ubiquitous in nature, and including them in the synthesis process is crucial for generating realistic textures. We introduce a novel approach for creating textures with such blemishes, adopting a learning-based approach that leverages unlabeled examples. Our approach does not require manual annotations by the user; instead, it detects the appearance-altering features through unsupervised anomaly detection. The various textural features are then automatically clustered into semantically coherent groups, which are used to guide the conditional generation of images. Our pipeline as a whole goes from a small image collection to a versatile generative model that enables the user to interactively create and paint features on textures of arbitrary size. Notably, the algorithms we introduce for diffusion-based editing and infinite stationary texture generation are generic and should prove useful in other contexts as well. Project page: https://reality.tf.fau.de/pub/ardelean2025examplebased.html
comment: "\c{opyright} 2025 Andrei-Timotei Ardelean, Tim Weyrich. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Trans. Graph., Vol. 44, No. 6, https://doi.org/10.1145/3763301
☆ HMVLM: Human Motion-Vision-Lanuage Model via MoE LoRA
The expansion of instruction-tuning data has enabled foundation language models to exhibit improved instruction adherence and superior performance across diverse downstream tasks. Semantically-rich 3D human motion is being progressively integrated with these foundation models to enhance multimodal understanding and cross-modal generation capabilities. However, the modality gap between human motion and text raises unresolved concerns about catastrophic forgetting during this integration. In addition, developing autoregressive-compatible pose representations that preserve generalizability across heterogeneous downstream tasks remains a critical technical barrier. To address these issues, we propose the Human Motion-Vision-Language Model (HMVLM), a unified framework based on the Mixture of Expert Low-Rank Adaption(MoE LoRA) strategy. The framework leverages the gating network to dynamically allocate LoRA expert weights based on the input prompt, enabling synchronized fine-tuning of multiple tasks. To mitigate catastrophic forgetting during instruction-tuning, we introduce a novel zero expert that preserves the pre-trained parameters for general linguistic tasks. For pose representation, we implement body-part-specific tokenization by partitioning the human body into different joint groups, enhancing the spatial resolution of the representation. Experiments show that our method effectively alleviates knowledge forgetting during instruction-tuning and achieves remarkable performance across diverse human motion downstream tasks.
comment: 10 pages, 5figures. The Thirty-Ninth Annual Conference on Neural Information Processing Systems
☆ An Adjoint Method for Differentiable Fluid Simulation on Flow Maps
This paper presents a novel adjoint solver for differentiable fluid simulation based on bidirectional flow maps. Our key observation is that the forward fluid solver and its corresponding backward, adjoint solver share the same flow map as the forward simulation. In the forward pass, this map transports fluid impulse variables from the initial frame to the current frame to simulate vortical dynamics. In the backward pass, the same map propagates adjoint variables from the current frame back to the initial frame to compute gradients. This shared long-range map allows the accuracy of gradient computation to benefit directly from improvements in flow map construction. Building on this insight, we introduce a novel adjoint solver that solves the adjoint equations directly on the flow map, enabling long-range and accurate differentiation of incompressible flows without differentiating intermediate numerical steps or storing intermediate variables, as required in conventional adjoint methods. To further improve efficiency, we propose a long-short time-sparse flow map representation for evolving adjoint variables. Our approach has low memory usage, requiring only 6.53GB of data at a resolution of $192^3$ while preserving high accuracy in tracking vorticity, enabling new differentiable simulation tasks that require precise identification, prediction, and control of vortex dynamics.
comment: 15 pages, 16 figures
☆ Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark
We review human evaluation practices in automated, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models -- each trained by its original authors -- across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results provide strong evidence that 1) newer models do not consistently outperform earlier approaches; 2) published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. Finally, in order to drive standardisation and enable new evaluation research, we will release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies -- enabling new evaluations without model reimplementation required -- alongside our open-source rendering script, and the 16,000 pairwise human preference votes collected for our benchmark.
comment: 23 pages, 10 figures. The last two authors made equal contributions
♻ ☆ GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering
Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.
♻ ☆ FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility. Please check our website for more details: https://czzzzh.github.io/FreeArt3D
comment: Project Page: https://czzzzh.github.io/FreeArt3D Code: https://github.com/CzzzzH/FreeArt3D
♻ ☆ Near Real-time Adaptive Isotropic and Anisotropic Image-to-mesh Conversion for Numerical Simulations Involving Cerebral Aneurysms
Presented are two techniques that are designed to help streamline the discretization of complex vascular geometries within the numerical modeling process. The first method integrates multiple software tools into a single pipeline which can generate adaptive anisotropic meshes from segmented medical images. The pipeline is shown to satisfy quality, fidelity, smoothness, and robustness requirements while providing near real-time performance for medical image-to-mesh conversion. The second method approximates a user-defined sizing function to generate adaptive isotropic meshes of good quality and fidelity in real-time. Tested with two brain aneurysm cases and utilizing up to 96 CPU cores within a single, multicore node on Purdue University's Anvil supercomputer, the parallel adaptive anisotropic meshing method utilizes a hierarchical load balancing model (designed for large, cc-NUMA shared memory architectures) and contains an optimized local reconnection operation that performs three times faster than its original implementation from previous studies. The adaptive isotropic method is shown to generate a mesh of up to approximately 50 million elements in less than a minute while the adaptive anisotropic method is shown to generate approximately the same number of elements in about 5 minutes.
comment: 57 pages, 16 figures, 13 tables, presented at the 18th U.S. National Congress on Computational Mechanics conference
Robotics 5
♻ ☆ Dropping the D: RGB-D SLAM Without the Depth Sensor
We present DropD-SLAM, a real-time monocular SLAM system that achieves RGB-D-level accuracy without relying on depth sensors. The system replaces active depth input with three pretrained vision modules: a monocular metric depth estimator, a learned keypoint detector, and an instance segmentation network. Dynamic objects are suppressed using dilated instance masks, while static keypoints are assigned predicted depth values and backprojected into 3D to form metrically scaled features. These are processed by an unmodified RGB-D SLAM back end for tracking and mapping. On the TUM RGB-D benchmark, DropD-SLAM attains 7.4 cm mean ATE on static sequences and 1.8 cm on dynamic sequences, matching or surpassing state-of-the-art RGB-D methods while operating at 22 FPS on a single GPU. These results suggest that modern pretrained vision models can replace active depth sensors as reliable, real-time sources of metric scale, marking a step toward simpler and more cost-effective SLAM systems.
♻ ☆ Robust Trajectory Generation and Control for Quadrotor Motion Planning with Field-of-View Control Barrier Certification
Many approaches to multi-robot coordination are susceptible to failure due to communication loss and uncertainty in estimation. We present a real-time communication-free distributed navigation algorithm certified by control barrier functions, that models and controls the onboard sensing behavior to keep neighbors in the limited field of view for position estimation. The approach is robust to temporary tracking loss and directly synthesizes control to stabilize visual contact through control Lyapunov-barrier functions. The main contributions of this paper are a continuous-time robust trajectory generation and control method certified by control barrier functions for distributed multi-robot systems and a discrete optimization procedure, namely, MPC-CBF, to approximate the certified controller. In addition, we propose a linear surrogate of high-order control barrier function constraints and use sequential quadratic programming to solve MPC-CBF efficiently.
comment: 8 pages, 8 figures, 3 tables, accepted to RA-L 2025
♻ ☆ MOSPA: Human Motion Generation Driven by Spatial Audio NeurIPS 2025
Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As of yet, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our code and model are publicly available at https://github.com/xsy27/Mospa-Acoustic-driven-Motion-Generation
comment: NeurIPS 2025 (Spotlight)
♻ ☆ Co-MTP: A Cooperative Trajectory Prediction Framework with Multi-Temporal Fusion for Autonomous Driving ICRA 2025
Vehicle-to-everything technologies (V2X) have become an ideal paradigm to extend the perception range and see through the occlusion. Exiting efforts focus on single-frame cooperative perception, however, how to capture the temporal cue between frames with V2X to facilitate the prediction task even the planning task is still underexplored. In this paper, we introduce the Co-MTP, a general cooperative trajectory prediction framework with multi-temporal fusion for autonomous driving, which leverages the V2X system to fully capture the interaction among agents in both history and future domains to benefit the planning. In the history domain, V2X can complement the incomplete history trajectory in single-vehicle perception, and we design a heterogeneous graph transformer to learn the fusion of the history feature from multiple agents and capture the history interaction. Moreover, the goal of prediction is to support future planning. Thus, in the future domain, V2X can provide the prediction results of surrounding objects, and we further extend the graph transformer to capture the future interaction among the ego planning and the other vehicles' intentions and obtain the final future scenario state under a certain planning action. We evaluate the Co-MTP framework on the real-world dataset V2X-Seq, and the results show that Co-MTP achieves state-of-the-art performance and that both history and future fusion can greatly benefit prediction.
comment: 8 pages, 3 figures, ICRA 2025
♻ ☆ Event-RGB Fusion for Spacecraft Pose Estimation Under Harsh Lighting
Spacecraft pose estimation is crucial for autonomous in-space operations, such as rendezvous, docking and on-orbit servicing. Vision-based pose estimation methods, which typically employ RGB imaging sensors, is a compelling solution for spacecraft pose estimation, but are challenged by harsh lighting conditions, which produce imaging artifacts such as glare, over-exposure, blooming and lens flare. Due to their much higher dynamic range, neuromorphic or event sensors are more resilient to extreme lighting conditions. However, event sensors generally have lower spatial resolution and suffer from reduced signal-to-noise ratio during periods of low relative motion. This work addresses these individual sensor limitations by introducing a sensor fusion approach combining RGB and event sensors. A beam-splitter prism was employed to achieve precise optical and temporal alignment. Then, a RANSAC-based technique was developed to fuse the information from the RGB and event channels to achieve pose estimation that leveraged the strengths of the two modalities. The pipeline was complemented by dropout uncertainty estimation to detect extreme conditions that affect either channel. To benchmark the performance of the proposed event-RGB fusion method, we collected a comprehensive real dataset of RGB and event data for satellite pose estimation in a laboratory setting under a variety of challenging illumination conditions. Encouraging results on the dataset demonstrate the efficacy of our event-RGB fusion approach and further supports the usage of event sensors for spacecraft pose estimation. To support community research on this topic, our dataset has been released publicly.
comment: Associated dataset: https://zenodo.org/records/15861758
Artificial Intelligence 47
♻ ☆ Learning to Steer: Input-dependent Steering for Multimodal LLMs NeurIPS 2025
Steering has emerged as a practical approach to enable post-hoc guidance of LLMs towards enforcing a specific behavior. However, it remains largely underexplored for multimodal LLMs (MLLMs); furthermore, existing steering techniques, such as mean steering, rely on a single steering vector, applied independently of the input query. This paradigm faces limitations when the desired behavior is dependent on the example at hand. For example, a safe answer may consist in abstaining from answering when asked for an illegal activity, or may point to external resources or consultation with an expert when asked about medical advice. In this paper, we investigate a fine-grained steering that uses an input-specific linear shift. This shift is computed using contrastive input-specific prompting. However, the input-specific prompts required for this approach are not known at test time. Therefore, we propose to train a small auxiliary module to predict the input-specific steering vector. Our approach, dubbed as L2S (Learn-to-Steer), demonstrates that it reduces hallucinations and enforces safety in MLLMs, outperforming other static baselines. Our code is publicly available at https://jayneelparekh.github.io/learn-to-steer/
comment: NeurIPS 2025
♻ ☆ AutoSurvey2: Empowering Researchers with Next Level Automated Literature Surveys KDD 2025
The rapid growth of research literature, particularly in large language models (LLMs), has made producing comprehensive and current survey papers increasingly difficult. This paper introduces autosurvey2, a multi-stage pipeline that automates survey generation through retrieval-augmented synthesis and structured evaluation. The system integrates parallel section generation, iterative refinement, and real-time retrieval of recent publications to ensure both topical completeness and factual accuracy. Quality is assessed using a multi-LLM evaluation framework that measures coverage, structure, and relevance in alignment with expert review standards. Experimental results demonstrate that autosurvey2 consistently outperforms existing retrieval-based and automated baselines, achieving higher scores in structural coherence and topical relevance while maintaining strong citation fidelity. By combining retrieval, reasoning, and automated evaluation into a unified framework, autosurvey2 provides a scalable and reproducible solution for generating long-form academic surveys and contributes a solid foundation for future research on automated scholarly writing. All code and resources are available at https://github.com/annihi1ation/auto_research.
comment: TKDD 2025
♻ ☆ A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models
Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer. This survey defines the problem and organizes the literature by reasoning topology with three families: direct reasoning in one step, linear chain reasoning with explicit intermediates, and branch-structured reasoning that explores, revises, and aggregates. The topology is crossed with the main objectives of the field, including traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation, while a compact tag set spans these axes and captures decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM alignment regimes. Methods and systems are reviewed across domains, showing what each topology enables and where it breaks down in faithfulness or robustness, along with curated datasets, benchmarks, and resources that support study and deployment (https://github.com/blacksnail789521/Time-Series-Reasoning-Survey). Evaluation practices that keep evidence visible and temporally aligned are highlighted, and guidance is distilled on matching topology to uncertainty, grounding with observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets. We emphasize that reasoning structures must balance capacity for grounding and self-correction against computational cost and reproducibility, while future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk under shift-aware, streaming, and long-horizon settings. Taken together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that not only analyze but also understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.
comment: This paper is currently under review
♻ ☆ Multi-Step Reasoning with Large Language Models, a Survey
Large language models (LLMs) with billions of parameters exhibit in-context learning abilities, enabling few-shot learning on tasks that the model was not specifically trained for. Traditional models achieve breakthrough performance on language tasks, but do not perform well on basic reasoning benchmarks. However, a new in-context learning approach, Chain-of-thought, has demonstrated strong multi-step reasoning abilities on these benchmarks. The research on LLM reasoning abilities started with the question whether LLMs can solve grade school math word problems, and has expanded to other tasks in the past few years. This article reviews the field of multi-step reasoning with LLMs. We propose a taxonomy that identifies different ways to generate, evaluate, and control multi-step reasoning. We provide an in-depth coverage of core approaches and open problems, and we propose a research agenda for the near future. We find that multi-step reasoning approaches have progressed beyond math word problems, and can now successfully solve challenges in logic, combinatorial games, and robotics, sometimes by first generating code that is then executed by external tools. Many studies in multi-step methods use reinforcement learning for finetuning, external optimization loops, in-context reinforcement learning, and self-reflection.
comment: ACM Computing Surveys
♻ ☆ Spatial Knowledge Graph-Guided Multimodal Synthesis
Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. Our approach addresses this critical gap by providing a systematic framework for generating spatially coherent data. In this work, we introduce SKG2DATA, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2DATA employs an automated pipeline for constructing Spatial Knowledge Graph (SKG) that effectively captures human-like spatial cognition, including directional and distance relationships. These structured representations then serve as precise guidance for our integrated synthesis pipeline, where a diffusion model generates spatially-consistent images while a MLLM produces corresponding textual descriptions. The automated construction of SKG enables scalable generation of diverse yet realistic spatial configurations, overcoming the limitations of manual data collection and annotation. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly, albeit with a slight cost to their general capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence. Code is available at https://github.com/zjunlp/Knowledge2Data.
comment: IEEE/ACM Transactions on Audio, Speech and Language Processing
♻ ☆ Multi-head Temporal Latent Attention NeurIPS 2025
While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency. Multi-head latent attention was recently developed to compress the KV cache into a low-rank latent space. This paper proposes Multi-head Temporal Latent Attention (MTLA), which further reduces the KV cache size along the temporal dimension, greatly lowering the memory footprint of self-attention inference. MTLA employs a hyper-network to dynamically merge temporally adjacent KV cache vectors. To address the mismatch between the compressed KV cache and processed sequence lengths, a stride-aware causal mask is proposed to ensure efficient parallel training and consistency with inference behaviour. Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation, demonstrate that MTLA achieves competitive performance compared to standard Multi-Head Attention (MHA), while greatly improving inference speed and GPU memory usage. For example, on a English-German speech translation task, MTLA achieves a 5.3x speedup and a reduction in GPU memory usage by a factor of 8.3 compared to MHA, while maintaining translation quality.
comment: Accepted by NeurIPS 2025
♻ ☆ How to Train Your LLM Web Agent: A Statistical Diagnosis
LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
♻ ☆ Solving Inequality Proofs with Large Language Models NeurIPS 2025
Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.
comment: 50 pages, 24 figures, accepted as a Spotlight at NeurIPS 2025
♻ ☆ Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning
Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory, significantly limiting real-world deployment. Recently, zeroth-order (ZO) optimization stood out as a promising memory-efficient training paradigm, avoiding backward passes and relying solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO method lags far behind FO method in both convergence speed and accuracy. To bridge the gap, we introduce a novel layer-wise divergence analysis that uncovers the distinct update pattern of FO and ZO optimization. Aiming to resemble the learning capacity of FO method from the findings, we propose Divergence-driven Zeroth-Order (DiZO) optimization. DiZO conducts divergence-driven layer adaptation by incorporating projections to ZO updates, generating diverse-magnitude updates precisely scaled to layer-wise individual optimization needs. Our results demonstrate that DiZO significantly reduces the needed iterations for convergence without sacrificing throughput, cutting training GPU hours by up to 48\% on various datasets. Moreover, DiZO consistently outperforms the representative ZO baselines in fine-tuning RoBERTa-large, OPT-series, and Llama-series on downstream tasks and, in some cases, even surpasses memory-intensive FO fine-tuning. Our code is released at https://github.com/Skilteee/DiZO.
♻ ☆ Agentic Large Language Models for Conceptual Systems Engineering and Design
Early-stage engineering design involves complex, iterative reasoning, yet existing large language model (LLM) workflows struggle to maintain task continuity and generate executable models. We evaluate whether a structured multi-agent system (MAS) can more effectively manage requirements extraction, functional decomposition, and simulator code generation than a simpler two-agent system (2AS). The target application is a solar-powered water filtration system as described in a cahier des charges. We introduce the Design-State Graph (DSG), a JSON-serializable representation that bundles requirements, physical embodiments, and Python-based physics models into graph nodes. A nine-role MAS iteratively builds and refines the DSG, while the 2AS collapses the process to a Generator-Reflector loop. Both systems run a total of 60 experiments (2 LLMs - Llama 3.3 70B vs reasoning-distilled DeepSeek R1 70B x 2 agent configurations x 3 temperatures x 5 seeds). We report a JSON validity, requirement coverage, embodiment presence, code compatibility, workflow completion, runtime, and graph size. Across all runs, both MAS and 2AS maintained perfect JSON integrity and embodiment tagging. Requirement coverage remained minimal (less than 20%). Code compatibility peaked at 100% under specific 2AS settings but averaged below 50% for MAS. Only the reasoning-distilled model reliably flagged workflow completion. Powered by DeepSeek R1 70B, the MAS generated more granular DSGs (average 5-6 nodes) whereas 2AS mode-collapsed. Structured multi-agent orchestration enhanced design detail. Reasoning-distilled LLM improved completion rates, yet low requirements and fidelity gaps in coding persisted.
comment: 32 pages, 4 figures
♻ ☆ A Racing Dataset and Baseline Model for Track Detection in Autonomous Racing
A significant challenge in racing-related research is the lack of publicly available datasets containing raw images with corresponding annotations for the downstream task. In this paper, we introduce RoRaTrack, a novel dataset that contains annotated multi-camera image data from racing scenarios for track detection. The data is collected on a Dallara AV-21 at a racing circuit in Indiana, in collaboration with the Indy Autonomous Challenge (IAC). RoRaTrack addresses common problems such as blurriness due to high speed, color inversion from the camera, and absence of lane markings on the track. Consequently, we propose RaceGAN, a baseline model based on a Generative Adversarial Network (GAN) that effectively addresses these challenges. The proposed model demonstrates superior performance compared to current state-of-the-art machine learning models in track detection. The dataset and code for this work are available at https://github.com/ghosh64/RaceGAN.
comment: Currently Under Review
♻ ☆ Over-squashing in Spatiotemporal Graph Neural Networks NeurIPS 2025
Graph Neural Networks (GNNs) have achieved remarkable success across various domains. However, recent theoretical advances have identified fundamental limitations in their information propagation capabilities, such as over-squashing, where distant nodes fail to effectively exchange information. While extensively studied in static contexts, this issue remains unexplored in Spatiotemporal GNNs (STGNNs), which process sequences associated with graph nodes. Nonetheless, the temporal dimension amplifies this challenge by increasing the information that must be propagated. In this work, we formalize the spatiotemporal over-squashing problem and demonstrate its distinct characteristics compared to the static case. Our analysis reveals that, counterintuitively, convolutional STGNNs favor information propagation from points temporally distant rather than close in time. Moreover, we prove that architectures that follow either time-and-space or time-then-space processing paradigms are equally affected by this phenomenon, providing theoretical justification for computationally efficient implementations. We validate our findings on synthetic and real-world datasets, providing deeper insights into their operational dynamics and principled guidance for more effective designs.
comment: Accepted at NeurIPS 2025
♻ ☆ Understanding Endogenous Data Drift in Adaptive Models with Recourse-Seeking Users
Deep learning models are widely used in decision-making and recommendation systems, where they typically rely on the assumption of a static data distribution between training and deployment. However, real-world deployment environments often violate this assumption. Users who receive negative outcomes may adapt their features to meet model criteria, i.e., recourse action. These adaptive behaviors create shifts in the data distribution and when models are retrained on this shifted data, a feedback loop emerges: user behavior influences the model, and the updated model in turn reshapes future user behavior. Despite its importance, this bidirectional interaction between users and models has received limited attention. In this work, we develop a general framework to model user strategic behaviors and their interactions with decision-making systems under resource constraints and competitive dynamics. Both the theoretical and empirical analyses show that user recourse behavior tends to push logistic and MLP models toward increasingly higher decision standards, resulting in higher recourse costs and less reliable recourse actions over time. To mitigate these challenges, we propose two methods--Fair-top-k and Dynamic Continual Learning (DCL)--which significantly reduce recourse cost and improve model robustness. Our findings draw connections to economic theories, highlighting how algorithmic decision-making can unintentionally reinforce a higher standard and generate endogenous barriers to entry.
comment: 13 pages,4 figures, 3 tables
♻ ☆ Debiasing LLMs by Masking Unfairness-Driving Attention Heads
Large language models (LLMs) increasingly mediate decisions in domains where unfair treatment of demographic groups is unacceptable. Existing work probes when biased outputs appear, but gives little insight into the mechanisms that generate them, leaving existing mitigations largely fragile. In this paper, we conduct a systematic investigation LLM unfairness and propose DiffHeads, a lightweight debiasing framework for LLMs. We first compare Direct-Answer (DA) prompting to Chain-of-Thought (CoT) prompting across eight representative open- and closed-source LLMs. DA will trigger the nature bias part of LLM and improve measured unfairness by 534.5%-391.9% in both one-turn and two-turn dialogues. Next, we define a token-to-head contribution score that traces each token's influence back to individual attention heads. This reveals a small cluster of bias heads that activate under DA but stay largely dormant with CoT, providing the first causal link between prompting strategy and bias emergence. Finally, building on this insight, we propose DiffHeads that identifies bias heads through differential activation analysis between DA and CoT, and selectively masks only those heads. DiffHeads reduces unfairness by 49.4%, and 40.3% under DA and CoT, respectively, without harming model utility.
♻ ☆ Exploring the limits of strong membership inference attacks on large language models NeurIPS 2025
State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has either relied on weaker attacks that avoid training references (e.g., fine-tuning attacks), or on stronger attacks applied to small models and datasets. However, weaker attacks have been shown to be brittle and insights from strong attacks in simplified settings do not translate to today's LLMs. These challenges prompt an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA--one of the strongest MIAs--to GPT-2 architectures ranging from 10M to 1B parameters, training references on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in four key ways. While (1) strong MIAs can succeed on pre-trained LLMs, (2) their effectiveness, remains limited (e.g., AUC<0.7) in practical settings. (3) Even when strong MIAs achieve better-than-random AUC, aggregate metrics can conceal substantial per-sample MIA decision instability: due to training randomness, many decisions are so unstable that they are statistically indistinguishable from a coin flip. Finally, (4) the relationship between MIA success and related LLM privacy metrics is not as straightforward as prior work has suggested.
comment: NeurIPS 2025
♻ ☆ Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals NeurIPS 2025
Retrieval-augmented generation (RAG) has shown impressive capabilities in mitigating hallucinations in large language models (LLMs). However, LLMs struggle to maintain consistent reasoning when exposed to misleading or conflicting evidence, especially in real-world domains such as politics, where information is polarized or selectively framed. Mainstream RAG benchmarks evaluate models under clean retrieval settings, where systems generate answers from gold-standard documents, or under synthetically perturbed settings, where documents are artificially injected with noise. These assumptions fail to reflect real-world conditions, often leading to an overestimation of RAG system performance. To address this gap, we introduce RAGuard, the first benchmark to evaluate the robustness of RAG systems against misleading retrievals. Unlike prior benchmarks that rely on synthetic noise, our fact-checking dataset captures naturally occurring misinformation by constructing its retrieval corpus from Reddit discussions. It categorizes retrieved evidence into three types: supporting, misleading, and unrelated, providing a realistic and challenging testbed for assessing how well RAG systems navigate different types of evidence. Our experiments reveal that, when exposed to potentially misleading retrievals, all tested LLM-powered RAG systems perform worse than their zero-shot baselines (i.e., no retrieval at all), while human annotators consistently perform better, highlighting LLMs' susceptibility to noisy environments. To our knowledge, RAGuard is the first benchmark to systematically assess the robustness of the RAG against misleading evidence. We expect this benchmark to drive future research toward improving RAG systems beyond idealized datasets, making them more reliable for real-world applications. The dataset is available at https://huggingface.co/datasets/UCSC-IRKM/RAGuard.
comment: Advances in Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Why Attention Fails: A Taxonomy of Faults in Attention-Based Neural Networks
Attention mechanisms are at the core of modern neural architectures, powering systems ranging from ChatGPT to autonomous vehicles and driving a major economic impact. However, high-profile failures, such as ChatGPT's nonsensical outputs or Google's suspension of Gemini's image generation due to attention weight errors, highlight a critical gap: existing deep learning fault taxonomies might not adequately capture the unique failures introduced by attention mechanisms. This gap leaves practitioners without actionable diagnostic guidance. To address this gap, we present the first comprehensive empirical study of faults in attention-based neural networks (ABNNs). Our work is based on a systematic analysis of 555 real-world faults collected from 96 projects across ten frameworks, including GitHub, Hugging Face, and Stack Overflow. Through our analysis, we develop a novel taxonomy comprising seven attention-specific fault categories, not captured by existing work. Our results show that over half of the ABNN faults arise from mechanisms unique to attention architectures. We further analyze the root causes and manifestations of these faults through various symptoms. Finally, by analyzing symptom-root cause associations, we identify four evidence-based diagnostic heuristics that explain 33.0% of attention-specific faults, offering the first systematic diagnostic guidance for attention-based models.
♻ ☆ Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding NeurIPS 2025
Test-time scaling enhances large language model performance by allocating additional compute resources during inference. Best-of-N (BoN) sampling serves as a common sampling-based scaling technique, broadening the search space in parallel to find better solutions from the model distribution. However, its cost-performance trade-off is still underexplored. Two main challenges limit the efficiency of BoN sampling: (1) Generating N full samples consumes substantial GPU memory, reducing inference capacity under limited resources. (2) Reward models add extra memory and latency overhead, and training strong reward models introduces potential training data costs. Although some studies have explored efficiency improvements, none have addressed both challenges at once. To address this gap, we propose Self-Truncation Best-of-N (ST-BoN), a decoding method that avoids fully generating all N samples and eliminates the need for reward models. It leverages early sampling consistency in the model's internal states to identify the most promising path and truncate suboptimal ones. In terms of cost, ST-BoN reduces dynamic GPU memory usage by over 80% and inference latency by 50%. In terms of cost-performance trade-off, ST-BoN achieves the same performance as Full-BoN while saving computational cost by 70%-80%, and under the same cost, it can improve accuracy by 3-4 points.
comment: Accepted by NeurIPS 2025 (Spotlight)
♻ ☆ Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios
Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. The recent emergence of multimodal reasoning models has significantly transformed the landscape of solving complex tasks. Although such models have achieved notable success in mathematics and science, their application to medical domains remains underexplored. In this work, we propose \textit{MedE$^2$}, a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for medical domains. In Stage-I, we fine-tune models using 2,000 text-only data samples containing precisely orchestrated reasoning demonstrations to elicit reasoning behaviors. In Stage-II, we further enhance the model's reasoning capabilities using 1,500 rigorously curated multimodal medical cases, aligning model reasoning outputs with our proposed multimodal medical reasoning preference. Extensive experiments demonstrate the efficacy and reliability of \textit{MedE$^2$} in improving the reasoning performance of medical multimodal models. Notably, models trained with \textit{MedE$^2$} consistently outperform baselines across multiple medical multimodal benchmarks. Additional validation on larger models and under inference-time scaling further confirms the robustness and practical utility of our approach.
♻ ☆ Co-MTP: A Cooperative Trajectory Prediction Framework with Multi-Temporal Fusion for Autonomous Driving ICRA 2025
Vehicle-to-everything technologies (V2X) have become an ideal paradigm to extend the perception range and see through the occlusion. Exiting efforts focus on single-frame cooperative perception, however, how to capture the temporal cue between frames with V2X to facilitate the prediction task even the planning task is still underexplored. In this paper, we introduce the Co-MTP, a general cooperative trajectory prediction framework with multi-temporal fusion for autonomous driving, which leverages the V2X system to fully capture the interaction among agents in both history and future domains to benefit the planning. In the history domain, V2X can complement the incomplete history trajectory in single-vehicle perception, and we design a heterogeneous graph transformer to learn the fusion of the history feature from multiple agents and capture the history interaction. Moreover, the goal of prediction is to support future planning. Thus, in the future domain, V2X can provide the prediction results of surrounding objects, and we further extend the graph transformer to capture the future interaction among the ego planning and the other vehicles' intentions and obtain the final future scenario state under a certain planning action. We evaluate the Co-MTP framework on the real-world dataset V2X-Seq, and the results show that Co-MTP achieves state-of-the-art performance and that both history and future fusion can greatly benefit prediction.
comment: 8 pages, 3 figures, ICRA 2025
♻ ☆ Evaluating Federated Learning for At-Risk Student Prediction: A Comparative Analysis of Model Complexity and Data Balancing
This study proposes and validates a Federated Learning (FL) framework to proactively identify at-risk students while preserving data privacy. Persistently high dropout rates in distance education remain a pressing institutional challenge. Using the large-scale OULAD dataset, we simulate a privacy-centric scenario where models are trained on early academic performance and digital engagement patterns. Our work investigates the practical trade-offs between model complexity (Logistic Regression vs. a Deep Neural Network) and the impact of local data balancing. The resulting federated model achieves strong predictive power (ROC AUC approximately 85%), demonstrating that FL is a practical and scalable solution for early-warning systems that inherently respects student data sovereignty.
comment: This article has been prepared to be submitted to the Fundamenta Informaticae Journal
♻ ☆ Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
comment: Preprint
♻ ☆ KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization
This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems, consisting of Automatic Speech Recognition (ASR) and Machine Translation (MT) models, and end-to-end (E2E) Speech Translation (ST) systems for three language pairs: Bemba, North Levantine Arabic, and Tunisian Arabic into English. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently. This study further explores system enhancement with synthetic data and model regularization. Specifically, we investigate MT-augmented ST by generating translations from ASR data using MT models. For North Levantine, which lacks parallel ST training data, a system trained solely on synthetic data slightly surpasses the cascaded system trained on real data. We also explore augmentation using text-to-speech models by generating synthetic speech from MT data, demonstrating the benefits of synthetic data in improving both ASR and ST performance for Bemba. Additionally, we apply intra-distillation to enhance model performance. Our experiments show that this approach consistently improves results across ASR, MT, and ST tasks, as well as across different pre-trained models. Finally, we apply Minimum Bayes Risk decoding to combine the cascaded and end-to-end systems, achieving an improvement of approximately 1.5 BLEU points.
♻ ☆ Mantis: A Simulation-Grounded Foundation Model for Disease Forecasting
Infectious disease forecasting in novel outbreaks or low-resource settings is hampered by the need for disease-specific data, bespoke training, and expert tuning. We introduce Mantis, a foundation model trained entirely on mechanistic simulations, which enables out-of-the-box forecasting across diseases, regions, and outcomes, even in settings with limited historical data. We evaluated Mantis against 48 forecasting models across six diseases with diverse transmission modes, assessing both point forecast accuracy (mean absolute error) and probabilistic performance (weighted interval score and coverage). Despite using no real-world data during training, Mantis achieved lower mean absolute error than all models in the CDC's COVID-19 Forecast Hub when backtested on early pandemic forecasts. Across all other diseases tested, including respiratory, vector-borne, and waterborne pathogens, Mantis consistently ranked in the top two models across all evaluation metrics. Notably, Mantis generalized to diseases with transmission mechanisms not represented in its training data, demonstrating that it captures fundamental contagion dynamics rather than memorizing disease-specific patterns. These capabilities position Mantis as a practical foundation for disease forecasting: general-purpose, accurate, and deployable where traditional models fail.
comment: 10 pages, 4 figures
♻ ☆ SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks
With the rapid advancement of Large Language Models (LLMs), the safety of LLMs has been a critical concern requiring precise assessment. Current benchmarks primarily concentrate on single-turn dialogues or a single jailbreak attack method to assess the safety. Additionally, these benchmarks have not taken into account the LLM's capability of identifying and handling unsafe information in detail. To address these issues, we propose a fine-grained benchmark SafeDialBench for evaluating the safety of LLMs across various jailbreak attacks in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that considers 6 safety dimensions and generates more than 4000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios. We employ 7 jailbreak attack strategies, such as reference attack and purpose reverse, to enhance the dataset quality for dialogue generation. Notably, we construct an innovative assessment framework of LLMs, measuring capabilities in detecting, and handling unsafe information and maintaining consistency when facing jailbreak attacks. Experimental results across 17 LLMs reveal that Yi-34B-Chat and GLM4-9B-Chat demonstrate superior safety performance, while Llama3.1-8B-Instruct and o3-mini exhibit safety vulnerabilities.
♻ ☆ Medical Hallucinations in Foundation Models and Their Impact on Healthcare
Hallucinations in foundation models arise from autoregressive training objectives that prioritize token-likelihood optimization over epistemic accuracy, fostering overconfidence and poorly calibrated uncertainty. We define medical hallucination as any model-generated output that is factually incorrect, logically inconsistent, or unsupported by authoritative clinical evidence in ways that could alter clinical decisions. We evaluated 11 foundation models (7 general-purpose, 4 medical-specialized) across seven medical hallucination tasks spanning medical reasoning and biomedical information retrieval. General-purpose models achieved significantly higher proportions of hallucination-free responses than medical-specialized models (median: 76.6% vs 51.3%, difference = 25.2%, 95% CI: 18.7-31.3%, Mann-Whitney U = 27.0, p = 0.012, rank-biserial r = -0.64). Top-performing models such as Gemini-2.5 Pro exceeded 97% accuracy when augmented with chain-of-thought prompting (base: 87.6%), while medical-specialized models like MedGemma ranged from 28.6-61.9% despite explicit training on medical corpora. Chain-of-thought reasoning significantly reduced hallucinations in 86.4% of tested comparisons after FDR correction (q < 0.05), demonstrating that explicit reasoning traces enable self-verification and error detection. Physician audits confirmed that 64-72% of residual hallucinations stemmed from causal or temporal reasoning failures rather than knowledge gaps. A global survey of clinicians (n = 70) validated real-world impact: 91.8% had encountered medical hallucinations, and 84.7% considered them capable of causing patient harm. The underperformance of medical-specialized models despite domain training indicates that safety emerges from sophisticated reasoning capabilities and broad knowledge integration developed during large-scale pre-training, not from narrow optimization.
♻ ☆ EndoGMDE: Generalizable Monocular Depth Estimation with Mixture of Low-Rank Experts for Diverse Endoscopic Scenes
Self-supervised monocular depth estimation is a significant task for low-cost and efficient 3D scene perception and measurement in endoscopy. However, the variety of illumination conditions and scene features is still the primary challenges for depth estimation in endoscopic scenes. In this work, a novel self-supervised framework is proposed for monocular depth estimation in diverse endoscopy. Firstly, considering the diverse features in endoscopic scenes with different tissues, a novel block-wise mixture of dynamic low-rank experts is proposed to efficiently finetune the foundation model for endoscopic depth estimation. In the proposed module, based on the input feature, different experts with a small amount of trainable parameters are adaptively selected for weighted inference, from low-rank experts which are allocated based on the generalization of each block. Moreover, a novel self-supervised training framework is proposed to jointly cope with brightness inconsistency and reflectance interference. The proposed method outperforms state-of-the-art works on SCARED dataset and SimCol dataset. Furthermore, the proposed network also achieves the best generalization based on zero-shot depth estimation on C3VD, Hamlyn and SERV-CT dataset. The outstanding performance of our model is further demonstrated with 3D reconstruction and ego-motion estimation. The proposed method could contribute to accurate endoscopy for minimally invasive measurement and surgery. The evaluation codes will be released upon acceptance, while the demo videos can be found on: https://endo-gmde.netlify.app/.
comment: 12 pages, 12 figures, 7 tables. Under Review
♻ ☆ Pareto-NRPA: A Novel Monte-Carlo Search Algorithm for Multi-Objective Optimization ECAI 2025
We introduce Pareto-NRPA, a new Monte-Carlo algorithm designed for multi-objective optimization problems over discrete search spaces. Extending the Nested Rollout Policy Adaptation (NRPA) algorithm originally formulated for single-objective problems, Pareto-NRPA generalizes the nested search and policy update mechanism to multi-objective optimization. The algorithm uses a set of policies to concurrently explore different regions of the solution space and maintains non-dominated fronts at each level of search. Policy adaptation is performed with respect to the diversity and isolation of sequences within the Pareto front. We benchmark Pareto-NRPA on two classes of problems: a novel bi-objective variant of the Traveling Salesman Problem with Time Windows problem (MO-TSPTW), and a neural architecture search task on well-known benchmarks. Results demonstrate that Pareto-NRPA achieves competitive performance against state-of-the-art multi-objective algorithms, both in terms of convergence and diversity of solutions. Particularly, Pareto-NRPA strongly outperforms state-of-the-art evolutionary multi-objective algorithms on constrained search spaces. To our knowledge, this work constitutes the first adaptation of NRPA to the multi-objective setting.
comment: Accepted as a conference paper to ECAI 2025
♻ ☆ Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists' First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.
comment: 82 pages
♻ ☆ Neuromorphic Intelligence
Neuromorphic computing seeks to replicate the remarkable efficiency, flexibility, and adaptability of the human brain in artificial systems. Unlike conventional digital approaches, which suffer from the Von Neumann bottleneck and depend on massive computational and energy resources, neuromorphic systems exploit brain-inspired principles of computation to achieve orders of magnitude greater energy efficiency. By drawing on insights from a wide range of disciplines -- including artificial intelligence, physics, chemistry, biology, neuroscience, cognitive science and materials science -- neuromorphic computing promises to deliver intelligent systems that are sustainable, transparent, and widely accessible. A central challenge, however, is to identify a unifying theoretical framework capable of bridging these diverse disciplines. We argue that dynamical systems theory provides such a foundation. Rooted in differential calculus, it offers a principled language for modeling inference, learning, and control in both natural and artificial substrates. Within this framework, noise can be harnessed as a resource for learning, while differential genetic programming enables the discovery of dynamical systems that implement adaptive behaviors. Embracing this perspective paves the way toward emergent neuromorphic intelligence, where intelligent behavior arises from the dynamics of physical substrates, advancing both the science and sustainability of AI.
comment: 16 pages, 2 figures, 3 boxes
♻ ☆ REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning NeurIPS 2025
Recent rehearsal-free continual learning (CL) methods guided by prompts achieve strong performance on vision tasks with non-stationary data but remain resource-intensive, hindering real-world edge deployment. We introduce resource-efficient prompting (REP), which improves the computational and memory efficiency of prompt-based rehearsal-free continual learning methods while minimizing accuracy trade-offs. Our approach employs swift prompt selection to refine input data using a carefully provisioned model and introduces adaptive token merging (AToM) and adaptive layer dropping (ALD) for efficient prompt updates. AToM and ALD selectively skip data and model layers while preserving task-specific features during the learning of new tasks. Extensive experiments on multiple image classification datasets demonstrate REP's superior resource efficiency over state-of-the-art rehearsal-free CL methods.
comment: accepted to NeurIPS 2025
♻ ☆ CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding
Coral reefs are vital yet vulnerable ecosystems that require continuous monitoring to support conservation. While coral reef images provide essential information in coral monitoring, interpreting such images remains challenging due to the need for domain expertise. Visual Question Answering (VQA), powered by Large Vision-Language Models (LVLMs), has great potential in user-friendly interaction with coral reef images. However, applying VQA to coral imagery demands a dedicated dataset that addresses two key challenges: domain-specific annotations and multidimensional questions. In this work, we introduce CoralVQA, the first large-scale VQA dataset for coral reef analysis. It contains 12,805 real-world coral images from 67 coral genera collected from 3 oceans, along with 277,653 question-answer pairs that comprehensively assess ecological and health-related conditions. To construct this dataset, we develop a semi-automatic data construction pipeline in collaboration with marine biologists to ensure both scalability and professional-grade data quality. CoralVQA presents novel challenges and provides a comprehensive benchmark for studying vision-language reasoning in the context of coral reef images. By evaluating several state-of-the-art LVLMs, we reveal key limitations and opportunities. These insights form a foundation for future LVLM development, with a particular emphasis on supporting coral conservation efforts.
♻ ☆ HCT-QA: A Benchmark for Question Answering on Human-Centric Tables
Tabular data embedded within PDF files, web pages, and other document formats are prevalent across numerous sectors such as government, engineering, science, and business. These human-centric tables (HCTs) possess a unique combination of high business value, intricate layouts, limited operational power at scale, and sometimes serve as the only data source for critical insights. However, their complexity poses significant challenges to traditional data extraction, processing, and querying methods. While current solutions focus on transforming these tables into relational formats for SQL queries, they fall short in handling the diverse and complex layouts of HCTs and hence being amenable to querying. This paper describes HCT-QA, an extensive benchmark of HCTs, natural language queries, and related answers on thousands of tables. Our dataset includes 2,188 real-world HCTs with 9,835 QA pairs and 4,679 synthetic tables with 67.5K QA pairs. While HCTs can be potentially processed by different type of query engines, in this paper, we focus on Large Language Models as potential engines and assess their ability in processing and querying such tables.
♻ ☆ Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse
Amidst the rapid normalization of generative artificial intelligence (GAI), intelligent systems have come to dominate political discourse across information media. However, internalized political biases stemming from training data skews, human prejudice, and algorithmic flaws continue to plague this novel technology. This study employs a zero-shot classification approach to evaluate algorithmic political partisanship through a methodical combination of ideological alignment, topicality, response sentiment, and objectivity. A total of 1800 model responses across six mainstream large language models (LLMs) were individually input into four distinct fine-tuned classification algorithms, each responsible for computing one of the aforementioned metrics. The results show an amplified liberal-authoritarian alignment across the six LLMs evaluated, with notable instances of reasoning supersessions and canned refusals. The study subsequently highlights the psychological influences underpinning human-computer interactions and how intrinsic biases can permeate public discourse. The resulting distortion of the political landscape can ultimately manifest as conformity or polarization, depending on the region's pre-existing socio-political structures.
comment: 19 pages, 7 figures
♻ ☆ Ocean Wave Forecasting with Deep Learning as Alternative to Conventional Models
This study presents OceanCastNet (OCN), a machine learning approach for wave forecasting that incorporates wind and wave fields to predict significant wave height, mean wave period, and mean wave direction.We evaluate OCN's performance against the operational ECWAM model using two independent datasets: NDBC buoy and Jason-3 satellite observations. NDBC station validation indicates OCN performs better at 24 stations compared to ECWAM's 10 stations, and Jason-3 satellite validation confirms similar accuracy across 228-hour forecasts. OCN successfully captures wave patterns during extreme weather conditions, demonstrated through Typhoon Goni with prediction errors typically within $\pm$0.5 m. The approach also offers computational efficiency advantages. The results suggest that machine learning approaches can achieve performance comparable to conventional wave forecasting systems for operational wave prediction applications.
comment: Accepted manuscript. Final version published in Journal of Advances in Modeling Earth Systems: https://doi.org/10.1029/2025MS005285
♻ ☆ Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators
Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today's computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating low-resource languages, such as specialized tensor accelerator code still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three distinct hardware platforms, we demonstrate that Autocomp-optimized code runs 5.6x faster than the vendor-provided library (Gemmini), outperforms expert-level hand-tuned code by 1.9x (AWS Trainium), and achieves 3.8x higher performance than a machine learning-based cost model for GPUs (NVIDIA L40S). Additionally, we demonstrate that optimization schedules generated from Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.
comment: 10 pages + appendices
♻ ☆ Experience-Driven Exploration for Efficient API-Free AI Agents
Most existing software lacks accessible Application Programming Interfaces (APIs), requiring agents to operate solely through pixel-based Graphical User Interfaces (GUIs). In this API-free setting, large language model (LLM)-based agents face severe efficiency bottlenecks: limited to local visual experiences, they make myopic decisions and rely on inefficient trial-and-error, hindering both skill acquisition and long-term planning. To address these challenges, we propose KG-Agent, an experience-driven learning framework that structures an agent's raw pixel-level interactions into a persistent State-Action Knowledge Graph (SA-KG). KG-Agent overcomes inefficient exploration by linking functionally similar but visually distinct GUI states, forming a rich neighborhood of experience that enables the agent to generalize from a diverse set of historical strategies. To support long-horizon reasoning, we design a hybrid intrinsic reward mechanism based on the graph topology, combining a state value reward for exploiting known high-value pathways with a novelty reward that encourages targeted exploration. This approach decouples strategic planning from pure discovery, allowing the agent to effectively value setup actions with delayed gratification. We evaluate KG-Agent in two complex, open-ended GUI-based decision-making environments (Civilization V and Slay the Spire), demonstrating significant improvements in exploration efficiency and strategic depth over the state-of-the-art methods.
♻ ☆ FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs EMNLP
The accurate trust assessment of multimodal large language models (MLLMs) generated predictions, which can enable selective prediction and improve user confidence, is challenging due to the diverse multi-modal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs, that generates an uncertainty measure based on the equivalent and complementary input samplings. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA uses only input-output access of the model (black-box), and does not require ground truth (unsupervised). The experiments are conducted with various off-the-shelf multi-modal LLMs, on both visual and audio reasoning tasks. The proposed FESTA uncertainty estimate achieves significant improvement (33.3% relative improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) in selective prediction performance, based on area-under-receiver-operating-characteristic curve (AUROC) metric in detecting mispredictions. The code implementation is open-sourced.
comment: Accepted in the Findings of EMNLP, 2025
♻ ☆ Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning
We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method. Our code can be found on https://github.com/divelab/E2H-Reasoning.
♻ ☆ Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access NeurIPS 2025
A key advantage of Recurrent Neural Networks (RNNs) over Transformers is their linear computational and space complexity enables faster training and inference for long sequences. However, RNNs are fundamentally unable to randomly access historical context, and simply integrating attention mechanisms may undermine their efficiency advantages. To overcome this limitation, we propose Hierarchical Sparse Attention (HSA), a novel attention mechanism that enhances RNNs with long-range random access flexibility while preserving their merits in efficiency and length generalization. HSA divides inputs into chunks, selects the top-$k$ chunks and hierarchically aggregates information. The core innovation lies in learning token-to-chunk relevance based on fine-grained token-level information inside each chunk. This approach enhances the precision of chunk selection across both in-domain and out-of-domain context lengths. To make HSA efficient, we further introduce a hardware-aligned kernel design. By combining HSA with Mamba, we introduce RAMba, which achieves perfect accuracy in passkey retrieval across 64 million contexts despite pre-training on only 4K-length contexts, and significant improvements on various downstream tasks, with nearly constant memory footprint. These results show RAMba's huge potential in long-context modeling.
comment: Accepted to NeurIPS 2025
♻ ☆ Enhancing Sequential Model Performance with Squared Sigmoid TanH (SST) Activation Under Data Constraints
Activation functions enable neural networks to learn complex representations by introducing non-linearities. While feedforward models commonly use rectified linear units, sequential models like recurrent neural networks, long short-term memory (LSTMs) and gated recurrent units (GRUs) still rely on Sigmoid and TanH activation functions. However, these classical activation functions often struggle to model sparse patterns when trained on small sequential datasets to effectively capture temporal dependencies. To address this limitation, we propose squared Sigmoid TanH (SST) activation specifically tailored to enhance the learning capability of sequential models under data constraints. SST applies mathematical squaring to amplify differences between strong and weak activations as signals propagate over time, facilitating improved gradient flow and information filtering. We evaluate SST-powered LSTMs and GRUs for diverse applications, such as sign language recognition, regression, and time-series classification tasks, where the dataset is limited. Our experiments demonstrate that SST models consistently outperform RNN-based models with baseline activations, exhibiting improved test accuracy.
comment: 10 pages,9 figures
♻ ☆ Tool and Tutor? Experimental evidence from AI deployment in cancer diagnosis
Numerous countries globally face shortages of medical experts, deepening inequalities in access to healthcare. Artificial Intelligence (AI)-based diagnostic tools hold considerable promise to tackle this challenge by enabling even novices to deliver expert-level medical services. However, reliance on AI for task completion may hinder the learning required for novices to develop expertise. We thus explore whether AI-based diagnostic tools can be used to enhance not only performance but also learning in the context of lung cancer diagnosis. We examine the distinct effects of AI input during training (i.e., learning how to diagnose) versus in practice (i.e., completing diagnostic tasks) on novice medical professionals' performance. In two field experiments, 576 medical students were randomly assigned across conditions, manipulating the access to AI input during their training, during a test of their diagnostic capabilities, or both. During practice, participants diagnosed potential lung cancer cases using chest CT scans, and their diagnoses were evaluated against the ground truth obtained through histopathological examinations. Study 1 (N = 336) revealed that AI input in training alone improved human diagnostic accuracy by 3.2 percentage points over the control, while AI input during practice alone increased human accuracy by 7.9 percentage points. Combined deployment in both training and practice yielded an improvement of 13.7 percentage points--significantly exceeding either approach alone. Study 2 (N = 240) showed that AI input in practice alone improved accuracy in subsequent practice, unaided by AI, by 9.9 percentage points over the control. Even minimally informative AI input in training improved diagnostic accuracy by 5.3 percentage points over the control. These results reveal AI's dual role: As a tool, it could rapidly improve novices' performance.
♻ ☆ Beyond Autoregression: An Empirical Study of Diffusion Large Language Models for Code Generation
LLMs have become the mainstream approaches to code generation. Existing LLMs mainly employ autoregressive generation, i.e. generating code token-by-token from left to right. However, the underlying autoregressive generation has two limitations in code generation. First, autoregressive LLMs only generate a token at each step, showing low efficiency in practice. Second, programming is a non-sequential process involving back-and-forth editing, while autoregressive LLMs only employ the left-to-right generation order. These two intrinsic limitations hinder the further development of LLMs in code generation. Recently, diffusion LLMs have emerged as a promising alternative. Diffusion LLMs address the above limitations with two advances, including multi-token prediction (i.e. generating multiple tokens at each step) and flexible generation order (i.e. flexibly determining which positions to generate tokens). However, there is no systematic study exploring diffusion LLMs in code generation. To bridge the knowledge gap, we present the first empirical study of diffusion LLMs for code generation. Our study involves 9 representative diffusion LLMs and conduct experiments on 4 widely used benchmarks. Based on the results, we summarize the following findings. (1) Existing diffusion LLMs are competitive with autoregressive LLMs with similar sizes. (2) Diffusion LLMs have a stronger length extrapolation ability than autoregressive LLMs and perform better in long code understanding. (3) We explore factors impacting the effectiveness and efficiency of diffusion LLMs, and provide practical guidance. (4) We discuss several promising further directions to improve diffusion LLMs on code generation. We open-source all source code, data, and results to facilitate the following research. The code is publicly available at https://github.com/zhangyitonggg/dllm4code.
♻ ☆ Music Arena: Live Evaluation for Text-to-Music NeurIPS 2025
We present Music Arena, an open platform for scalable human preference evaluation of text-to-music (TTM) models. Soliciting human preferences via listening studies is the gold standard for evaluation in TTM, but these studies are expensive to conduct and difficult to compare, as study protocols may differ across systems. Moreover, human preferences might help researchers align their TTM systems or improve automatic evaluation metrics, but an open and renewable source of preferences does not currently exist. We aim to fill these gaps by offering *live* evaluation for TTM. In Music Arena, real-world users input text prompts of their choosing and compare outputs from two TTM systems, and their preferences are used to compile a leaderboard. While Music Arena follows recent evaluation trends in other AI domains, we also design it with key features tailored to music: an LLM-based routing system to navigate the heterogeneous type signatures of TTM systems, and the collection of *detailed* preferences including listening data and natural language feedback. We also propose a rolling data release policy with user privacy guarantees, providing a renewable source of preference data and increasing platform transparency. Through its standardized evaluation protocol, transparent data access policies, and music-specific features, Music Arena not only addresses key challenges in the TTM ecosystem but also demonstrates how live evaluation can be thoughtfully adapted to unique characteristics of specific AI domains. Music Arena is available at: https://music-arena.org . Preference data is available at: https://huggingface.co/music-arena .
comment: NeurIPS 2025 Creative AI Track
♻ ☆ PromptWise: Online Learning for Cost-Aware Prompt Assignment in Generative Models
The rapid advancement of generative AI has provided users with a wide range of well-trained models to address diverse prompts. When selecting a model for a given prompt, users should weigh not only its performance but also its service cost. However, existing model-selection methods typically emphasize performance while overlooking cost differences. In this paper, we introduce PromptWise, an online learning framework that assigns prompts to generative models in a cost-aware manner. PromptWise estimates prompt-model compatibility to select the least expensive model expected to deliver satisfactory outputs. Unlike standard contextual bandits that make a one-shot decision per prompt, PromptWise employs a cost-aware bandit structure that allows sequential model assignments per prompt to reduce total service cost. Through numerical experiments on tasks such as code generation and translation, we demonstrate that PromptWise can achieve performance comparable to baseline selection methods while incurring substantially lower costs. The code is available at: github.com/yannxiaoyanhu/PromptWise.
comment: 39 pages
♻ ☆ SST: Multi-Scale Hybrid Mamba-Transformer Experts for Time Series Forecasting CIKM 2025
Time series forecasting has made significant advances, including with Transformer-based models. The attention mechanism in Transformer effectively captures temporal dependencies by attending to all past inputs simultaneously. However, its quadratic complexity with respect to sequence length limits the scalability for long-range modeling. Recent state space models (SSMs) such as Mamba offer a promising alternative by achieving linear complexity without attention. Yet, Mamba compresses historical information into a fixed-size latent state, potentially causing information loss and limiting representational effectiveness. This raises a key research question: Can we design a hybrid Mamba-Transformer architecture that is both effective and efficient for time series forecasting? To address it, we adapt a hybrid Mamba-Transformer architecture Mambaformer, originally proposed for language modeling, to the time series domain. Preliminary experiments reveal that naively stacking Mamba and Transformer layers in Mambaformer is suboptimal for time series forecasting, due to an information interference problem. To mitigate this issue, we introduce a new time series decomposition strategy that separates time series into long-range patterns and short-range variations. Then we show that Mamba excels at capturing long-term structures, while Transformer is more effective at modeling short-term dynamics. Building on this insight, we propose State Space Transformer (SST), a multi-scale hybrid model with expert modules: a Mamba expert for long-range patterns and a Transformer expert for short-term variations. SST also employs a multi-scale patching mechanism to adaptively adjust time series resolution: low resolution for long-term patterns and high resolution for short-term variations. Experiments show that SST obtains SOTA performance with linear scalability. The code is at https://github.com/XiongxiaoXu/SST.
comment: CIKM 2025
♻ ☆ Federated Vision-Language-Recommendation with Personalized Fusion
Applying large pre-trained Vision-Language Models to recommendation is a burgeoning field, a direction we term Vision-Language-Recommendation (VLR). Bringing VLR to user-oriented on-device intelligence within a federated learning framework is a crucial step for enhancing user privacy and delivering personalized experiences. This paper introduces FedVLR, a federated VLR framework specially designed for user-specific personalized fusion of vision-language representations. At its core is a novel bi-level fusion mechanism: The server-side multi-view fusion module first generates a diverse set of pre-fused multimodal views. Subsequently, each client employs a user-specific mixture-of-expert mechanism to adaptively integrate these views based on individual user interaction history. This designed lightweight personalized fusion module provides an efficient solution to implement a federated VLR system. The effectiveness of our proposed FedVLR has been validated on seven benchmark datasets.
comment: 15 pages, 10 figures, 7 tables, conference
Machine Learning 55
♻ ☆ From Epilepsy Seizures Classification to Detection: A Deep Learning-based Approach for Raw EEG Signals
Epilepsy represents the most prevalent neurological disease in the world. One-third of people suffering from mesial temporal lobe epilepsy (MTLE) exhibit drug resistance, urging the need to develop new treatments. A key part in anti-seizure medication (ASM) development is the capability of detecting and quantifying epileptic seizures occurring in electroencephalogram (EEG) signals, which is crucial for treatment efficacy evaluation. In this study, we introduced a seizure detection pipeline based on deep learning models applied to raw EEG signals. This pipeline integrates: a new pre-processing technique which segments continuous raw EEG signals without prior distinction between seizure and seizure-free activities; a post-processing algorithm developed to reassemble EEG segments and allow the identification of seizures start/end; and finally, a new evaluation procedure based on a strict seizure events comparison between predicted and real labels. Models training have been performed using a data splitting strategy which addresses the potential for data leakage. We demonstrated the fundamental differences between a seizure classification and a seizure detection task and showed the differences in performance between the two tasks. Finally, we demonstrated the generalization capabilities across species of our best architecture, combining a Convolutional Neural Network and a Transformer encoder. The model was trained on animal EEGs and tested on human EEGs with a F1-score of 93% on a balanced Bonn dataset.
comment: 25 pages, 3 tables, 5 figures
♻ ☆ An Effective Flow-based Method for Positive-Unlabeled Learning: 2-HNC
In many scenarios of binary classification, only positive instances are provided in the training data, leaving the rest of the data unlabeled. This setup, known as positive-unlabeled (PU) learning, is addressed here with a network flow-based method which utilizes pairwise similarities between samples. The method we propose here, 2-HNC, leverages Hochbaum's Normalized Cut (HNC) and the set of solutions it provides by solving a parametric minimum cut problem. The set of solutions, that are nested partitions of the samples into two sets, correspond to varying tradeoff values between the two goals: high intra-similarity inside the sets and low inter-similarity between the two sets. This nested sequence is utilized here to deliver a ranking of unlabeled samples by their likelihood of being negative. Building on this insight, our method, 2-HNC, proceeds in two stages. The first stage generates this ranking without assuming any negative labels, using a problem formulation that is constrained only on positive labeled samples. The second stage augments the positive set with likely-negative samples and recomputes the classification. The final label prediction selects among all generated partitions in both stages, the one that delivers a positive class proportion, closest to a prior estimate of this quantity, which is assumed to be given. Extensive experiments across synthetic and real datasets show that 2-HNC yields strong performance and often surpasses existing state-of-the-art algorithms.
♻ ☆ Learning to Steer: Input-dependent Steering for Multimodal LLMs NeurIPS 2025
Steering has emerged as a practical approach to enable post-hoc guidance of LLMs towards enforcing a specific behavior. However, it remains largely underexplored for multimodal LLMs (MLLMs); furthermore, existing steering techniques, such as mean steering, rely on a single steering vector, applied independently of the input query. This paradigm faces limitations when the desired behavior is dependent on the example at hand. For example, a safe answer may consist in abstaining from answering when asked for an illegal activity, or may point to external resources or consultation with an expert when asked about medical advice. In this paper, we investigate a fine-grained steering that uses an input-specific linear shift. This shift is computed using contrastive input-specific prompting. However, the input-specific prompts required for this approach are not known at test time. Therefore, we propose to train a small auxiliary module to predict the input-specific steering vector. Our approach, dubbed as L2S (Learn-to-Steer), demonstrates that it reduces hallucinations and enforces safety in MLLMs, outperforming other static baselines. Our code is publicly available at https://jayneelparekh.github.io/learn-to-steer/
comment: NeurIPS 2025
♻ ☆ Multi-Step Reasoning with Large Language Models, a Survey
Large language models (LLMs) with billions of parameters exhibit in-context learning abilities, enabling few-shot learning on tasks that the model was not specifically trained for. Traditional models achieve breakthrough performance on language tasks, but do not perform well on basic reasoning benchmarks. However, a new in-context learning approach, Chain-of-thought, has demonstrated strong multi-step reasoning abilities on these benchmarks. The research on LLM reasoning abilities started with the question whether LLMs can solve grade school math word problems, and has expanded to other tasks in the past few years. This article reviews the field of multi-step reasoning with LLMs. We propose a taxonomy that identifies different ways to generate, evaluate, and control multi-step reasoning. We provide an in-depth coverage of core approaches and open problems, and we propose a research agenda for the near future. We find that multi-step reasoning approaches have progressed beyond math word problems, and can now successfully solve challenges in logic, combinatorial games, and robotics, sometimes by first generating code that is then executed by external tools. Many studies in multi-step methods use reinforcement learning for finetuning, external optimization loops, in-context reinforcement learning, and self-reflection.
comment: ACM Computing Surveys
♻ ☆ Spatial Knowledge Graph-Guided Multimodal Synthesis
Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. Our approach addresses this critical gap by providing a systematic framework for generating spatially coherent data. In this work, we introduce SKG2DATA, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2DATA employs an automated pipeline for constructing Spatial Knowledge Graph (SKG) that effectively captures human-like spatial cognition, including directional and distance relationships. These structured representations then serve as precise guidance for our integrated synthesis pipeline, where a diffusion model generates spatially-consistent images while a MLLM produces corresponding textual descriptions. The automated construction of SKG enables scalable generation of diverse yet realistic spatial configurations, overcoming the limitations of manual data collection and annotation. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly, albeit with a slight cost to their general capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence. Code is available at https://github.com/zjunlp/Knowledge2Data.
comment: IEEE/ACM Transactions on Audio, Speech and Language Processing
♻ ☆ Bayesian Additive Main Effects and Multiplicative Interaction Models using Tensor Regression for Multi-environmental Trials
We propose a Bayesian tensor regression model to accommodate the effect of multiple factors on phenotype prediction. We adopt a set of prior distributions that resolve identifiability issues that may arise between the parameters in the model. Further, we incorporate a spike-and-slab structure that identifies which interactions are relevant for inclusion in the linear predictor, even when they form a subset of the available variables. Simulation experiments show that our method outperforms previous related models and machine learning algorithms under different sample sizes and degrees of complexity. We further explore the applicability of our model by analysing real-world data related to wheat production across Ireland from 2010 to 2019. Our model performs competitively and overcomes key limitations found in other analogous approaches. Finally, we adapt a set of visualisations for the posterior distribution of the tensor effects that facilitate the identification of optimal interactions between the tensor variables, whilst accounting for the uncertainty in the posterior distribution.
♻ ☆ Multi-head Temporal Latent Attention NeurIPS 2025
While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency. Multi-head latent attention was recently developed to compress the KV cache into a low-rank latent space. This paper proposes Multi-head Temporal Latent Attention (MTLA), which further reduces the KV cache size along the temporal dimension, greatly lowering the memory footprint of self-attention inference. MTLA employs a hyper-network to dynamically merge temporally adjacent KV cache vectors. To address the mismatch between the compressed KV cache and processed sequence lengths, a stride-aware causal mask is proposed to ensure efficient parallel training and consistency with inference behaviour. Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation, demonstrate that MTLA achieves competitive performance compared to standard Multi-Head Attention (MHA), while greatly improving inference speed and GPU memory usage. For example, on a English-German speech translation task, MTLA achieves a 5.3x speedup and a reduction in GPU memory usage by a factor of 8.3 compared to MHA, while maintaining translation quality.
comment: Accepted by NeurIPS 2025
♻ ☆ Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization
Shampoo and its efficient variant, SOAP, employ structured second-moment estimations and have shown strong performance for training neural networks (NNs). In practice, however, Shampoo typically requires step-size grafting with Adam to be competitive, and SOAP mitigates this by applying Adam in Shampoo's eigenbasis -- at the cost of additional memory overhead from Adam in both methods. Prior analyses have largely relied on the Frobenius norm to motivate these estimation schemes. We instead recast their estimation procedures as covariance estimation under Kullback-Leibler (KL) divergence minimization, revealing a previously overlooked theoretical limitation and motivating principled redesigns. Building on this perspective, we develop $\textbf{KL-Shampoo}$ and $\textbf{KL-SOAP}$, practical schemes that match or exceed the performance of Shampoo and SOAP in NN pre-training while achieving SOAP-level per-iteration runtime. Notably, KL-Shampoo does not rely on Adam to attain competitive performance, eliminating the memory overhead introduced by Adam. Across our experiments, KL-Shampoo consistently outperforms SOAP, Shampoo, and even KL-SOAP, establishing the KL-based approach as a compelling foundation for designing structured methods in NN optimization.
comment: improved the main text, working in progress
♻ ☆ MistralBSM: Leveraging Mistral-7B for Vehicular Networks Misbehavior Detection
Malicious attacks on vehicular networks pose a serious threat to road safety as well as communication reliability. A major source of these threats stems from misbehaving vehicles within the network. To address this challenge, we propose a Large Language Model (LLM)-empowered Misbehavior Detection System (MDS) within an edge-cloud detection framework. Specifically, we fine-tune Mistral-7B, a compact and high-performing LLM, to detect misbehavior based on Basic Safety Messages (BSM) sequences as the edge component for real-time detection, while a larger LLM deployed in the cloud validates and reinforces the edge model's detection through a more comprehensive analysis. By updating only 0.012% of the model parameters, our model, which we named MistralBSM, achieves 98% accuracy in binary classification and 96% in multiclass classification on a selected set of attacks from VeReMi dataset, outperforming LLAMA2-7B and RoBERTa. Our results validate the potential of LLMs in MDS, showing a significant promise in strengthening vehicular network security to better ensure the safety of road users.
♻ ☆ Reinforcement Learning from Human Feedback
Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF -- both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics -- understudied research questions in synthetic data and evaluation -- and open questions for the field.
comment: 144 pages. Web-native version at https://rlhfbook.com/ Continually improving, latest version at website
♻ ☆ Variational Inference in Location-Scale Families: Exact Recovery of the Mean and Correlation Matrix
Given an intractable target density $p$, variational inference (VI) attempts to find the best approximation $q$ from a tractable family $Q$. This is typically done by minimizing the exclusive Kullback-Leibler divergence, $\text{KL}(q||p)$. In practice, $Q$ is not rich enough to contain $p$, and the approximation is misspecified even when it is a unique global minimizer of $\text{KL}(q||p)$. In this paper, we analyze the robustness of VI to these misspecifications when $p$ exhibits certain symmetries and $Q$ is a location-scale family that shares these symmetries. We prove strong guarantees for VI not only under mild regularity conditions but also in the face of severe misspecifications. Namely, we show that (i) VI recovers the mean of $p$ when $p$ exhibits an \textit{even} symmetry, and (ii) it recovers the correlation matrix of $p$ when in addition~$p$ exhibits an \textit{elliptical} symmetry. These guarantees hold for the mean even when $q$ is factorized and $p$ is not, and for the correlation matrix even when~$q$ and~$p$ behave differently in their tails. We analyze various regimes of Bayesian inference where these symmetries are useful idealizations, and we also investigate experimentally how VI behaves in their absence.
♻ ☆ A Free Probabilistic Framework for Denoising Diffusion Models: Entropy, Transport, and Reverse Processes
This paper develops a rigorous probabilistic framework that extends denoising diffusion models to the setting of noncommutative random variables. Building on Voiculescu's theory of free entropy and free Fisher information, we formulate diffusion and reverse processes governed by operator-valued stochastic dynamics whose spectral measures evolve by additive convolution. Using tools from free stochastic analysis -- including a Malliavin calculus and a Clark--Ocone representation -- we derive the reverse-time stochastic differential equation driven by the conjugate variable, the analogue of the classical score function. The resulting dynamics admit a gradient-flow structure in the noncommutative Wasserstein space, establishing an information-geometric link between entropy production, transport, and deconvolution. We further construct a variational scheme analogous to the Jordan--Kinderlehrer--Otto (JKO) formulation and prove convergence toward the semicircular equilibrium. The framework provides functional inequalities (free logarithmic Sobolev, Talagrand, and HWI) that quantify entropy dissipation and Wasserstein contraction. These results unify diffusion-based generative modeling with the geometry of operator-valued information, offering a mathematical foundation for generative learning on structured and high-dimensional data.
♻ ☆ TiRex: Zero-Shot Forecasting Across Long and Short Horizons with Enhanced In-Context Learning NeurIPS 2025
In-context learning, the ability of large language models to perform tasks using only examples provided in the prompt, has recently been adapted for time series forecasting. This paradigm enables zero-shot prediction, where past values serve as context for forecasting future values, making powerful forecasting tools accessible to non-experts and increasing the performance when training data are scarce. Most existing zero-shot forecasting approaches rely on transformer architectures, which, despite their success in language, often fall short of expectations in time series forecasting, where recurrent models like LSTMs frequently have the edge. Conversely, while LSTMs are well-suited for time series modeling due to their state-tracking capabilities, they lack strong in-context learning abilities. We introduce TiRex that closes this gap by leveraging xLSTM, an enhanced LSTM with competitive in-context learning skills. Unlike transformers, state-space models, or parallelizable RNNs such as RWKV, TiRex retains state-tracking, a critical property for long-horizon forecasting. To further facilitate its state-tracking ability, we propose a training-time masking strategy called CPM. TiRex sets a new state of the art in zero-shot time series forecasting on the HuggingFace benchmarks GiftEval and Chronos-ZS, outperforming significantly larger models including TabPFN-TS (Prior Labs), Chronos Bolt (Amazon), TimesFM (Google), and Moirai (Salesforce) across both short- and long-term forecasts.
comment: presented at NeurIPS 2025
♻ ☆ How to Train Your LLM Web Agent: A Statistical Diagnosis
LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
♻ ☆ Solving Inequality Proofs with Large Language Models NeurIPS 2025
Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.
comment: 50 pages, 24 figures, accepted as a Spotlight at NeurIPS 2025
♻ ☆ Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning
Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory, significantly limiting real-world deployment. Recently, zeroth-order (ZO) optimization stood out as a promising memory-efficient training paradigm, avoiding backward passes and relying solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO method lags far behind FO method in both convergence speed and accuracy. To bridge the gap, we introduce a novel layer-wise divergence analysis that uncovers the distinct update pattern of FO and ZO optimization. Aiming to resemble the learning capacity of FO method from the findings, we propose Divergence-driven Zeroth-Order (DiZO) optimization. DiZO conducts divergence-driven layer adaptation by incorporating projections to ZO updates, generating diverse-magnitude updates precisely scaled to layer-wise individual optimization needs. Our results demonstrate that DiZO significantly reduces the needed iterations for convergence without sacrificing throughput, cutting training GPU hours by up to 48\% on various datasets. Moreover, DiZO consistently outperforms the representative ZO baselines in fine-tuning RoBERTa-large, OPT-series, and Llama-series on downstream tasks and, in some cases, even surpasses memory-intensive FO fine-tuning. Our code is released at https://github.com/Skilteee/DiZO.
♻ ☆ Differential privacy guarantees of Markov chain Monte Carlo algorithms
This paper aims to provide differential privacy (DP) guarantees for Markov chain Monte Carlo (MCMC) algorithms. In a first part, we establish DP guarantees on samples output by MCMC algorithms as well as Monte Carlo estimators associated with these methods under assumptions on the convergence properties of the underlying Markov chain. In particular, our results highlight the critical condition of ensuring the target distribution is differentially private itself. In a second part, we specialise our analysis to the unadjusted Langevin algorithm and stochastic gradient Langevin dynamics and establish guarantees on their (R\'enyi) DP. To this end, we develop a novel methodology based on Girsanov's theorem combined with a perturbation trick to obtain bounds for an unbounded domain and in a non-convex setting. We establish: (i) uniform in $n$ privacy guarantees when the state of the chain after $n$ iterations is released, (ii) bounds on the privacy of the entire chain trajectory. These findings provide concrete guidelines for privacy-preserving MCMC.
♻ ☆ Over-squashing in Spatiotemporal Graph Neural Networks NeurIPS 2025
Graph Neural Networks (GNNs) have achieved remarkable success across various domains. However, recent theoretical advances have identified fundamental limitations in their information propagation capabilities, such as over-squashing, where distant nodes fail to effectively exchange information. While extensively studied in static contexts, this issue remains unexplored in Spatiotemporal GNNs (STGNNs), which process sequences associated with graph nodes. Nonetheless, the temporal dimension amplifies this challenge by increasing the information that must be propagated. In this work, we formalize the spatiotemporal over-squashing problem and demonstrate its distinct characteristics compared to the static case. Our analysis reveals that, counterintuitively, convolutional STGNNs favor information propagation from points temporally distant rather than close in time. Moreover, we prove that architectures that follow either time-and-space or time-then-space processing paradigms are equally affected by this phenomenon, providing theoretical justification for computationally efficient implementations. We validate our findings on synthetic and real-world datasets, providing deeper insights into their operational dynamics and principled guidance for more effective designs.
comment: Accepted at NeurIPS 2025
♻ ☆ Understanding Endogenous Data Drift in Adaptive Models with Recourse-Seeking Users
Deep learning models are widely used in decision-making and recommendation systems, where they typically rely on the assumption of a static data distribution between training and deployment. However, real-world deployment environments often violate this assumption. Users who receive negative outcomes may adapt their features to meet model criteria, i.e., recourse action. These adaptive behaviors create shifts in the data distribution and when models are retrained on this shifted data, a feedback loop emerges: user behavior influences the model, and the updated model in turn reshapes future user behavior. Despite its importance, this bidirectional interaction between users and models has received limited attention. In this work, we develop a general framework to model user strategic behaviors and their interactions with decision-making systems under resource constraints and competitive dynamics. Both the theoretical and empirical analyses show that user recourse behavior tends to push logistic and MLP models toward increasingly higher decision standards, resulting in higher recourse costs and less reliable recourse actions over time. To mitigate these challenges, we propose two methods--Fair-top-k and Dynamic Continual Learning (DCL)--which significantly reduce recourse cost and improve model robustness. Our findings draw connections to economic theories, highlighting how algorithmic decision-making can unintentionally reinforce a higher standard and generate endogenous barriers to entry.
comment: 13 pages,4 figures, 3 tables
♻ ☆ Exploring the limits of strong membership inference attacks on large language models NeurIPS 2025
State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has either relied on weaker attacks that avoid training references (e.g., fine-tuning attacks), or on stronger attacks applied to small models and datasets. However, weaker attacks have been shown to be brittle and insights from strong attacks in simplified settings do not translate to today's LLMs. These challenges prompt an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA--one of the strongest MIAs--to GPT-2 architectures ranging from 10M to 1B parameters, training references on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in four key ways. While (1) strong MIAs can succeed on pre-trained LLMs, (2) their effectiveness, remains limited (e.g., AUC<0.7) in practical settings. (3) Even when strong MIAs achieve better-than-random AUC, aggregate metrics can conceal substantial per-sample MIA decision instability: due to training randomness, many decisions are so unstable that they are statistically indistinguishable from a coin flip. Finally, (4) the relationship between MIA success and related LLM privacy metrics is not as straightforward as prior work has suggested.
comment: NeurIPS 2025
♻ ☆ Optimizing Token Choice for Code Watermarking: An RL Approach
Protecting intellectual property on LLM-generated code necessitates effective watermarking systems that can operate within code's highly structured, syntactically constrained nature. In this work, we introduce CodeTracer, an innovative adaptive code watermarking framework underpinned by a novel reinforcement learning training paradigm. At its core, CodeTracer features a policy-driven approach that utilizes a parameterized model to intelligently bias token choices during next-token prediction. This strategy ensures that embedded watermarks maintain code functionality while exhibiting subtle yet statistically detectable deviations from typical token distributions. To facilitate policy learning, we devise a comprehensive reward system that seamlessly integrates execution feedback with watermark embedding signals, balancing process-level and outcome-level rewards. Additionally, we employ Gumbel Top-k reparameterization to enable gradient-based optimization of discrete watermarking decisions. Extensive comparative evaluations demonstrate CodeTracer's significant superiority over state-of-the-art baselines in both watermark detectability and the preservation of generated code's functionality.
comment: 18 pages, 3 figures
♻ ☆ Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models
Although multimodal large language models (MLLMs) exhibit remarkable reasoning capabilities on complex multimodal understanding tasks, they still suffer from the notorious hallucination issue: generating outputs misaligned with obvious visual or factual evidence. Currently, training-based solutions, like direct preference optimization (DPO), leverage paired preference data to suppress hallucinations. However, they risk sacrificing general reasoning capabilities due to the likelihood displacement. Meanwhile, training-free solutions, like contrastive decoding, achieve this goal by subtracting the estimated hallucination pattern from a distorted input. Yet, these handcrafted perturbations (e.g., add noise to images) may poorly capture authentic hallucination patterns. To avoid these weaknesses of existing methods, and realize robust hallucination mitigation (i.e., maintaining general reasoning performance), we propose a novel framework: Decoupling Contrastive Decoding (DCD). Specifically, DCD decouples the learning of positive and negative samples in preference datasets, and trains separate positive and negative image projections within the MLLM. The negative projection implicitly models real hallucination patterns, which enables vision-aware negative images in the contrastive decoding inference stage. Our DCD alleviates likelihood displacement by avoiding pairwise optimization and generalizes robustly without handcrafted degradation. Extensive ablations across hallucination benchmarks and general reasoning tasks demonstrate the effectiveness of DCD, i.e., it matches DPO's hallucination suppression while preserving general capabilities and outperforms the handcrafted contrastive decoding methods.
comment: 17 pages, 4 figures
♻ ☆ Physics-Informed Extreme Learning Machine (PIELM): Opportunities and Challenges
We are delighted to see the recent development of physics-informed extreme learning machine (PIELM) for its higher computational efficiency and accuracy compared to other physics-informed machine learning (PIML) paradigms. Since a comprehensive summary or review of PIELM is currently unavailable, we would like to take this opportunity to share our perspectives and experiences on this promising research direction. We can see that many efforts have been made to solve ordinary/partial differential equations (ODEs/PDEs) characterized by sharp gradients, nonlinearities, high-frequency behavior, hard constraints, uncertainty, multiphysics coupling, and interpretability. Despite these encouraging successes, many pressing challenges remain to be tackled, which also provides opportunities to develop more robust, interpretable, and generalizable PIELM frameworks for scientific and engineering applications.
♻ ☆ ORFit: One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares
While large machine learning models have shown remarkable performance in various domains, their training typically requires iterating for many passes over the training data. However, due to computational and memory constraints and potential privacy concerns, storing and accessing all the data is impractical in many real-world scenarios where the data arrives in a stream. In this paper, we investigate the problem of one-pass learning, in which a model is trained on sequentially arriving data without retraining on previous datapoints. Motivated by the demonstrated effectiveness of overparameterized models and the phenomenon of benign overfitting, we propose Orthogonal Recursive Fitting (ORFit), an algorithm for one-pass learning which seeks to perfectly fit each new datapoint while minimally altering the predictions on previous datapoints. ORFit updates the parameters in a direction orthogonal to past gradients, similar to orthogonal gradient descent (OGD) in continual learning. We show that, interestingly, ORFit's update leads to an operation similar to the recursive least-squares (RLS) algorithm in adaptive filtering but with significantly improved memory and computational efficiency, i.e., linear, instead of quadratic, in the number of parameters. To further reduce memory usage, we leverage the structure of the streaming data via an incremental principal component analysis (IPCA). We show that using the principal components is minimax optimal, i.e., it minimizes the worst-case forgetting of previous predictions for unknown future updates. Further, we prove that, for overparameterized linear models, the parameter vector obtained by ORFit matches what the standard multi-pass stochastic gradient descent (SGD) would converge to. Finally, we extend our results to the nonlinear setting for highly overparameterized models, relevant for deep learning.
comment: Journal extension of v1: Y. Min, K, Ahn, N. Azizan, "One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares," IEEE Conference on Decision and Control, 2022
♻ ☆ Co-MTP: A Cooperative Trajectory Prediction Framework with Multi-Temporal Fusion for Autonomous Driving ICRA 2025
Vehicle-to-everything technologies (V2X) have become an ideal paradigm to extend the perception range and see through the occlusion. Exiting efforts focus on single-frame cooperative perception, however, how to capture the temporal cue between frames with V2X to facilitate the prediction task even the planning task is still underexplored. In this paper, we introduce the Co-MTP, a general cooperative trajectory prediction framework with multi-temporal fusion for autonomous driving, which leverages the V2X system to fully capture the interaction among agents in both history and future domains to benefit the planning. In the history domain, V2X can complement the incomplete history trajectory in single-vehicle perception, and we design a heterogeneous graph transformer to learn the fusion of the history feature from multiple agents and capture the history interaction. Moreover, the goal of prediction is to support future planning. Thus, in the future domain, V2X can provide the prediction results of surrounding objects, and we further extend the graph transformer to capture the future interaction among the ego planning and the other vehicles' intentions and obtain the final future scenario state under a certain planning action. We evaluate the Co-MTP framework on the real-world dataset V2X-Seq, and the results show that Co-MTP achieves state-of-the-art performance and that both history and future fusion can greatly benefit prediction.
comment: 8 pages, 3 figures, ICRA 2025
♻ ☆ Gymnasium: A Standard Interface for Reinforcement Learning Environments NeurIPS
Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. However, despite its promise, RL research is often hindered by the lack of standardization in environment and algorithm implementations. This makes it difficult for researchers to compare and build upon each other's work, slowing down progress in the field. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. Gymnasium is available online at https://github.com/Farama-Foundation/Gymnasium
comment: Accepted at NeurIPS Datasets and Benchmarks 2025
♻ ☆ Evaluating Federated Learning for At-Risk Student Prediction: A Comparative Analysis of Model Complexity and Data Balancing
This study proposes and validates a Federated Learning (FL) framework to proactively identify at-risk students while preserving data privacy. Persistently high dropout rates in distance education remain a pressing institutional challenge. Using the large-scale OULAD dataset, we simulate a privacy-centric scenario where models are trained on early academic performance and digital engagement patterns. Our work investigates the practical trade-offs between model complexity (Logistic Regression vs. a Deep Neural Network) and the impact of local data balancing. The resulting federated model achieves strong predictive power (ROC AUC approximately 85%), demonstrating that FL is a practical and scalable solution for early-warning systems that inherently respects student data sovereignty.
comment: This article has been prepared to be submitted to the Fundamenta Informaticae Journal
♻ ☆ Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
comment: Preprint
♻ ☆ Hyper-Transforming Latent Diffusion Models
We introduce a novel generative framework for functions by integrating Implicit Neural Representations (INRs) and Transformer-based hypernetworks into latent variable models. Unlike prior approaches that rely on MLP-based hypernetworks with scalability limitations, our method employs a Transformer-based decoder to generate INR parameters from latent variables, addressing both representation capacity and computational efficiency. Our framework extends latent diffusion models (LDMs) to INR generation by replacing standard decoders with a Transformer-based hypernetwork, which can be trained either from scratch or via hyper-transforming: a strategy that fine-tunes only the decoder while freezing the pre-trained latent space. This enables efficient adaptation of existing generative models to INR-based representations without requiring full retraining. We validate our approach across multiple modalities, demonstrating improved scalability, expressiveness, and generalization over existing INR-based generative models. Our findings establish a unified and flexible framework for learning structured function representations.
♻ ☆ REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning NeurIPS 2025
Recent rehearsal-free continual learning (CL) methods guided by prompts achieve strong performance on vision tasks with non-stationary data but remain resource-intensive, hindering real-world edge deployment. We introduce resource-efficient prompting (REP), which improves the computational and memory efficiency of prompt-based rehearsal-free continual learning methods while minimizing accuracy trade-offs. Our approach employs swift prompt selection to refine input data using a carefully provisioned model and introduces adaptive token merging (AToM) and adaptive layer dropping (ALD) for efficient prompt updates. AToM and ALD selectively skip data and model layers while preserving task-specific features during the learning of new tasks. Extensive experiments on multiple image classification datasets demonstrate REP's superior resource efficiency over state-of-the-art rehearsal-free CL methods.
comment: accepted to NeurIPS 2025
♻ ☆ Computational Budget Should Be Considered in Data Selection
Data selection improves computational efficiency by choosing informative subsets of training samples. However, existing methods ignore the compute budget, treating data selection and importance evaluation independently of compute budget constraints. Yet empirical studies show no algorithm can consistently outperform others (or even random selection) across varying budgets. We therefore argue that compute budget must be integral to data-selection strategies, since different budgets impose distinct requirements on data quantity, quality, and distribution for effective training. To this end, we propose a novel Computational budget-Aware Data Selection (CADS) method and naturally formulate it into a bilevel optimization framework, where the inner loop trains the model within the constraints of the computational budget on some selected subset of training data, while the outer loop optimizes data selection based on model evaluation. Our technical contributions lie in addressing two main challenges in solving this bilevel optimization problem: the expensive Hessian matrix estimation for outer-loop gradients and the computational burden of achieving inner-loop optimality during iterations. To solve the first issue, we propose a probabilistic reparameterization strategy and compute the gradient using a Hessian-free policy gradient estimator. To address the second challenge, we transform the inner optimization problem into a penalty term in the outer objective, further discovering that we only need to estimate the minimum of a one-dimensional loss to calculate the gradient, significantly improving efficiency. Extensive experiments show that our method achieves performance gains of up to 14.42% over baselines in vision and language benchmarks.
♻ ☆ Let's Grow an Unbiased Community: Guiding the Fairness of Graphs via New Links
Graph Neural Networks (GNNs) have achieved remarkable success across diverse applications. However, due to the biases in the graph structures, graph neural networks face significant challenges in fairness. Although the original user graph structure is generally biased, it is promising to guide these existing structures toward unbiased ones by introducing new links. The fairness guidance via new links could foster unbiased communities, thereby enhancing fairness in downstream applications. To address this issue, we propose a novel framework named FairGuide. Specifically, to ensure fairness in downstream tasks trained on fairness-guided graphs, we introduce a differentiable community detection task as a pseudo downstream task. Our theoretical analysis further demonstrates that optimizing fairness within this pseudo task effectively enhances structural fairness, promoting fairness generalization across diverse downstream applications. Moreover, FairGuide employs an effective strategy which leverages meta-gradients derived from the fairness-guidance objective to identify new links that significantly enhance structural fairness. Extensive experimental results demonstrate the effectiveness and generalizability of our proposed method across a variety of graph-based fairness tasks.
♻ ☆ Ocean Wave Forecasting with Deep Learning as Alternative to Conventional Models
This study presents OceanCastNet (OCN), a machine learning approach for wave forecasting that incorporates wind and wave fields to predict significant wave height, mean wave period, and mean wave direction.We evaluate OCN's performance against the operational ECWAM model using two independent datasets: NDBC buoy and Jason-3 satellite observations. NDBC station validation indicates OCN performs better at 24 stations compared to ECWAM's 10 stations, and Jason-3 satellite validation confirms similar accuracy across 228-hour forecasts. OCN successfully captures wave patterns during extreme weather conditions, demonstrated through Typhoon Goni with prediction errors typically within $\pm$0.5 m. The approach also offers computational efficiency advantages. The results suggest that machine learning approaches can achieve performance comparable to conventional wave forecasting systems for operational wave prediction applications.
comment: Accepted manuscript. Final version published in Journal of Advances in Modeling Earth Systems: https://doi.org/10.1029/2025MS005285
♻ ☆ Information-Theoretic Framework for Understanding Modern Machine-Learning
We introduce an information-theoretic framework that views learning as universal prediction under log loss, characterized through regret bounds. Central to the framework is an effective notion of architecture-based model complexity, defined by the probability mass or volume of models in the vicinity of the data-generating process, or its projection on the model class. This volume is related to spectral properties of the expected Hessian or the Fisher Information Matrix, leading to tractable approximations. We argue that successful architectures possess a broad complexity range, enabling learning in highly over-parameterized model classes. The framework sheds light on the role of inductive biases, the effectiveness of stochastic gradient descent, and phenomena such as flat minima. It unifies online, batch, supervised, and generative settings, and applies across the stochastic-realizable and agnostic regimes. Moreover, it provides insights into the success of modern machine-learning architectures, such as deep neural networks and transformers, suggesting that their broad complexity range naturally arises from their layered structure. These insights open the door to the design of alternative architectures with potentially comparable or even superior performance.
♻ ☆ When Models Don't Collapse: On the Consistency of Iterative MLE
The widespread use of generative models has created a feedback loop, in which each generation of models is trained on data partially produced by its predecessors. This process has raised concerns about model collapse: A critical degradation in performance caused by repeated training on synthetic data. However, different analyses in the literature have reached different conclusions as to the severity of model collapse. As such, it remains unclear how concerning this phenomenon is, and under which assumptions it can be avoided. To address this, we theoretically study model collapse for maximum likelihood estimation (MLE), in a natural setting where synthetic data is gradually added to the original data set. Under standard assumptions (similar to those long used for proving asymptotic consistency and normality of MLE), we establish non-asymptotic bounds showing that collapse can be avoided even as the fraction of real data vanishes. On the other hand, we prove that some assumptions (beyond MLE consistency) are indeed necessary: Without them, model collapse can occur arbitrarily quickly, even when the original data is still present in the training set. To the best of our knowledge, these are the first rigorous examples of iterative generative modeling with accumulating data that rapidly leads to model collapse.
♻ ☆ Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators
Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today's computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating low-resource languages, such as specialized tensor accelerator code still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three distinct hardware platforms, we demonstrate that Autocomp-optimized code runs 5.6x faster than the vendor-provided library (Gemmini), outperforms expert-level hand-tuned code by 1.9x (AWS Trainium), and achieves 3.8x higher performance than a machine learning-based cost model for GPUs (NVIDIA L40S). Additionally, we demonstrate that optimization schedules generated from Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.
comment: 10 pages + appendices
♻ ☆ FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs EMNLP
The accurate trust assessment of multimodal large language models (MLLMs) generated predictions, which can enable selective prediction and improve user confidence, is challenging due to the diverse multi-modal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs, that generates an uncertainty measure based on the equivalent and complementary input samplings. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA uses only input-output access of the model (black-box), and does not require ground truth (unsupervised). The experiments are conducted with various off-the-shelf multi-modal LLMs, on both visual and audio reasoning tasks. The proposed FESTA uncertainty estimate achieves significant improvement (33.3% relative improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) in selective prediction performance, based on area-under-receiver-operating-characteristic curve (AUROC) metric in detecting mispredictions. The code implementation is open-sourced.
comment: Accepted in the Findings of EMNLP, 2025
♻ ☆ Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning
We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method. Our code can be found on https://github.com/divelab/E2H-Reasoning.
♻ ☆ Disentangled Lottery Tickets: Identifying and Assembling Core and Specialist Subnetworks
The Lottery Ticket Hypothesis (LTH) suggests that within large neural networks, there exist sparse, trainable "winning tickets" capable of matching the performance of the full model, but identifying them through Iterative Magnitude Pruning (IMP) is computationally expensive. Recent work introduced COLT, an accelerator that discovers a "consensus" subnetwork by intersecting masks from models trained on disjoint data partitions; however, this approach discards all non-overlapping weights, assuming they are unimportant. This paper challenges that assumption and proposes the Disentangled Lottery Ticket (DiLT) Hypothesis, which posits that the intersection mask represents a universal, task-agnostic "core" subnetwork, while the non-overlapping difference masks capture specialized, task-specific "specialist" subnetworks. A framework is developed to identify and analyze these components using the Gromov-Wasserstein (GW) distance to quantify functional similarity between layer representations and reveal modular structures through spectral clustering. Experiments on ImageNet and fine-grained datasets such as Stanford Cars, using ResNet and Vision Transformer architectures, show that the "core" ticket provides superior transfer learning performance, the "specialist" tickets retain domain-specific features enabling modular assembly, and the full re-assembled "union" ticket outperforms COLT - demonstrating that non-consensus weights play a critical functional role. This work reframes pruning as a process for discovering modular, disentangled subnetworks rather than merely compressing models.
♻ ☆ Characterization and Learning of Causal Graphs from Hard Interventions NeurIPS 2025
A fundamental challenge in the empirical sciences involves uncovering causal structure through observation and experimentation. Causal discovery entails linking the conditional independence (CI) invariances in observational data to their corresponding graphical constraints via d-separation. In this paper, we consider a general setting where we have access to data from multiple experimental distributions resulting from hard interventions, as well as potentially from an observational distribution. By comparing different interventional distributions, we propose a set of graphical constraints that are fundamentally linked to Pearl's do-calculus within the framework of hard interventions. These graphical constraints associate each graphical structure with a set of interventional distributions that are consistent with the rules of do-calculus. We characterize the interventional equivalence class of causal graphs with latent variables and introduce a graphical representation that can be used to determine whether two causal graphs are interventionally equivalent, i.e., whether they are associated with the same family of hard interventional distributions, where the elements of the family are indistinguishable using the invariances from do-calculus. We also propose a learning algorithm to integrate multiple datasets from hard interventions, introducing new orientation rules. The learning objective is a tuple of augmented graphs which entails a set of causal graphs. We also prove the soundness of the proposed algorithm.
comment: Accepted at NeurIPS 2025
♻ ☆ Lyapunov Neural ODE State-Feedback Control Policies
Deep neural networks are increasingly used as an effective parameterization of control policies in various learning-based control paradigms. For continuous-time optimal control problems (OCPs), which are central to many decision-making tasks, control policy learning can be cast as a neural ordinary differential equation (NODE) problem wherein state and control constraints are naturally accommodated. This paper presents a NODE approach to solving continuous-time OCPs for the case of stabilizing a known constrained nonlinear system around a target state. The approach, termed Lyapunov-NODE control (L-NODEC), uses a novel Lyapunov loss formulation that incorporates an exponentially-stabilizing control Lyapunov function to learn a state-feedback neural control policy, bridging the gap of solving continuous-time OCPs via NODEs with stability guarantees. The proposed Lyapunov loss allows L-NODEC to guarantee exponential stability of the controlled system, as well as its adversarial robustness to perturbations to the initial state. The performance of L-NODEC is illustrated in two problems, including a dose delivery problem in plasma medicine. In both cases, L-NODEC effectively stabilizes the controlled system around the target state despite perturbations to the initial state and reduces the inference time necessary to reach the target.
♻ ☆ FairAIED: Navigating Fairness, Bias, and Ethics in Educational AI Applications
The integration of AI in education holds immense potential for personalizing learning experiences and transforming instructional practices. However, AI systems can inadvertently encode and amplify biases present in educational data, leading to unfair or discriminatory outcomes. As researchers have sought to understand and mitigate these biases, a growing body of work has emerged examining fairness in educational AI. These studies, though expanding rapidly, remain fragmented due to differing assumptions, methodologies, and application contexts. Moreover, existing surveys either focus on algorithmic fairness without an educational setting or emphasize educational methods while overlooking fairness. To this end, this survey provides a comprehensive systematic review of algorithmic fairness within educational AI, explicitly bridging the gap between technical fairness research and educational applications. We integrate multiple dimensions, including bias sources, fairness definitions, mitigation strategies, evaluation resources, and ethical considerations, into a harmonized, education-centered framework. In addition, we explicitly examine practical challenges such as censored or partially observed learning outcomes and the persistent difficulty in quantifying and managing the trade-off between fairness and predictive utility, enhancing the applicability of fairness frameworks to real-world educational AI systems. Finally, we outline an emerging pathway toward fair AI-driven education and by situating these technologies and practical insights within broader educational and ethical contexts, this review establishes a comprehensive foundation for advancing fairness, accountability, and inclusivity in the field of AI education.
♻ ☆ Enhancing Sequential Model Performance with Squared Sigmoid TanH (SST) Activation Under Data Constraints
Activation functions enable neural networks to learn complex representations by introducing non-linearities. While feedforward models commonly use rectified linear units, sequential models like recurrent neural networks, long short-term memory (LSTMs) and gated recurrent units (GRUs) still rely on Sigmoid and TanH activation functions. However, these classical activation functions often struggle to model sparse patterns when trained on small sequential datasets to effectively capture temporal dependencies. To address this limitation, we propose squared Sigmoid TanH (SST) activation specifically tailored to enhance the learning capability of sequential models under data constraints. SST applies mathematical squaring to amplify differences between strong and weak activations as signals propagate over time, facilitating improved gradient flow and information filtering. We evaluate SST-powered LSTMs and GRUs for diverse applications, such as sign language recognition, regression, and time-series classification tasks, where the dataset is limited. Our experiments demonstrate that SST models consistently outperform RNN-based models with baseline activations, exhibiting improved test accuracy.
comment: 10 pages,9 figures
♻ ☆ APALU: A Trainable, Adaptive Activation Function for Deep Learning Networks
Activation function is a pivotal component of deep learning, facilitating the extraction of intricate data patterns. While classical activation functions like ReLU and its variants are extensively utilized, their static nature and simplicity, despite being advantageous, often limit their effectiveness in specialized tasks. The trainable activation functions also struggle sometimes to adapt to the unique characteristics of the data. Addressing these limitations, we introduce a novel trainable activation function, adaptive piecewise approximated activation linear unit (APALU), to enhance the learning performance of deep learning across a broad range of tasks. It presents a unique set of features that enable it to maintain stability and efficiency in the learning process while adapting to complex data representations. Experiments reveal significant improvements over widely used activation functions for different tasks. In image classification, APALU increases MobileNet and GoogleNet accuracy by 0.37% and 0.04%, respectively, on the CIFAR10 dataset. In anomaly detection, it improves the average area under the curve of One-CLASS Deep SVDD by 0.8% on the MNIST dataset, 1.81% and 1.11% improvements with DifferNet, and knowledge distillation, respectively, on the MVTech dataset. Notably, APALU achieves 100% accuracy on a sign language recognition task with a limited dataset. For regression tasks, APALU enhances the performance of deep neural networks and recurrent neural networks on different datasets. These improvements highlight the robustness and adaptability of APALU across diverse deep-learning applications.
comment: 9 pages, 4 figures
♻ ☆ Tool and Tutor? Experimental evidence from AI deployment in cancer diagnosis
Numerous countries globally face shortages of medical experts, deepening inequalities in access to healthcare. Artificial Intelligence (AI)-based diagnostic tools hold considerable promise to tackle this challenge by enabling even novices to deliver expert-level medical services. However, reliance on AI for task completion may hinder the learning required for novices to develop expertise. We thus explore whether AI-based diagnostic tools can be used to enhance not only performance but also learning in the context of lung cancer diagnosis. We examine the distinct effects of AI input during training (i.e., learning how to diagnose) versus in practice (i.e., completing diagnostic tasks) on novice medical professionals' performance. In two field experiments, 576 medical students were randomly assigned across conditions, manipulating the access to AI input during their training, during a test of their diagnostic capabilities, or both. During practice, participants diagnosed potential lung cancer cases using chest CT scans, and their diagnoses were evaluated against the ground truth obtained through histopathological examinations. Study 1 (N = 336) revealed that AI input in training alone improved human diagnostic accuracy by 3.2 percentage points over the control, while AI input during practice alone increased human accuracy by 7.9 percentage points. Combined deployment in both training and practice yielded an improvement of 13.7 percentage points--significantly exceeding either approach alone. Study 2 (N = 240) showed that AI input in practice alone improved accuracy in subsequent practice, unaided by AI, by 9.9 percentage points over the control. Even minimally informative AI input in training improved diagnostic accuracy by 5.3 percentage points over the control. These results reveal AI's dual role: As a tool, it could rapidly improve novices' performance.
♻ ☆ Distributionally Robust Optimization with Adversarial Data Contamination
Distributionally Robust Optimization (DRO) provides a framework for decision-making under distributional uncertainty, yet its effectiveness can be compromised by outliers in the training data. This paper introduces a principled approach to simultaneously address both challenges. We focus on optimizing Wasserstein-1 DRO objectives for generalized linear models with convex Lipschitz loss functions, where an $\epsilon$-fraction of the training data is adversarially corrupted. Our primary contribution lies in a novel modeling framework that integrates robustness against training data contamination with robustness against distributional shifts, alongside an efficient algorithm inspired by robust statistics to solve the resulting optimization problem. We prove that our method achieves an estimation error of $O(\sqrt{\epsilon})$ for the true DRO objective value using only the contaminated data under the bounded covariance assumption. This work establishes the first rigorous guarantees, supported by efficient computation, for learning under the dual challenges of data contamination and distributional shifts.
♻ ☆ On the Variance, Admissibility, and Stability of Empirical Risk Minimization
It is well known that Empirical Risk Minimization (ERM) may attain minimax suboptimal rates in terms of the mean squared error (Birg\'e and Massart, 1993). In this paper, we prove that, under relatively mild assumptions, the suboptimality of ERM must be due to its large bias. Namely, the variance error term of ERM is bounded by the minimax rate. In the fixed design setting, we provide an elementary proof of this result using the probabilistic method. Then, we extend our proof to the random design setting for various models. In addition, we provide a simple proof of Chatterjee's admissibility theorem (Chatterjee, 2014, Theorem 1.4), which states that in the fixed design setting, ERM cannot be ruled out as an optimal method, and then we extend this result to the random design setting. We also show that our estimates imply the stability of ERM, complementing the main result of Caponnetto and Rakhlin (2006) for non-Donsker classes. Finally, we highlight the somewhat irregular nature of the loss landscape of ERM in the non-Donsker regime, by showing that functions can be close to ERM, in terms of $L_2$ distance, while still being far from almost-minimizers of the empirical loss.
comment: Appeared at Neurips 2023 (spotlight)
♻ ☆ AI-Powered Energy Algorithmic Trading: Integrating Hidden Markov Models with Neural Networks
In quantitative finance, machine learning methods are essential for alpha generation. This study introduces a new approach that combines Hidden Markov Models (HMM) and neural networks, integrated with Black-Litterman portfolio optimization. During the COVID period (2019-2022), this dual-model approach achieved a 83% return with a Sharpe ratio of 0.77. It incorporates two risk models to enhance risk management, showing efficiency during volatile periods. The methodology was implemented on the QuantConnect platform, which was chosen for its robust framework and experimental reproducibility. The system, which predicts future price movements, includes a three-year warm-up to ensure proper algorithm function. It targets highly liquid, large-cap energy stocks to ensure stable and predictable performance while also considering broker payments. The dual-model alpha system utilizes log returns to select the optimal state based on the historical performance. It combines state predictions with neural network outputs, which are based on historical data, to generate trading signals. This study examined the architecture of the trading system, data pre-processing, training, and performance. The full code and backtesting data are available under the QuantConnect terms: https://github.com/tiagomonteiro0715/AI-Powered-Energy-Algorithmic-Trading-Integrating-Hidden-Markov-Models-with-Neural-Networks
comment: 14 pages, 4 figures, 2 tables
♻ ☆ Wait-Less Offline Tuning and Re-solving for Online Decision Making
Online linear programming (OLP) has found broad applications in revenue management and resource allocation. State-of-the-art OLP algorithms achieve low regret by repeatedly solving linear programming (LP) subproblems that incorporate updated resource information. However, LP-based methods are computationally expensive and often inefficient for large-scale applications. In contrast, recent first-order OLP algorithms are more computationally efficient but typically suffer from worse regret guarantees. To address these shortcomings, we propose a new algorithm that combines the strengths of LP-based and first-order OLP methods. The algorithm re-solves the LP subproblems periodically at a predefined frequency $f$ and uses the latest dual prices to guide online decision-making. In addition, a first-order method runs in parallel during each interval between LP re-solves, smoothing resource consumption. Our algorithm achieves $\mathscr{O}(\log (T/f) + \sqrt{f})$ regret, delivering a "wait-less" online decision-making process that balances the computational efficiency of first-order methods and the superior regret guarantee of LP-based methods.
comment: In this version, we achieve a tighter regret bound with the warm start for the first batch. We also make the proof more elegant by manually accepting all subsequent orders once the constraint is violated. In this way, we do not need to introduce the concept of stopping time for the analysis of the LP-based method
♻ ☆ Mechanism Learning: reverse causal inference in the presence of multiple unknown confounding through causally weighted Gaussian mixture models
A major limitation of machine learning (ML) prediction models is that they recover associational, rather than causal, predictive relationships between variables. In high-stakes automation applications of ML this is problematic, as the model often learns spurious, non-causal associations. This paper proposes mechanism learning, a simple method which uses causally weighted Gaussian Mixture Models (CW-GMMs) to deconfound observational data such that any appropriate ML model is forced to learn predictive relationships between effects and their causes (reverse causal inference), despite the potential presence of multiple unknown and unmeasured confounding. Effect variables can be very high-dimensional, and the predictive relationship nonlinear, as is common in ML applications. This novel method is widely applicable, the only requirement is the existence of a set of mechanism variables mediating the cause (prediction target) and effect (feature data), which is independent of the (unmeasured) confounding variables. We test our method on fully synthetic, semi-synthetic and real-world datasets, demonstrating that it can discover reliable, unbiased, causal ML predictors where by contrast, the same ML predictor trained naively using classical supervised learning on the original observational data, is heavily biased by spurious associations. We provide code to implement the results in the paper, online.
comment: 15 pages, 7 figures
♻ ☆ PromptWise: Online Learning for Cost-Aware Prompt Assignment in Generative Models
The rapid advancement of generative AI has provided users with a wide range of well-trained models to address diverse prompts. When selecting a model for a given prompt, users should weigh not only its performance but also its service cost. However, existing model-selection methods typically emphasize performance while overlooking cost differences. In this paper, we introduce PromptWise, an online learning framework that assigns prompts to generative models in a cost-aware manner. PromptWise estimates prompt-model compatibility to select the least expensive model expected to deliver satisfactory outputs. Unlike standard contextual bandits that make a one-shot decision per prompt, PromptWise employs a cost-aware bandit structure that allows sequential model assignments per prompt to reduce total service cost. Through numerical experiments on tasks such as code generation and translation, we demonstrate that PromptWise can achieve performance comparable to baseline selection methods while incurring substantially lower costs. The code is available at: github.com/yannxiaoyanhu/PromptWise.
comment: 39 pages
♻ ☆ SST: Multi-Scale Hybrid Mamba-Transformer Experts for Time Series Forecasting CIKM 2025
Time series forecasting has made significant advances, including with Transformer-based models. The attention mechanism in Transformer effectively captures temporal dependencies by attending to all past inputs simultaneously. However, its quadratic complexity with respect to sequence length limits the scalability for long-range modeling. Recent state space models (SSMs) such as Mamba offer a promising alternative by achieving linear complexity without attention. Yet, Mamba compresses historical information into a fixed-size latent state, potentially causing information loss and limiting representational effectiveness. This raises a key research question: Can we design a hybrid Mamba-Transformer architecture that is both effective and efficient for time series forecasting? To address it, we adapt a hybrid Mamba-Transformer architecture Mambaformer, originally proposed for language modeling, to the time series domain. Preliminary experiments reveal that naively stacking Mamba and Transformer layers in Mambaformer is suboptimal for time series forecasting, due to an information interference problem. To mitigate this issue, we introduce a new time series decomposition strategy that separates time series into long-range patterns and short-range variations. Then we show that Mamba excels at capturing long-term structures, while Transformer is more effective at modeling short-term dynamics. Building on this insight, we propose State Space Transformer (SST), a multi-scale hybrid model with expert modules: a Mamba expert for long-range patterns and a Transformer expert for short-term variations. SST also employs a multi-scale patching mechanism to adaptively adjust time series resolution: low resolution for long-term patterns and high resolution for short-term variations. Experiments show that SST obtains SOTA performance with linear scalability. The code is at https://github.com/XiongxiaoXu/SST.
comment: CIKM 2025
♻ ☆ Distributionally Robust Wireless Semantic Communication with Large AI Models
Semantic communication (SemCom) has emerged as a promising paradigm for 6G wireless systems by transmitting task-relevant information rather than raw bits, yet existing approaches remain vulnerable to dual sources of uncertainty: semantic misinterpretation arising from imperfect feature extraction and transmission-level perturbations from channel noise. Current deep learning based SemCom systems typically employ domain-specific architectures that lack robustness guarantees and fail to generalize across diverse noise conditions, adversarial attacks, and out-of-distribution data. In this paper, a novel and generalized semantic communication framework called WaSeCom is proposed to systematically address uncertainty and enhance robustness. In particular, Wasserstein distributionally robust optimization is employed to provide resilience against semantic misinterpretation and channel perturbations. A rigorous theoretical analysis is performed to establish the robust generalization guarantees of the proposed framework. Experimental results on image and text transmission demonstrate that WaSeCom achieves improved robustness under noise and adversarial perturbations. These results highlight its effectiveness in preserving semantic fidelity across varying wireless conditions.
comment: Under Review
♻ ☆ Hybrid-Task Meta-Learning: A GNN Approach for Scalable and Transferable Bandwidth Allocation
In this paper, we develop a deep learning-based bandwidth allocation policy that is: 1) scalable with the number of users and 2) transferable to different communication scenarios, such as non-stationary wireless channels, different quality-of-service (QoS) requirements, and dynamically available resources. To support scalability, the bandwidth allocation policy is represented by a graph neural network (GNN), with which the number of training parameters does not change with the number of users. To enable the generalization of the GNN, we develop a hybrid-task meta-learning (HML) algorithm that trains the initial parameters of the GNN with different communication scenarios during meta-training. Next, during meta-testing, a few samples are used to fine-tune the GNN with unseen communication scenarios. Simulation results demonstrate that our HML approach can improve the initial performance by 8.79%, and sample efficiency by 73%, compared with existing benchmarks. After fine-tuning, our near-optimal GNN-based policy can achieve close to the same reward with much lower inference complexity compared to the optimal policy obtained using iterative optimization. Numerical results validate that our HML can reduce the computation time by approximately 200 to 2000 times than the optimal iterative algorithm.
♻ ☆ Federated Vision-Language-Recommendation with Personalized Fusion
Applying large pre-trained Vision-Language Models to recommendation is a burgeoning field, a direction we term Vision-Language-Recommendation (VLR). Bringing VLR to user-oriented on-device intelligence within a federated learning framework is a crucial step for enhancing user privacy and delivering personalized experiences. This paper introduces FedVLR, a federated VLR framework specially designed for user-specific personalized fusion of vision-language representations. At its core is a novel bi-level fusion mechanism: The server-side multi-view fusion module first generates a diverse set of pre-fused multimodal views. Subsequently, each client employs a user-specific mixture-of-expert mechanism to adaptively integrate these views based on individual user interaction history. This designed lightweight personalized fusion module provides an efficient solution to implement a federated VLR system. The effectiveness of our proposed FedVLR has been validated on seven benchmark datasets.
comment: 15 pages, 10 figures, 7 tables, conference
Graphics 6
☆ Detecting Coverage Holes in Wireless Sensor Networks Using Connected Component Labeling and Force-Directed Algorithms
Contour detection in Wireless Sensor Networks (WSNs) is crucial for tasks like energy saving and network optimization, especially in security and surveillance applications. Coverage holes, where data transmission is not achievable, are a significant issue caused by factors such as energy depletion and physical damage. Traditional methods for detecting these holes often suffer from inaccuracy, low processing speed, and high energy consumption, relying heavily on physical information like node coordinates and sensing range. To address these challenges, we propose a novel, coordinate-free coverage hole detection method using Connected Component Labeling (CCL) and Force-Directed (FD) algorithms, termed FD-CCL. This method does not require node coordinates or sensing range information. We also investigate Suzuki's Contour Tracing (CT) algorithm and compare its performance with CCL on various FD graphs. Our experiments demonstrate the effectiveness of FD-CCL in terms of processing time and accuracy. Simulation results confirm the superiority of FD-CCL in detecting and locating coverage holes in WSNs.
☆ G2rammar: Bilingual Grammar Modeling for Enhanced Text-attributed Graph Learning
Text-attributed graphs require models to effectively integrate both structural topology and semantic content. Recent approaches apply large language models to graphs by linearizing structures into token sequences through random walks. These methods create concise graph vocabularies to replace verbose natural language descriptions. However, they overlook a critical component that makes language expressive: grammar. In natural language, grammar assigns syntactic roles to words and defines their functions within sentences. Similarly, nodes in graphs play distinct structural roles as hubs, bridges, or peripheral members. Current graph language methods provide tokens without grammatical annotations to indicate these structural or semantic roles. This absence limits language models' ability to reason about graph topology effectively. We propose \textbf{G2rammar}, a bilingual grammar framework that explicitly encodes both structural and semantic grammar for text-attributed graphs. Structural grammar characterizes topological roles through centrality and neighborhood patterns. Semantic grammar captures content relationships through textual informativity. The framework implements two-stage learning with structural grammar pre-training followed by semantic grammar fine-tuning. Extensive experiments on real-world datasets demonstrate that G2rammar consistently outperforms competitive baselines by providing language models with the grammatical context needed to understand graph structures.
☆ GraphGeo: Multi-Agent Debate Framework for Visual Geo-localization with Heterogeneous Graph Neural Networks
Visual geo-localization requires extensive geographic knowledge and sophisticated reasoning to determine image locations without GPS metadata. Traditional retrieval methods are constrained by database coverage and quality. Recent Large Vision-Language Models (LVLMs) enable direct location reasoning from image content, yet individual models struggle with diverse geographic regions and complex scenes. Existing multi-agent systems improve performance through model collaboration but treat all agent interactions uniformly. They lack mechanisms to handle conflicting predictions effectively. We propose \textbf{GraphGeo}, a multi-agent debate framework using heterogeneous graph neural networks for visual geo-localization. Our approach models diverse debate relationships through typed edges, distinguishing supportive collaboration, competitive argumentation, and knowledge transfer. We introduce a dual-level debate mechanism combining node-level refinement and edge-level argumentation modeling. A cross-level topology refinement strategy enables co-evolution between graph structure and agent representations. Experiments on multiple benchmarks demonstrate GraphGeo significantly outperforms state-of-the-art methods. Our framework transforms cognitive conflicts between agents into enhanced geo-localization accuracy through structured debate.
☆ Empowering LLMs with Structural Role Inference for Zero-Shot Graph Learning
Large Language Models have emerged as a promising approach for graph learning due to their powerful reasoning capabilities. However, existing methods exhibit systematic performance degradation on structurally important nodes such as bridges and hubs. We identify the root cause of these limitations. Current approaches encode graph topology into static features but lack reasoning scaffolds to transform topological patterns into role-based interpretations. This limitation becomes critical in zero-shot scenarios where no training data establishes structure-semantics mappings. To address this gap, we propose DuoGLM, a training-free dual-perspective framework for structure-aware graph reasoning. The local perspective constructs relation-aware templates capturing semantic interactions between nodes and neighbors. The global perspective performs topology-to-role inference to generate functional descriptions of structural positions. These complementary perspectives provide explicit reasoning mechanisms enabling LLMs to distinguish topologically similar but semantically different nodes. Extensive experiments across eight benchmark datasets demonstrate substantial improvements. DuoGLM achieves 14.3\% accuracy gain in zero-shot node classification and 7.6\% AUC improvement in cross-domain transfer compared to existing methods. The results validate the effectiveness of explicit role reasoning for graph understanding with LLMs.
♻ ☆ MOSPA: Human Motion Generation Driven by Spatial Audio NeurIPS 2025
Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As of yet, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our code and model are publicly available at https://github.com/xsy27/Mospa-Acoustic-driven-Motion-Generation
comment: NeurIPS 2025 (Spotlight)
♻ ☆ Improved visual-information-driven model for crowd simulation and its modular application
Data-driven crowd simulation models offer advantages in enhancing the accuracy and realism of simulations, and improving their generalizability is essential for promoting application. Current data-driven approaches are primarily designed for a single scenario, with very few models validated across more than two scenarios. It is still an open question to develop data-driven crowd simulation models with strong generalizibility. We notice that the key to addressing this challenge lies in effectively and accurately capturing the core common influential features that govern pedestrians' navigation across diverse scenarios. Particularly, we believe that visual information is one of the most dominant influencing features. In light of this, this paper proposes a data-driven model incorporating a refined visual information extraction method and exit cues to enhance generalizability. The proposed model is examined on four common fundamental modules: bottleneck, corridor, corner and T-junction. The evaluation results demonstrate that our model performs excellently across these scenarios, aligning with pedestrian movement in real-world experiments, and significantly outperforms the classical knowledge-driven model. Furthermore, we introduce a modular approach to apply our proposed model in composite scenarios, and the results regarding trajectories and fundamental diagrams indicate that our simulations closely match real-world patterns in the composite scenario. The research outcomes can provide inspiration for the development of data-driven crowd simulation models with high generalizability and advance the application of data-driven approaches.This work has been submitted to Elsevier for possible publication.
Computer Vision and Pattern Recognition 20
☆ Anatomically Constrained Transformers for Echocardiogram Analysis
Video transformers have recently demonstrated strong potential for echocardiogram (echo) analysis, leveraging self-supervised pre-training and flexible adaptation across diverse tasks. However, like other models operating on videos, they are prone to learning spurious correlations from non-diagnostic regions such as image backgrounds. To overcome this limitation, we propose the Video Anatomically Constrained Transformer (ViACT), a novel framework that integrates anatomical priors directly into the transformer architecture. ViACT represents a deforming anatomical structure as a point set and encodes both its spatial geometry and corresponding image patches into transformer tokens. During pre-training, ViACT follows a masked autoencoding strategy that masks and reconstructs only anatomical patches, enforcing that representation learning is focused on the anatomical region. The pre-trained model can then be fine-tuned for tasks localized to this region. In this work we focus on the myocardium, demonstrating the framework on echo analysis tasks such as left ventricular ejection fraction (EF) regression and cardiac amyloidosis (CA) detection. The anatomical constraint focuses transformer attention within the myocardium, yielding interpretable attention maps aligned with regions of known CA pathology. Moreover, ViACT generalizes to myocardium point tracking without requiring task-specific components such as correlation volumes used in specialized tracking networks.
☆ Epanechnikov nonparametric kernel density estimation based feature-learning in respiratory disease chest X-ray images
This study presents a novel method for diagnosing respiratory diseases using image data. It combines Epanechnikov's non-parametric kernel density estimation (EKDE) with a bimodal logistic regression classifier in a statistical-model-based learning scheme. EKDE's flexibility in modeling data distributions without assuming specific shapes and its adaptability to pixel intensity variations make it valuable for extracting key features from medical images. The method was tested on 13808 randomly selected chest X-rays from the COVID-19 Radiography Dataset, achieved an accuracy of 70.14%, a sensitivity of 59.26%, and a specificity of 74.18%, demonstrating moderate performance in detecting respiratory disease while showing room for improvement in sensitivity. While clinical expertise remains essential for further refining the model, this study highlights the potential of EKDE-based approaches to enhance diagnostic accuracy and reliability in medical imaging.
comment: 12 pages, 6 figures, 3 tables
☆ SliceVision-F2I: A Synthetic Feature-to-Image Dataset for Visual Pattern Representation on Network Slices
The emergence of 5G and 6G networks has established network slicing as a significant part of future service-oriented architectures, demanding refined identification methods supported by robust datasets. The article presents SliceVision-F2I, a dataset of synthetic samples for studying feature visualization in network slicing for next-generation networking systems. The dataset transforms multivariate Key Performance Indicator (KPI) vectors into visual representations through four distinct encoding methods: physically inspired mappings, Perlin noise, neural wallpapering, and fractal branching. For each encoding method, 30,000 samples are generated, each comprising a raw KPI vector and a corresponding RGB image at low-resolution pixels. The dataset simulates realistic and noisy network conditions to reflect operational uncertainties and measurement imperfections. SliceVision-F2I is suitable for tasks involving visual learning, network state classification, anomaly detection, and benchmarking of image-based machine learning techniques applied to network data. The dataset is publicly available and can be reused in various research contexts, including multivariate time series analysis, synthetic data generation, and feature-to-image transformations.
☆ GeoToken: Hierarchical Geolocalization of Images via Next Token Prediction ICDM
Image geolocalization, the task of determining an image's geographic origin, poses significant challenges, largely due to visual similarities across disparate locations and the large search space. To address these issues, we propose a hierarchical sequence prediction approach inspired by how humans narrow down locations from broad regions to specific addresses. Analogously, our model predicts geographic tokens hierarchically, first identifying a general region and then sequentially refining predictions to increasingly precise locations. Rather than relying on explicit semantic partitions, our method uses S2 cells, a nested, multiresolution global grid, and sequentially predicts finer-level cells conditioned on visual inputs and previous predictions. This procedure mirrors autoregressive text generation in large language models. Much like in language modeling, final performance depends not only on training but also on inference-time strategy. We investigate multiple top-down traversal methods for autoregressive sampling, incorporating techniques from test-time compute scaling used in language models. Specifically, we integrate beam search and multi-sample inference while exploring various selection strategies to determine the final output. This enables the model to manage uncertainty by exploring multiple plausible paths through the hierarchy. We evaluate our method on the Im2GPS3k and YFCC4k datasets against two distinct sets of baselines: those that operate without a Multimodal Large Language Model (MLLM) and those that leverage one. In the MLLM-free setting, our model surpasses other comparable baselines on nearly all metrics, achieving state-of-the-art performance with accuracy gains of up to 13.9%. When augmented with an MLLM, our model outperforms all baselines, setting a new state-of-the-art across all metrics. The source code is available at https://github.com/NNargesNN/GeoToken.
comment: Accepted to IEEE International Conference on Data Mining (ICDM) 2025
☆ T-MLA: A Targeted Multiscale Log--Exponential Attack Framework for Neural Image Compression
Neural image compression (NIC) has become the state-of-the-art for rate-distortion performance, yet its security vulnerabilities remain significantly less understood than those of classifiers. Existing adversarial attacks on NICs are often naive adaptations of pixel-space methods, overlooking the unique, structured nature of the compression pipeline. In this work, we propose a more advanced class of vulnerabilities by introducing T-MLA, the first targeted multiscale log--exponential attack framework. Our approach crafts adversarial perturbations in the wavelet domain by directly targeting the quality of the attacked and reconstructed images. This allows for a principled, offline attack where perturbations are strategically confined to specific wavelet subbands, maximizing distortion while ensuring perceptual stealth. Extensive evaluation across multiple state-of-the-art NIC architectures on standard image compression benchmarks reveals a large drop in reconstruction quality while the perturbations remain visually imperceptible. Our findings reveal a critical security flaw at the core of generative and content delivery pipelines.
comment: Submitted to Information Systems. Code will be released upon journal publication
☆ FastBoost: Progressive Attention with Dynamic Scaling for Efficient Deep Learning
We present FastBoost, a parameter-efficient neural architecture that achieves state-of-the-art performance on CIFAR benchmarks through a novel Dynamically Scaled Progressive Attention (DSPA) mechanism. Our design establishes new efficiency frontiers with: CIFAR-10: 95.57% accuracy (0.85M parameters) and 93.80% (0.37M parameters) CIFAR-100: 81.37% accuracy (0.92M parameters) and 74.85% (0.44M parameters) The breakthrough stems from three fundamental innovations in DSPA: (1) Adaptive Fusion: Learnt channel-spatial attention blending with dynamic weights. (2) Phase Scaling: Training-stage-aware intensity modulation (from 0.5 to 1.0). (3) Residual Adaptation: Self-optimized skip connections (gamma from 0.5 to 0.72). By integrating DSPA with enhanced MBConv blocks, FastBoost achieves a 2.1 times parameter reduction over MobileNetV3 while improving accuracy by +3.2 percentage points on CIFAR-10. The architecture features dual attention pathways with real-time weight adjustment, cascaded refinement layers (increasing gradient flow by 12.7%), and a hardware-friendly design (0.28G FLOPs). This co-optimization of dynamic attention and efficient convolution operations demonstrates unprecedented parameter-accuracy trade-offs, enabling deployment in resource-constrained edge devices without accuracy degradation.
comment: 17pages , 10figures , 12tables
☆ HyFormer-Net: A Synergistic CNN-Transformer with Interpretable Multi-Scale Fusion for Breast Lesion Segmentation and Classification in Ultrasound Images
B-mode ultrasound for breast cancer diagnosis faces challenges: speckle, operator dependency, and indistinct boundaries. Existing deep learning suffers from single-task learning, architectural constraints (CNNs lack global context, Transformers local features), and black-box decision-making. These gaps hinder clinical adoption. We propose HyFormer-Net, a hybrid CNN-Transformer for simultaneous segmentation and classification with intrinsic interpretability. Its dual-branch encoder integrates EfficientNet-B3 and Swin Transformer via multi-scale hierarchical fusion blocks. An attention-gated decoder provides precision and explainability. We introduce dual-pipeline interpretability: (1) intrinsic attention validation with quantitative IoU verification (mean: 0.86), and (2) Grad-CAM for classification reasoning. On the BUSI dataset, HyFormer-Net achieves Dice Score 0.761 +/- 0.072 and accuracy 93.2%, outperforming U-Net, Attention U-Net, and TransUNet. Malignant Recall of 92.1 +/- 2.2% ensures minimal false negatives. Ensemble modeling yields exceptional Dice 90.2%, accuracy 99.5%, and perfect 100% Malignant Recall, eliminating false negatives. Ablation studies confirm multi-scale fusion contributes +16.8% Dice and attention gates add +5.9%. Crucially, we conduct the first cross-dataset generalization study for hybrid CNN-Transformers in breast ultrasound. Zero-shot transfer fails (Dice: 0.058), confirming domain shift. However, progressive fine-tuning with only 10% target-domain data (68 images) recovers 92.5% performance. With 50% data, our model achieves 77.3% Dice, exceeding source-domain performance (76.1%) and demonstrating true generalization.
comment: This manuscript has been submitted to Informatics in Medicine Unlocked
☆ Integrating Visual and X-Ray Machine Learning Features in the Study of Paintings by Goya
Art authentication of Francisco Goya's works presents complex computational challenges due to his heterogeneous stylistic evolution and extensive historical patterns of forgery. We introduce a novel multimodal machine learning framework that applies identical feature extraction techniques to both visual and X-ray radiographic images of Goya paintings. The unified feature extraction pipeline incorporates Grey-Level Co-occurrence Matrix descriptors, Local Binary Patterns, entropy measures, energy calculations, and colour distribution analysis applied consistently across both imaging modalities. The extracted features from both visual and X-ray images are processed through an optimised One-Class Support Vector Machine with hyperparameter tuning. Using a dataset of 24 authenticated Goya paintings with corresponding X-ray images, split into an 80/20 train-test configuration with 10-fold cross-validation, the framework achieves 97.8% classification accuracy with a 0.022 false positive rate. Case study analysis of ``Un Gigante'' demonstrates the practical efficacy of our pipeline, achieving 92.3% authentication confidence through unified multimodal feature analysis. Our results indicate substantial performance improvement over single-modal approaches, establishing the effectiveness of applying identical computational methods to both visual and radiographic imagery in art authentication applications.
☆ MID: A Self-supervised Multimodal Iterative Denoising Framework
Data denoising is a persistent challenge across scientific and engineering domains. Real-world data is frequently corrupted by complex, non-linear noise, rendering traditional rule-based denoising methods inadequate. To overcome these obstacles, we propose a novel self-supervised multimodal iterative denoising (MID) framework. MID models the collected noisy data as a state within a continuous process of non-linear noise accumulation. By iteratively introducing further noise, MID learns two neural networks: one to estimate the current noise step and another to predict and subtract the corresponding noise increment. For complex non-linear contamination, MID employs a first-order Taylor expansion to locally linearize the noise process, enabling effective iterative removal. Crucially, MID does not require paired clean-noisy datasets, as it learns noise characteristics directly from the noisy inputs. Experiments across four classic computer vision tasks demonstrate MID's robustness, adaptability, and consistent state-of-the-art performance. Moreover, MID exhibits strong performance and adaptability in tasks within the biomedical and bioinformatics domains.
☆ Deciphering Personalization: Towards Fine-Grained Explainability in Natural Language for Personalized Image Generation Models
Image generation models are usually personalized in practical uses in order to better meet the individual users' heterogeneous needs, but most personalized models lack explainability about how they are being personalized. Such explainability can be provided via visual features in generated images, but is difficult for human users to understand. Explainability in natural language is a better choice, but the existing approaches to explainability in natural language are limited to be coarse-grained. They are unable to precisely identify the multiple aspects of personalization, as well as the varying levels of personalization in each aspect. To address such limitation, in this paper we present a new technique, namely \textbf{FineXL}, towards \textbf{Fine}-grained e\textbf{X}plainability in natural \textbf{L}anguage for personalized image generation models. FineXL can provide natural language descriptions about each distinct aspect of personalization, along with quantitative scores indicating the level of each aspect of personalization. Experiment results show that FineXL can improve the accuracy of explainability by 56\%, when different personalization scenarios are applied to multiple types of image generation models.
♻ ☆ CanadaFireSat: Toward high-resolution wildfire forecasting with multiple modalities
Canada experienced in 2023 one of the most severe wildfire seasons in recent history, causing damage across ecosystems, destroying communities, and emitting large quantities of CO2. This extreme wildfire season is symptomatic of a climate-change-induced increase in the length and severity of the fire season that affects the boreal ecosystem. Therefore, it is critical to empower wildfire management in boreal communities with better mitigation solutions. Wildfire probability maps represent an important tool for understanding the likelihood of wildfire occurrence and the potential severity of future wildfires. The massive increase in the availability of Earth observation data has enabled the development of deep learning-based wildfire forecasting models, aiming at providing precise wildfire probability maps at different spatial and temporal scales. A main limitation of such methods is their reliance on coarse-resolution environmental drivers and satellite products, leading to wildfire occurrence prediction of reduced resolution, typically around $\sim 0.1${\deg}. This paper presents a benchmark dataset: CanadaFireSat, and baseline methods for high-resolution: 100 m wildfire forecasting across Canada, leveraging multi-modal data from high-resolution multi-spectral satellite images (Sentinel-2 L1C), mid-resolution satellite products (MODIS), and environmental factors (ERA5 reanalysis data). Our experiments consider two major deep learning architectures. We observe that using multi-modal temporal inputs outperforms single-modal temporal inputs across all metrics, achieving a peak performance of 60.3% in F1 score for the 2023 wildfire season, a season never seen during model training. This demonstrates the potential of multi-modal deep learning models for wildfire forecasting at high-resolution and continental scale.
comment: 34 pages, 11 figures
♻ ☆ Learning to Steer: Input-dependent Steering for Multimodal LLMs NeurIPS 2025
Steering has emerged as a practical approach to enable post-hoc guidance of LLMs towards enforcing a specific behavior. However, it remains largely underexplored for multimodal LLMs (MLLMs); furthermore, existing steering techniques, such as mean steering, rely on a single steering vector, applied independently of the input query. This paradigm faces limitations when the desired behavior is dependent on the example at hand. For example, a safe answer may consist in abstaining from answering when asked for an illegal activity, or may point to external resources or consultation with an expert when asked about medical advice. In this paper, we investigate a fine-grained steering that uses an input-specific linear shift. This shift is computed using contrastive input-specific prompting. However, the input-specific prompts required for this approach are not known at test time. Therefore, we propose to train a small auxiliary module to predict the input-specific steering vector. Our approach, dubbed as L2S (Learn-to-Steer), demonstrates that it reduces hallucinations and enforces safety in MLLMs, outperforming other static baselines. Our code is publicly available at https://jayneelparekh.github.io/learn-to-steer/
comment: NeurIPS 2025
♻ ☆ Spatial Knowledge Graph-Guided Multimodal Synthesis
Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. Our approach addresses this critical gap by providing a systematic framework for generating spatially coherent data. In this work, we introduce SKG2DATA, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2DATA employs an automated pipeline for constructing Spatial Knowledge Graph (SKG) that effectively captures human-like spatial cognition, including directional and distance relationships. These structured representations then serve as precise guidance for our integrated synthesis pipeline, where a diffusion model generates spatially-consistent images while a MLLM produces corresponding textual descriptions. The automated construction of SKG enables scalable generation of diverse yet realistic spatial configurations, overcoming the limitations of manual data collection and annotation. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly, albeit with a slight cost to their general capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence. Code is available at https://github.com/zjunlp/Knowledge2Data.
comment: IEEE/ACM Transactions on Audio, Speech and Language Processing
♻ ☆ Dropping the D: RGB-D SLAM Without the Depth Sensor
We present DropD-SLAM, a real-time monocular SLAM system that achieves RGB-D-level accuracy without relying on depth sensors. The system replaces active depth input with three pretrained vision modules: a monocular metric depth estimator, a learned keypoint detector, and an instance segmentation network. Dynamic objects are suppressed using dilated instance masks, while static keypoints are assigned predicted depth values and backprojected into 3D to form metrically scaled features. These are processed by an unmodified RGB-D SLAM back end for tracking and mapping. On the TUM RGB-D benchmark, DropD-SLAM attains 7.4 cm mean ATE on static sequences and 1.8 cm on dynamic sequences, matching or surpassing state-of-the-art RGB-D methods while operating at 22 FPS on a single GPU. These results suggest that modern pretrained vision models can replace active depth sensors as reliable, real-time sources of metric scale, marking a step toward simpler and more cost-effective SLAM systems.
♻ ☆ As Good as It KAN Get: High-Fidelity Audio Representation CIKM '25
Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their applications in audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. KAN demonstrates superior perceptual performance over previous INRs, achieving the lowest Log-SpectralDistance of 1.29 and the highest Perceptual Evaluation of Speech Quality of 3.57 for 1.5 s audio. To extend KAN's utility, we propose FewSound, a hypernetwork-based architecture that enhances INR parameter updates. FewSound outperforms the state-of-the-art HyperSound, with a 33.3% improvement in MSE and 60.87% in SI-SNR. These results show KAN as a robust and adaptable audio representation with the potential for scalability and integration into various hypernetwork frameworks. The source code can be accessed at https://github.com/gmum/fewsound.git.
comment: Accepted to the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)
♻ ☆ Modality-AGnostic Image Cascade (MAGIC) for Multi-Modality Cardiac Substructure Segmentation
Cardiac substructure delineation is emerging in treatment planning to minimize the risk of radiation-induced heart disease. Deep learning offers efficient methods to reduce contouring burden but currently lacks generalizability across different modalities and overlapping structures. This work introduces and validates a Modality-AGnostic Image Cascade (MAGIC) deep-learning pipeline for comprehensive and multi-modal cardiac substructure segmentation. MAGIC is implemented through replicated encoding and decoding branches of an nnU-Net backbone to handle multi-modality inputs and overlapping labels. First benchmarked on the multi-modality whole-heart segmentation (MMWHS) dataset including cardiac CT-angiography (CCTA) and MR modalities, twenty cardiac substructures (heart, chambers, great vessels (GVs), valves, coronary arteries (CAs), and conduction nodes) from clinical simulation CT (Sim-CT), low-field MR-Linac, and cardiac CT-angiography (CCTA) modalities were delineated to train semi-supervised (n=151), validate (n=15), and test (n=30) MAGIC. For comparison, fourteen single-modality comparison models (two MMWHS modalities and four subgroups across three clinical modalities) were trained. Methods were evaluated for efficiency and against reference contours through the Dice similarity coefficient (DSC) and two-tailed Wilcoxon Signed-Rank test (p<0.05). Average MMWHS DSC scores across CCTA and MR inputs were 0.88(0.08) and 0.87(0.04) respectively with significant improvement over unimodal baselines. Average 20-structure DSC scores were 0.75(0.16) for Sim-CT, 0.68(0.21) for MR-Linac, and 0.80(0.16) for CCTA. Furthermore, >80% and >70% reductions in training time and parameters were achieved, respectively. MAGIC offers an efficient, lightweight solution capable of segmenting multiple image modalities and overlapping structures in a single model without compromising segmentation accuracy.
♻ ☆ LiteTracker: Leveraging Temporal Causality for Accurate Low-latency Tissue Tracking
Tissue tracking plays a critical role in various surgical navigation and extended reality (XR) applications. While current methods trained on large synthetic datasets achieve high tracking accuracy and generalize well to endoscopic scenes, their runtime performances fail to meet the low-latency requirements necessary for real-time surgical applications. To address this limitation, we propose LiteTracker, a low-latency method for tissue tracking in endoscopic video streams. LiteTracker builds on a state-of-the-art long-term point tracking method, and introduces a set of training-free runtime optimizations. These optimizations enable online, frame-by-frame tracking by leveraging a temporal memory buffer for efficient feature reuse and utilizing prior motion for accurate track initialization. LiteTracker demonstrates significant runtime improvements being around 7x faster than its predecessor and 2x than the state-of-the-art. Beyond its primary focus on efficiency, LiteTracker delivers high-accuracy tracking and occlusion prediction, performing competitively on both the STIR and SuPer datasets. We believe LiteTracker is an important step toward low-latency tissue tracking for real-time surgical applications in the operating room. Our code is publicly available at https://github.com/ImFusionGmbH/lite-tracker.
♻ ☆ AdaSCALE: Adaptive Scaling for OOD Detection SC
The ability of the deep learning model to recognize when a sample falls outside its learned distribution is critical for safe and reliable deployment. Recent state-of-the-art out-of-distribution (OOD) detection methods leverage activation shaping to improve the separation between in-distribution (ID) and OOD inputs. These approaches resort to sample-specific scaling but apply a static percentile threshold across all samples regardless of their nature, resulting in suboptimal ID-OOD separability. In this work, we propose \textbf{AdaSCALE}, an adaptive scaling procedure that dynamically adjusts the percentile threshold based on a sample's estimated OOD likelihood. This estimation leverages our key observation: OOD samples exhibit significantly more pronounced activation shifts at high-magnitude activations under minor perturbation compared to ID samples. AdaSCALE enables stronger scaling for likely ID samples and weaker scaling for likely OOD samples, yielding highly separable energy scores. Our approach achieves state-of-the-art OOD detection performance, outperforming the latest rival OptFS by 14.94% in near-OOD and 21.67% in far-OOD datasets in average FPR@95 metric on the ImageNet-1k benchmark across eight diverse architectures. The code is available at: https://github.com/sudarshanregmi/AdaSCALE/
comment: https://github.com/sudarshanregmi/AdaSCALE/
♻ ☆ A Racing Dataset and Baseline Model for Track Detection in Autonomous Racing
A significant challenge in racing-related research is the lack of publicly available datasets containing raw images with corresponding annotations for the downstream task. In this paper, we introduce RoRaTrack, a novel dataset that contains annotated multi-camera image data from racing scenarios for track detection. The data is collected on a Dallara AV-21 at a racing circuit in Indiana, in collaboration with the Indy Autonomous Challenge (IAC). RoRaTrack addresses common problems such as blurriness due to high speed, color inversion from the camera, and absence of lane markings on the track. Consequently, we propose RaceGAN, a baseline model based on a Generative Adversarial Network (GAN) that effectively addresses these challenges. The proposed model demonstrates superior performance compared to current state-of-the-art machine learning models in track detection. The dataset and code for this work are available at https://github.com/ghosh64/RaceGAN.
comment: Currently Under Review
♻ ☆ Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data ICCV 2025
AI for tumor segmentation is limited by the lack of large, voxel-wise annotated datasets, which are hard to create and require medical experts. In our proprietary JHH dataset of 3,000 annotated pancreatic tumor scans, we found that AI performance stopped improving after 1,500 scans. With synthetic data, we reached the same performance using only 500 real scans. This finding suggests that synthetic data can steepen data scaling laws, enabling more efficient model training than real data alone. Motivated by these lessons, we created AbdomenAtlas 2.0--a dataset of 10,135 CT scans with a total of 15,130 tumor instances per-voxel manually annotated in six organs (pancreas, liver, kidney, colon, esophagus, and uterus) and 5,893 control scans. Annotated by 23 expert radiologists, it is several orders of magnitude larger than existing public tumor datasets. While we continue expanding the dataset, the current version of AbdomenAtlas 2.0 already provides a strong foundation--based on lessons from the JHH dataset--for training AI to segment tumors in six organs. It achieves notable improvements over public datasets, with a +7% DSC gain on in-distribution tests and +16% on out-of-distribution tests.
comment: ICCV 2025
Robotics 5
♻ ☆ RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision-Language-Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.
♻ ☆ Beyond the Uncanny Valley: A Mixed-Method Investigation of Anthropomorphism in Protective Responses to Robot Abuse
Robots with anthropomorphic features are increasingly shaping how humans perceive and morally engage with them. Our research investigates how different levels of anthropomorphism influence protective responses to robot abuse, extending the Computers as Social Actors (CASA) and uncanny valley theories into a moral domain. In an experiment, we invite 201 participants to view videos depicting abuse toward a robot with low (Spider), moderate (Two-Foot), or high (Humanoid) anthropomorphism. To provide a comprehensive analysis, we triangulate three modalities: self-report surveys measuring emotions and uncanniness, physiological data from automated facial expression analysis, and qualitative reflections. Findings indicate that protective responses are not linear. The moderately anthropomorphic Two-Foot robot, rated highest in eeriness and "spine-tingling" sensations consistent with the uncanny valley, elicited the strongest physiological anger expressions. Self-reported anger and guilt are significantly higher for both the Two-Foot and Humanoid robots compared to the Spider. Qualitative findings further reveal that as anthropomorphism increases, moral reasoning shifts from technical assessments of property damage to condemnation of the abuser's character, while governance proposals expand from property law to calls for quasi-animal rights and broader societal responsibility. These results suggest that the uncanny valley does not dampen moral concern but paradoxically heightens protective impulses, offering critical implications for robot design, policy, and future legal frameworks.
♻ ☆ Knolling Bot: Teaching Robots the Human Notion of Tidiness NeurIPS 2025
For robots to truly collaborate and assist humans, they must understand not only logic and instructions, but also the subtle emotions, aesthetics, and feelings that define our humanity. Human art and aesthetics are among the most elusive concepts-often difficult even for people to articulate-and without grasping these fundamentals, robots will be unable to help in many spheres of daily life. Consider the long-promised robotic butler: automating domestic chores demands more than motion planning. It requires an internal model of cleanliness and tidiness-a challenge largely unexplored by AI. To bridge this gap, we propose an approach that equips domestic robots to perform simple tidying tasks via knolling, the practice of arranging scattered items into neat, space-efficient layouts. Unlike the uniformity of industrial settings, household environments feature diverse objects and highly subjective notions of tidiness. Drawing inspiration from NLP, we treat knolling as a sequential prediction problem and employ a transformer based model to forecast each object's placement. Our method learns a generalizable concept of tidiness, generates diverse solutions adaptable to varying object sets, and incorporates human preferences for personalized arrangements. This work represents a step forward in building robots that internalize human aesthetic sense and can genuinely co-create in our living spaces.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Creative AI Track
♻ ☆ MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) struggle frequently with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM with this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence gathered during the interactive exploration. Without any fine-tuning, our MindJourney achieves over an average 7.7% performance boost on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Meanwhile, our method also improves upon the test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of our method that utilizes world models for test-time scaling.
comment: Project Page: https://umass-embodied-agi.github.io/MindJourney
♻ ☆ World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training
Vision-Language-Action (VLA) models trained via imitation learning suffer from significant performance degradation in data-scarce scenarios due to their reliance on large-scale demonstration datasets. Although reinforcement learning (RL)-based post-training has proven effective in addressing data scarcity, its application to VLA models is hindered by the non-resettable nature of real-world environments. This limitation is particularly critical in high-risk domains such as industrial automation, where interactions often induce state changes that are costly or infeasible to revert. Furthermore, existing VLA approaches lack a reliable mechanism for detecting task completion, leading to redundant actions that reduce overall task success rates. To address these challenges, we propose World-Env, an RL-based post-training framework that replaces physical interaction with a low-cost, world model-based virtual simulator. World-Env consists of two key components: (1) a video-based world simulator that generates temporally consistent future visual observations, and (2) a vision-language model (VLM)-guided instant reflector that provides continuous reward signals and predicts action termination. This simulated environment enables VLA models to safely explore and generalize beyond their initial imitation learning distribution. Our method achieves notable performance gains with as few as five expert demonstrations per task. Experiments on complex robotic manipulation tasks demonstrate that World-Env effectively overcomes the data inefficiency, safety constraints, and inefficient execution of conventional VLA models that rely on real-world interaction, offering a practical and scalable solution for post-training in resource-constrained settings. Our code is available at https://github.com/amap-cvlab/world-env.
Artificial Intelligence 64
♻ ☆ Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models
Diffusion Large Language Models (DLLMs) have emerged as a new paradigm of language modeling beyond autoregressive next-token prediction. Thanks to their bidirectional attention mechanism, DLLMs are more capable of capturing the connection of context, and thus show unique advantages in challenges like the famous "reversal curse" or learning under data-constrained scenarios. In addition, taking advantage of their inherent modeling foundations, DLLMs have the great potential of efficient inference with parallel decoding algorithms, which enable multi-token prediction per step. However, the high generation quality often requires the number of decoding steps equal to the sequence length, which performs a one-token-per-step decoding, and existing parallel decoding algorithms, which yield suboptimal decoding paths, bring inference speedup at the cost of non-negligible performance degradation. To overcome this challenge, we introduce Free Draft-and-Verification (FreeDave), a novel fast decoding algorithm tailored for DLLMs that achieves lossless parallel decoding without any model modification or extra modules. Specifically, we propose an algorithm of parallel-decoded candidate generation and verification, which is theoretically guaranteed to use the fewest model forward calls to reproduce the same sequence generated by static decoding when enough computation and memory budget is provided. By extensive evaluations on math reasoning and code generation benchmarks across different DLLMs, FreeDave is proven to boost the inference throughput up to $3.78\times$ without performance degradation.
♻ ☆ EmbeddingGemma: Powerful and Lightweight Text Representations
We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.
comment: 18 pages. Models are available in HuggingFace (at https://huggingface.co/collections/google/embeddinggemma-68b9ae3a72a82f0562a80dc4), Kaggle (at https://www.kaggle.com/models/google/embeddinggemma/), and Vertex AI (at https://pantheon.corp.google.com/vertex-ai/publishers/google/model-garden/embeddinggemma)
♻ ☆ CausalARC: Abstract Reasoning with Causal World Models
On-the-fly reasoning often requires adaptation to novel problems under limited data and distribution shift. This work introduces CausalARC: an experimental testbed for AI reasoning in low-data and out-of-distribution regimes, modeled after the Abstraction and Reasoning Corpus (ARC). Each CausalARC reasoning task is sampled from a fully specified causal world model, formally expressed as a structural causal model. Principled data augmentations provide observational, interventional, and counterfactual feedback about the world model in the form of few-shot, in-context learning demonstrations. As a proof-of-concept, we illustrate the use of CausalARC for four language model evaluation settings: (1) abstract reasoning with test-time training, (2) counterfactual reasoning with in-context learning, (3) program synthesis, and (4) causal discovery with logical reasoning. Within- and between-model performance varied heavily across tasks, indicating room for significant improvement in language model reasoning.
comment: Peer-reviewed workshop paper
♻ ☆ AnyMAC: Cascading Flexible Multi-Agent Collaboration via Next-Agent Prediction EMNLP
Recent progress in large language model (LLM)-based multi-agent collaboration highlights the power of structured communication in enabling collective intelligence. However, existing methods largely rely on static or graph-based inter-agent topologies, lacking the potential adaptability and flexibility in communication. In this work, we propose a new framework that rethinks multi-agent coordination through a sequential structure rather than a graph structure, offering a significantly larger topology space for multi-agent communication. Our method focuses on two key directions: (1) Next-Agent Prediction, which selects the most suitable agent role at each step, and (2) Next-Context Selection (NCS), which enables each agent to selectively access relevant information from any previous step. Together, these components construct task-adaptive communication pipelines that support both role flexibility and global information flow. Extensive evaluations across multiple benchmarks demonstrate that our approach achieves superior performance while substantially reducing communication overhead.
comment: EMNLP Main 2025
♻ ☆ LLM Strategic Reasoning: Agentic Study through Behavioral Game Theory NeurIPS 2025
Strategic decision-making involves interactive reasoning where agents adapt their choices in response to others, yet existing evaluations of large language models (LLMs) often emphasize Nash Equilibrium (NE) approximation, overlooking the mechanisms driving their strategic choices. To bridge this gap, we introduce an evaluation framework grounded in behavioral game theory, disentangling reasoning capability from contextual effects. Testing 22 state-of-the-art LLMs, we find that GPT-o3-mini, GPT-o1, and DeepSeek-R1 dominate most games yet also demonstrate that the model scale alone does not determine performance. In terms of prompting enhancement, Chain-of-Thought (CoT) prompting is not universally effective, as it increases strategic reasoning only for models at certain levels while providing limited gains elsewhere. Additionally, we investigate the impact of encoded demographic features on the models, observing that certain assignments impact the decision-making pattern. For instance, GPT-4o shows stronger strategic reasoning with female traits than males, while Gemma assigns higher reasoning levels to heterosexual identities compared to other sexual orientations, indicating inherent biases. These findings underscore the need for ethical standards and contextual alignment to balance improved reasoning with fairness.
comment: Accepted by NeurIPS 2025
♻ ☆ Exploring the Synergy of Quantitative Factors and Newsflow Representations from Large Language Models for Stock Return Prediction
In quantitative investing, return prediction supports various tasks, including stock selection, portfolio optimization, and risk management. Quantitative factors, such as valuation, quality, and growth, capture various characteristics of stocks. Unstructured financial data, like news and transcripts, has attracted growing attention, driven by recent advances in large language models (LLMs). This paper examines effective methods for leveraging multimodal factors and newsflow in return prediction and stock selection. First, we introduce a fusion learning framework to learn a unified representation from factors and newsflow representations generated by an LLM. Within this framework, we compare three representative methods: representation combination, representation summation, and attentive representations. Next, building on empirical observations from fusion learning, we explore the mixture model that adaptively combines predictions made by single modalities and their fusion. To mitigate the training instability observed in the mixture model, we introduce a decoupled training approach with theoretical insights. Finally, our experiments on real investment universes yield several insights into effective multimodal modeling of factors and news for stock return prediction and selection.
♻ ☆ Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems? AACL
The rapid escalation from elementary school-level to frontier problems of the difficulty for LLM benchmarks in recent years have weaved a miracle for researchers that we are only inches away from surpassing human intelligence. However, is the LLMs' remarkable reasoning ability indeed comes from true intelligence by human standards, or are they simply reciting solutions witnessed during training at an Internet level? To study this problem, we propose RoR-Bench, a novel, multi-modal benchmark for detecting LLM's recitation behavior when asked simple reasoning problems but with conditions subtly shifted, and conduct empirical analysis on our benchmark. Surprisingly, we found existing cutting-edge LLMs unanimously exhibits extremely severe recitation behavior; by changing one phrase in the condition, top models such as OpenAI-o1 and DeepSeek-R1 can suffer 60 percent performance loss on elementary school-level arithmetic and reasoning problems. Such findings are a wake-up call to the LLM community that compels us to re-evaluate the true intelligence level of cutting-edge LLMs.
comment: 24 pages, 3 figures, 13 tables. The paper is accepted at AACL-IJCNLP 2025 (main track), and the latest version adds modifications in camera-ready
♻ ☆ Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally NeurIPS 2025
Central banks around the world play a crucial role in maintaining economic stability. Deciphering policy implications in their communications is essential, especially as misinterpretations can disproportionately impact vulnerable populations. To address this, we introduce the World Central Banks (WCB) dataset, the most comprehensive monetary policy corpus to date, comprising over 380k sentences from 25 central banks across diverse geographic regions, spanning 28 years of historical data. After uniformly sampling 1k sentences per bank (25k total) across all available years, we annotate and review each sentence using dual annotators, disagreement resolutions, and secondary expert reviews. We define three tasks: Stance Detection, Temporal Classification, and Uncertainty Estimation, with each sentence annotated for all three. We benchmark seven Pretrained Language Models (PLMs) and nine Large Language Models (LLMs) (Zero-Shot, Few-Shot, and with annotation guide) on these tasks, running 15,075 benchmarking experiments. We find that a model trained on aggregated data across banks significantly surpasses a model trained on an individual bank's data, confirming the principle "the whole is greater than the sum of its parts." Additionally, rigorous human evaluations, error analyses, and predictive tasks validate our framework's economic utility. Our artifacts are accessible through the HuggingFace and GitHub under the CC-BY-NC-SA 4.0 license.
comment: Accepted at NeurIPS 2025 (main conference)
♻ ☆ Exploring Large Language Models for Detecting Mental Disorders EMNLP 2025
This paper compares the effectiveness of traditional machine learning methods, encoder-based models, and large language models (LLMs) on the task of detecting depression and anxiety. Five Russian-language datasets were considered, each differing in format and in the method used to define the target pathology class. We tested AutoML models based on linguistic features, several variations of encoder-based Transformers such as BERT, and state-of-the-art LLMs as pathology classification models. The results demonstrated that LLMs outperform traditional methods, particularly on noisy and small datasets where training examples vary significantly in text length and genre. However, psycholinguistic features and encoder-based models can achieve performance comparable to language models when trained on texts from individuals with clinically confirmed depression, highlighting their potential effectiveness in targeted clinical applications.
comment: Accepted to EMNLP 2025
♻ ☆ OpinioRAG: Towards Generating User-Centric Opinion Highlights from Large-scale Online Reviews
We study the problem of opinion highlights generation from large volumes of user reviews, often exceeding thousands per entity, where existing methods either fail to scale or produce generic, one-size-fits-all summaries that overlook personalized needs. To tackle this, we introduce OpinioRAG, a scalable, training-free framework that combines RAG-based evidence retrieval with LLMs to efficiently produce tailored summaries. Additionally, we propose novel reference-free verification metrics designed for sentiment-rich domains, where accurately capturing opinions and sentiment alignment is essential. These metrics offer a fine-grained, context-sensitive assessment of factual consistency. To facilitate evaluation, we contribute the first large-scale dataset of long-form user reviews, comprising entities with over a thousand reviews each, paired with unbiased expert summaries and manually annotated queries. Through extensive experiments, we identify key challenges, provide actionable insights into improving systems, pave the way for future research, and position OpinioRAG as a robust framework for generating accurate, relevant, and structured summaries at scale.
comment: COLM 2025
♻ ☆ ERA-Solver: Error-Robust Adams Solver for Fast Sampling of Diffusion Probabilistic Models
Though denoising diffusion probabilistic models (DDPMs) have achieved remarkable generation results, the low sampling efficiency of DDPMs still limits further applications. Since DDPMs can be formulated as diffusion ordinary differential equations (ODEs), various fast sampling methods can be derived from solving diffusion ODEs. However, we notice that previous fast sampling methods with fixed analytical form are not able to robust with the various error patterns in the noise estimated from pretrained diffusion models. In this work, we construct an error-robust Adams solver (ERA-Solver), which utilizes the implicit Adams numerical method that consists of a predictor and a corrector. Different from the traditional predictor based on explicit Adams methods, we leverage a Lagrange interpolation function as the predictor, which is further enhanced with an error-robust strategy to adaptively select the Lagrange bases with lower errors in the estimated noise. The proposed solver can be directly applied to any pretrained diffusion models, without extra training. Experiments on Cifar10, CelebA, LSUN-Church, and ImageNet 64 x 64 (conditional) datasets demonstrate that our proposed ERA-Solver achieves 3.54, 5.06, 5.02, and 5.11 Frechet Inception Distance (FID) for image generation, with only 10 network evaluations.
♻ ☆ AI-Generated Video Detection via Perceptual Straightening NeurIPS 2025
The rapid advancement of generative AI enables highly realistic synthetic videos, posing significant challenges for content authentication and raising urgent concerns about misuse. Existing detection methods often struggle with generalization and capturing subtle temporal inconsistencies. We propose ReStraV(Representation Straightening Video), a novel approach to distinguish natural from AI-generated videos. Inspired by the "perceptual straightening" hypothesis -- which suggests real-world video trajectories become more straight in neural representation domain -- we analyze deviations from this expected geometric property. Using a pre-trained self-supervised vision transformer (DINOv2), we quantify the temporal curvature and stepwise distance in the model's representation domain. We aggregate statistics of these measures for each video and train a classifier. Our analysis shows that AI-generated videos exhibit significantly different curvature and distance patterns compared to real videos. A lightweight classifier achieves state-of-the-art detection performance (e.g., 97.17% accuracy and 98.63% AUROC on the VidProM benchmark), substantially outperforming existing image- and video-based methods. ReStraV is computationally efficient, it is offering a low-cost and effective detection solution. This work provides new insights into using neural representation geometry for AI-generated video detection.
comment: Advances in Neural Information Processing Systems 38 (NeurIPS 2025)
♻ ☆ STACKFEED: Structured Textual Actor-Critic Knowledge Base Editing with FeedBack
Large Language Models (LLMs) often generate incorrect or outdated information, especially in low-resource settings or when dealing with private data. To address this, Retrieval-Augmented Generation (RAG) uses external knowledge bases (KBs), but these can also suffer from inaccuracies. We introduce STACKFEED, a novel Structured Textual Actor-Critic Knowledge base editing with FEEDback approach that iteratively refines the KB based on expert feedback using a multi-actor, centralized critic reinforcement learning framework. STACKFEED defines a ReACT actor agent on each document to perform structured edits based on document specific targeted instructions. Experimental results showcase that STACKFEED significantly improves KB quality and performance of the RAG system. We evaluate STACKFEED on low-resource programming problems, modified python packaged and factual question-answering tasks.
♻ ☆ A Study in Dataset Distillation for Image Super-Resolution
Dataset distillation aims to compress large datasets into compact yet highly informative subsets that preserve the training behavior of the original data. While this concept has gained traction in classification, its potential for image Super-Resolution (SR) remains largely untapped. In this work, we conduct the first systematic study of dataset distillation for SR, evaluating both pixel- and latent-space formulations. We show that a distilled dataset, occupying only 8.88% of the original size, can train SR models that retain nearly the same reconstruction fidelity as those trained on full datasets. Furthermore, we analyze how initialization strategies and distillation objectives affect efficiency, convergence, and visual quality. Our findings highlight the feasibility of SR dataset distillation and establish foundational insights for memory- and compute-efficient generative restoration models.
♻ ☆ Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers
The use of copyrighted books for training AI models has led to numerous lawsuits from authors concerned about AI's ability to generate derivative content. Yet it's unclear if these models can generate high quality literary text while emulating authors' styles. To answer this we conducted a preregistered study comparing MFA-trained expert writers with three frontier AI models: ChatGPT, Claude & Gemini in writing up to 450 word excerpts emulating 50 award-winning authors' diverse styles. In blind pairwise evaluations by 159 representative expert & lay readers, AI-generated text from in-context prompting was strongly disfavored by experts for both stylistic fidelity (OR=0.16, p<10^-8) & writing quality (OR=0.13, p<10^-7) but showed mixed results with lay readers. However, fine-tuning ChatGPT on individual authors' complete works completely reversed these findings: experts now favored AI-generated text for stylistic fidelity (OR=8.16, p<10^-13) & writing quality (OR=1.87, p=0.010), with lay readers showing similar shifts. These effects generalize across authors & styles. The fine-tuned outputs were rarely flagged as AI-generated (3% rate v. 97% for in-context prompting) by best AI detectors. Mediation analysis shows this reversal occurs because fine-tuning eliminates detectable AI stylistic quirks (e.g., cliche density) that penalize in-context outputs. While we do not account for additional costs of human effort required to transform raw AI output into cohesive, publishable prose, the median fine-tuning & inference cost of $81 per author represents a dramatic 99.7% reduction compared to typical professional writer compensation. Author-specific fine-tuning thus enables non-verbatim AI writing that readers prefer to expert human writing, providing empirical evidence directly relevant to copyright's fourth fair-use factor, the "effect upon the potential market or value" of the source works.
comment: Preprint Under Review
♻ ☆ Continuous Classification Aggregation
We prove that any optimal, independent, and zero unanimous fuzzy classification aggregation function of a continuum of individual classifications of $m\ge 3$ objects into $2\le p\le m$ types must be a weighted arithmetic mean. We also provide a characterization for the case when $m=p=2$.
comment: 18 pages
♻ ☆ Knowledge-guided Continual Learning for Behavioral Analytics Systems
User behavior on online platforms is evolving, reflecting real-world changes in how people post, whether it's helpful messages or hate speech. Models that learn to capture this content can experience a decrease in performance over time due to data drift, which can lead to ineffective behavioral analytics systems. However, fine-tuning such a model over time with new data can be detrimental due to catastrophic forgetting. Replay-based approaches in continual learning offer a simple yet efficient method to update such models, minimizing forgetting by maintaining a buffer of important training instances from past learned tasks. However, the main limitation of this approach is the fixed size of the buffer. External knowledge bases can be utilized to overcome this limitation through data augmentation. We propose a novel augmentation-based approach to incorporate external knowledge in the replay-based continual learning framework. We evaluate several strategies with three datasets from prior studies related to deviant behavior classification to assess the integration of external knowledge in continual learning and demonstrate that augmentation helps outperform baseline replay-based approaches.
comment: This is a preprint of the accepted paper at IEEE CogMI 2025 - The 7th IEEE International Conference on Cognitive Machine Intelligence
♻ ☆ Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Lung Nodule Malignancy Prediction
Machine learning models have utilized semantic features, deep features, or both to assess lung nodule malignancy. However, their reliance on manual annotation during inference, limited interpretability, and sensitivity to imaging variations hinder their application in real-world clinical settings. Thus, this research aims to integrate semantic features derived from radiologists' assessments of nodules, guiding the model to learn clinically relevant, robust, and explainable imaging features for predicting lung cancer. We obtained 938 low-dose CT scans from the National Lung Screening Trial (NLST) with 1,261 nodules and semantic features. Additionally, the Lung Image Database Consortium dataset contains 1,018 CT scans, with 2,625 lesions annotated for nodule characteristics. Three external datasets were obtained from UCLA Health, the LUNGx Challenge, and the Duke Lung Cancer Screening. We fine-tuned a pretrained Contrastive Language-Image Pretraining (CLIP) model with a parameter-efficient fine-tuning approach to align imaging and semantic text features and predict the one-year lung cancer diagnosis. Our model outperformed state-of-the-art (SOTA) models in the NLST test set with an AUROC of 0.901 and AUPRC of 0.776. It also showed robust results in external datasets. Using CLIP, we also obtained predictions on semantic features through zero-shot inference, such as nodule margin (AUROC: 0.807), nodule consistency (0.812), and pleural attachment (0.840). Our approach surpasses the SOTA models in predicting lung cancer across datasets collected from diverse clinical settings, providing explainable outputs, aiding clinicians in comprehending the underlying meaning of model predictions. This approach also prevents the model from learning shortcuts and generalizes across clinical settings. The code is available at https://github.com/luotingzhuang/CLIP_nodule.
♻ ☆ Learning Repetition-Invariant Representations for Polymer Informatics NeurIPS 2025
Polymers are large macromolecules composed of repeating structural units known as monomers and are widely applied in fields such as energy storage, construction, medicine, and aerospace. However, existing graph neural network methods, though effective for small molecules, only model the single unit of polymers and fail to produce consistent vector representations for the true polymer structure with varying numbers of units. To address this challenge, we introduce Graph Repetition Invariance (GRIN), a novel method to learn polymer representations that are invariant to the number of repeating units in their graph representations. GRIN integrates a graph-based maximum spanning tree alignment with repeat-unit augmentation to ensure structural consistency. We provide theoretical guarantees for repetition-invariance from both model and data perspectives, demonstrating that three repeating units are the minimal augmentation required for optimal invariant representation learning. GRIN outperforms state-of-the-art baselines on both homopolymer and copolymer benchmarks, learning stable, repetition-invariant representations that generalize effectively to polymer chains of unseen sizes.
comment: Accepted to NeurIPS 2025
♻ ☆ Balancing Caregiving and Self-Care: Exploring Mental Health Needs of Alzheimer's and Dementia Caregivers SC
Alzheimer's Disease and Related Dementias (AD/ADRD) are progressive neurodegenerative conditions that impair memory, thought processes, and functioning. Family caregivers of individuals with AD/ADRD face significant mental health challenges due to long-term caregiving responsibilities. Yet, current support systems often overlook the evolving nature of their mental wellbeing needs. Our study examines caregivers' mental wellbeing concerns, focusing on the practices they adopt to manage the burden of caregiving and the technologies they use for support. Through semi-structured interviews with 25 family caregivers of individuals with AD/ADRD, we identified the key causes and effects of mental health challenges, and developed a temporal mapping of how caregivers' mental wellbeing evolves across three distinct stages of the caregiving journey. Additionally, our participants shared insights into improvements for existing mental health technologies, emphasizing the need for accessible, scalable, and personalized solutions that adapt to caregivers' changing needs over time. These findings offer a foundation for designing dynamic, stage-sensitive interventions that holistically support caregivers' mental wellbeing, benefiting both caregivers and care recipients.
comment: This work is published at the Proceedings of the ACM on Human-Computer Interaction (CSCW), 2025
♻ ☆ Recognising, Anticipating, and Mitigating LLM Pollution of Online Behavioural Research
Online behavioural research faces an emerging threat as participants increasingly turn to large language models (LLMs) for advice, translation, or task delegation: LLM Pollution. We identify three interacting variants through which LLM Pollution threatens the validity and integrity of online behavioural research. First, Partial LLM Mediation occurs when participants make selective use of LLMs for specific aspects of a task, such as translation or wording support, leading researchers to (mis)interpret LLM-shaped outputs as human ones. Second, Full LLM Delegation arises when agentic LLMs complete studies with little to no human oversight, undermining the central premise of human-subject research at a more foundational level. Third, LLM Spillover signifies human participants altering their behaviour as they begin to anticipate LLM presence in online studies, even when none are involved. While Partial Mediation and Full Delegation form a continuum of increasing automation, LLM Spillover reflects second-order reactivity effects. Together, these variants interact and generate cascading distortions that compromise sample authenticity, introduce biases that are difficult to detect post hoc, and ultimately undermine the epistemic grounding of online research on human cognition and behaviour. Crucially, the threat of LLM Pollution is already co-evolving with advances in generative AI, creating an escalating methodological arms race. To address this, we propose a multi-layered response spanning researcher practices, platform accountability, and community efforts. As the challenge evolves, coordinated adaptation will be essential to safeguard methodological integrity and preserve the validity of online behavioural research.
♻ ☆ Will Humanity Be Rendered Obsolete by AI?
This article analyzes the existential risks artificial intelligence (AI) poses to humanity, tracing the trajectory from current AI to ultraintelligence. Drawing on Irving J. Good and Nick Bostrom's theoretical work, plus recent publications (AI 2027; If Anyone Builds It, Everyone Dies), it explores AGI and superintelligence. Considering machines' exponentially growing cognitive power and hypothetical IQs, it addresses the ethical and existential implications of an intelligence vastly exceeding humanity's, fundamentally alien. Human extinction may result not from malice, but from uncontrollable, indifferent cognitive superiority.
♻ ☆ Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency
Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, but they remain susceptible to hallucinations--generating content that appears plausible but contains factual inaccuracies. We present Finch-Zk, a black-box framework that leverages fine-grained cross-model consistency to detect and mitigate hallucinations in LLM outputs without requiring external knowledge sources. Finch-Zk introduces two key innovations: 1) a cross-model consistency checking strategy that reveals fine-grained inaccuracies by comparing responses generated by diverse models from semantically-equivalent prompts, and 2) a targeted mitigation technique that applies precise corrections to problematic segments while preserving accurate content. Experiments on the FELM dataset show Finch-Zk improves hallucination detection F1 scores by 6-39\% compared to existing approaches. For mitigation, Finch-Zk achieves up to 9 absolute percentage points improvement in answer accuracy on the GPQA-diamond dataset when applied to state-of-the-art models like Llama 4 Maverick and Claude 4 Sonnet. Extensive evaluation on multiple datasets demonstrates that Finch-Zk provides a practical, deployment-ready safeguard for enhancing factual reliability in production LLM systems.
♻ ☆ Adversarial Distilled Retrieval-Augmented Guarding Model for Online Malicious Intent Detection
With the deployment of Large Language Models (LLMs) in interactive applications, online malicious intent detection has become increasingly critical. However, existing approaches fall short of handling diverse and complex user queries in real time. To address these challenges, we introduce ADRAG (Adversarial Distilled Retrieval-Augmented Guard), a two-stage framework for robust and efficient online malicious intent detection. In the training stage, a high-capacity teacher model is trained on adversarially perturbed, retrieval-augmented inputs to learn robust decision boundaries over diverse and complex user queries. In the inference stage, a distillation scheduler transfers the teacher's knowledge into a compact student model, with a continually updated knowledge base collected online. At deployment, the compact student model leverages top-K similar safety exemplars retrieved from the online-updated knowledge base to enable both online and real-time malicious query detection. Evaluations across ten safety benchmarks demonstrate that ADRAG, with a 149M-parameter model, achieves 98.5% of WildGuard-7B's performance, surpasses GPT-4 by 3.3% and Llama-Guard-3-8B by 9.5% on out-of-distribution detection, while simultaneously delivering up to 5.6x lower latency at 300 queries per second (QPS) in real-time applications.
♻ ☆ A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models EMNLP 2025
Chain-of-thought (CoT) reasoning enhances performance of large language models, but questions remain about whether these reasoning traces faithfully reflect the internal processes of the model. We present the first comprehensive study of CoT faithfulness in large vision-language models (LVLMs), investigating how both text-based and previously unexplored image-based biases affect reasoning and bias articulation. Our work introduces a novel, fine-grained evaluation pipeline for categorizing bias articulation patterns, enabling significantly more precise analysis of CoT reasoning than previous methods. This framework reveals critical distinctions in how models process and respond to different types of biases, providing new insights into LVLM CoT faithfulness. Our findings reveal that subtle image-based biases are rarely articulated compared to explicit text-based ones, even in models specialized for reasoning. Additionally, many models exhibit a previously unidentified phenomenon we term ``inconsistent'' reasoning - correctly reasoning before abruptly changing answers, serving as a potential canary for detecting biased reasoning from unfaithful CoTs. We then apply the same evaluation pipeline to revisit CoT faithfulness in LLMs across various levels of implicit cues. Our findings reveal that current language-only reasoning models continue to struggle with articulating cues that are not overtly stated.
comment: Accepted in EMNLP 2025, 34 pages, 25 figures
♻ ☆ Combinative Matching for Geometric Shape Assembly ICCV 2025
This paper introduces a new shape-matching methodology, combinative matching, to combine interlocking parts for geometric shape assembly. Previous methods for geometric assembly typically rely on aligning parts by finding identical surfaces between the parts as in conventional shape matching and registration. In contrast, we explicitly model two distinct properties of interlocking shapes: 'identical surface shape' and 'opposite volume occupancy.' Our method thus learns to establish correspondences across regions where their surface shapes appear identical but their volumes occupy the inverted space to each other. To facilitate this process, we also learn to align regions in rotation by estimating their shape orientations via equivariant neural networks. The proposed approach significantly reduces local ambiguities in matching and allows a robust combination of parts in assembly. Experimental results on geometric assembly benchmarks demonstrate the efficacy of our method, consistently outperforming the state of the art. Project page: https://nahyuklee.github.io/cmnet.
comment: Accepted to ICCV 2025 (Highlight)
♻ ☆ What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Jailbreaks have been a central focus of research regarding the safety and reliability of large language models (LLMs), yet the mechanisms underlying these attacks remain poorly understood. While previous studies have predominantly relied on linear methods to detect jailbreak attempts and model refusals, we take a different approach by examining both linear and non-linear features in prompts that lead to successful jailbreaks. First, we introduce a novel dataset comprising 10,800 jailbreak attempts spanning 35 diverse attack methods. Leveraging this dataset, we train linear and non-linear probes on hidden states of open-weight LLMs to predict jailbreak success. Probes achieve strong in-distribution accuracy but transfer is attack-family-specific, revealing that different jailbreaks are supported by distinct internal mechanisms rather than a single universal direction. To establish causal relevance, we construct probe-guided latent interventions that systematically shift compliance in the predicted direction. Interventions derived from non-linear probes produce larger and more reliable effects than those from linear probes, indicating that features linked to jailbreak success are encoded non-linearly in prompt representations. Overall, the results surface heterogeneous, non-linear structure in jailbreak mechanisms and provide a prompt-side methodology for recovering and testing the features that drive jailbreak outcomes.
♻ ☆ On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study
Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This benefit is not due to length alone-injecting arbitrary redundancy into reasoning traces fails to help and can even hurt performance. Instead, we find that generalization correlates with the model's confidence in next-token prediction, suggesting that long, coherent, and locally incremental traces make the training signal easier to optimize.
♻ ☆ Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models ICML 2025
A core data-centric learning challenge is the identification of training samples that are detrimental to model performance. Influence functions serve as a prominent tool for this task and offer a robust framework for assessing training data influence on model predictions. Despite their widespread use, their high computational cost associated with calculating the inverse of the Hessian matrix pose constraints, particularly when analyzing large-sized deep models. In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection. This transformation not only presents a straightforward and Hessian-free formulation but also provides insights into the role of the gradient in sample impact. Through systematic empirical evaluations, we first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets. We then demonstrate its effectiveness in detecting mislabeled samples in vision models and selecting data samples for improving performance of natural language processing transformer models. We also extend its use to influential sample identification for fine-tuning Large Language Models.
comment: Accepted to ICML 2025 (Oral)
♻ ☆ RL Fine-Tuning Heals OOD Forgetting in SFT
The two-stage fine-tuning paradigm of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has empirically shown better reasoning performance than one-stage SFT for the post-training of Large Language Models (LLMs). However, the evolution and mechanism behind the synergy of SFT and RL are still under-explored and inconclusive. In our study, we find the well-known claim "SFT memorizes, RL generalizes" is over-simplified, and discover that: (1) OOD performance peaks at the early stage of SFT and then declines (OOD forgetting), the best SFT checkpoint cannot be captured by training/test loss; (2) the subsequent RL stage does not generate fundamentally better OOD capability, instead it plays an \textbf{OOD restoration} role, recovering the lost reasoning ability during SFT; (3) The recovery ability has boundaries, \ie{} \textbf{if SFT trains for too short or too long, RL cannot recover the lost OOD ability;} (4) To uncover the underlying mechanisms behind the forgetting and restoration process, we employ SVD analysis on parameter matrices, manually edit them, and observe their impacts on model performance. Unlike the common belief that the shift of model capacity mainly results from the changes of singular values, we find that they are actually quite stable throughout fine-tuning. Instead, the OOD behavior strongly correlates with the \textbf{rotation of singular vectors}. Our findings re-identify the roles of SFT and RL in the two-stage fine-tuning and discover the rotation of singular vectors as the key mechanism. %reversing the rotations induced by SFT, which shows recovery from forgetting, whereas imposing the SFT parameter directions onto a RL-tuned model results in performance degradation. Code is available at https://github.com/xiaodanguoguo/RL_Heals_SFT
comment: 24 pages, 18 figures
♻ ☆ A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift
Transformer-based architectures have recently advanced the image reconstruction quality of super-resolution (SR) models. Yet, their scalability remains limited by quadratic attention costs and coarse patch embeddings that weaken pixel-level fidelity. We propose TaylorIR, a plug-and-play framework that enforces 1x1 patch embeddings for true pixel-wise reasoning and replaces conventional self-attention with TaylorShift, a Taylor-series-based attention mechanism enabling full token interactions with near-linear complexity. Across multiple SR benchmarks, TaylorIR delivers state-of-the-art performance while reducing memory consumption by up to 60%, effectively bridging the gap between fine-grained detail restoration and efficient transformer scaling.
♻ ☆ Knolling Bot: Teaching Robots the Human Notion of Tidiness NeurIPS 2025
For robots to truly collaborate and assist humans, they must understand not only logic and instructions, but also the subtle emotions, aesthetics, and feelings that define our humanity. Human art and aesthetics are among the most elusive concepts-often difficult even for people to articulate-and without grasping these fundamentals, robots will be unable to help in many spheres of daily life. Consider the long-promised robotic butler: automating domestic chores demands more than motion planning. It requires an internal model of cleanliness and tidiness-a challenge largely unexplored by AI. To bridge this gap, we propose an approach that equips domestic robots to perform simple tidying tasks via knolling, the practice of arranging scattered items into neat, space-efficient layouts. Unlike the uniformity of industrial settings, household environments feature diverse objects and highly subjective notions of tidiness. Drawing inspiration from NLP, we treat knolling as a sequential prediction problem and employ a transformer based model to forecast each object's placement. Our method learns a generalizable concept of tidiness, generates diverse solutions adaptable to varying object sets, and incorporates human preferences for personalized arrangements. This work represents a step forward in building robots that internalize human aesthetic sense and can genuinely co-create in our living spaces.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Creative AI Track
♻ ☆ A Collectivist, Economic Perspective on AI
Information technology is in the midst of a revolution in which omnipresent data collection and machine learning are impacting the human world as never before. The word "intelligence" is being used as a North Star for the development of this technology, with human cognition viewed as a baseline. This view neglects the fact that humans are social animals and that much of our intelligence is social and cultural in origin. Moreover, failing to properly situate aspects of intelligence at the social level contributes to the treatment of the societal consequences of technology as an afterthought. The path forward is not merely more data and compute, and not merely more attention paid to cognitive or symbolic representations, but a thorough blending of economic and social concepts with computational and inferential concepts at the level of algorithm design.
♻ ☆ THFlow: A Temporally Hierarchical Flow Matching Framework for 3D Peptide Design
Deep generative models provide a promising approach to de novo 3D peptide design. Most of them jointly model the distributions of peptide's position, orientation, and conformation, attempting to simultaneously converge to the target pocket. However, in the early stage of docking, optimizing conformation-only modalities such as rotation and torsion can be physically meaningless, as the peptide is initialized far from the protein pocket and no interaction field is present. We define this problem as the multimodal temporal inconsistency problem and claim it is a key factor contributing to low binding affinity in generated peptides. To address this challenge, we propose THFlow, a novel flow matching-based multimodal generative model that explicitly models the temporal hierarchy between peptide position and conformation. It employs a polynomial based conditional flow to accelerate positional convergence early on, and later aligns it with rotation and torsion for coordinated conformation refinement under the emerging interaction field. Additionally, we incorporate interaction-related features, such as polarity, to further enhance the model's understanding of peptide-protein binding. Extensive experiments demonstrate that THFlow outperforms existing methods in generating peptides with superior stability, affinity, and diversity, offering an effective and accurate solution for advancing peptide-based therapeutic development.
♻ ☆ PPMI: Privacy-Preserving LLM Interaction with Socratic Chain-of-Thought Reasoning and Homomorphically Encrypted Vector Databases
Large language models (LLMs) are increasingly used as personal agents, accessing sensitive user data such as calendars, emails, and medical records. Users currently face a trade-off: They can send private records, many of which are stored in remote databases, to powerful but untrusted LLM providers, increasing their exposure risk. Alternatively, they can run less powerful models locally on trusted devices. We bridge this gap. Our Socratic Chain-of-Thought Reasoning first sends a generic, non-private user query to a powerful, untrusted LLM, which generates a Chain-of-Thought (CoT) prompt and detailed sub-queries without accessing user data. Next, we embed these sub-queries and perform encrypted sub-second semantic search using our Homomorphically Encrypted Vector Database across one million entries of a single user's private data. This represents a realistic scale of personal documents, emails, and records accumulated over years of digital activity. Finally, we feed the CoT prompt and the decrypted records to a local language model and generate the final response. On the LoCoMo long-context QA benchmark, our hybrid framework, combining GPT-4o with a local Llama-3.2-1B model, outperforms using GPT-4o alone by up to 7.1 percentage points. This demonstrates a first step toward systems where tasks are decomposed and split between untrusted strong LLMs and weak local ones, preserving user privacy.
comment: 29 pages
♻ ☆ Incivility and Rigidity: Evaluating the Risks of Fine-Tuning LLMs for Political Argumentation
Incivility on platforms such as Twitter (now X) and Reddit complicates the development of AI systems that can support productive, rhetorically sound political argumentation. We present experiments with \textit{GPT-3.5 Turbo} fine-tuned on two contrasting datasets of political discourse: high-incivility Twitter replies to U.S. Congress and low-incivility posts from Reddit's \textit{r/ChangeMyView}. Our evaluation examines how data composition and prompting strategies affect the rhetorical framing and deliberative quality of model-generated arguments. Results show that Reddit-finetuned models generate safer but rhetorically rigid arguments, while cross-platform fine-tuning amplifies adversarial tone and toxicity. Prompt-based steering reduces overt toxicity (e.g., personal attacks) but cannot fully offset the influence of noisy training data. We introduce a rhetorical evaluation rubric - covering justification, reciprocity, alignment, and authority - and provide implementation guidelines for authoring, moderation, and deliberation-support systems.
♻ ☆ Language Native Lightly Structured Databases for Large Language Model Driven Composite Materials Research
The preparation procedures of materials are often embedded narratively in experimental protocols, research articles, patents, and laboratory notes, and are structured around procedural sequences, causal relationships, and conditional logic. The synthesis of boron nitride nanosheet (BNNS) polymer composites exemplifies this linguistically encoded decision-making system, where the practical experiments involve interdependent multistage and path-dependent processes such as exfoliation, functionalization, and dispersion, each governed by heterogeneous parameters and contextual contingencies, challenging conventional numerical optimization paradigms for experiment design. We reformulate this challenge into a text-reasoning problem through a framework centered on a text-first, lightly structured materials database and large language models (LLMs) as text reasoning engines. We constructed a database that captures evidence-linked narrative excerpts from the literature while normalizing only the minimum necessary entities, attributes, and relations to enable composite retrieval that unifies semantic matching, lexical cues, and explicit value filters. Building on this language-native, provenance-preserving foundation, the LLM operates in two complementary modes: retrieval-augmented generation (RAG), grounding outputs in retrieved evidence modules from the database, and experience-augmented reasoning (EAR), which leverages iteratively trained text guides derived from multi-source literature-based narrative data as external references to inform reasoning and decision-making. Applying this integration-and-reasoning framework, we demonstrate rapid, laboratory-scale optimization of BNNS preparation, highlighting how language-native data combined with LLM-based reasoning can significantly accelerate practical material preparation.
♻ ☆ DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models
Reasoning about spatial audio with large language models requires a spatial audio encoder as an acoustic front-end to obtain audio embeddings for further processing. Such an encoder needs to capture all information required to detect the type of sound events, as well as the direction and distance of their corresponding sources. Accomplishing this with a single audio encoder is demanding as the information required for each of these tasks is mostly independent of each other. As a result, the performance obtained with a single encoder is often worse than when using task-specific audio encoders. In this work, we present DSpAST, a novel audio encoder based on SpatialAST that learns disentangled representations of spatial audio while having only 0.2% additional parameters. Experiments on SpatialSoundQA with the spatial audio reasoning system BAT demonstrate that DSpAST significantly outperforms SpatialAST.
♻ ☆ Policy Optimized Text-to-Image Pipeline Design
Text-to-image generation has evolved beyond single monolithic models to complex multi-component pipelines. These combine fine-tuned generators, adapters, upscaling blocks and even editing steps, leading to significant improvements in image quality. However, their effective design requires substantial expertise. Recent approaches have shown promise in automating this process through large language models (LLMs), but they suffer from two critical limitations: extensive computational requirements from generating images with hundreds of predefined pipelines, and poor generalization beyond memorized training examples. We introduce a novel reinforcement learning-based framework that addresses these inefficiencies. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations, eliminating the need for costly image generation during training. We then implement a two-phase training strategy: initial workflow vocabulary training followed by GRPO-based optimization that guides the model toward higher-performing regions of the workflow space. Additionally, we incorporate a classifier-free guidance based enhancement technique that extrapolates along the path between the initial and GRPO-tuned models, further improving output quality. We validate our approach through a set of comparisons, showing that it can successfully create new flows with greater diversity and lead to superior image quality compared to existing baselines.
♻ ☆ ADPO: Anchored Direct Preference Optimization
Direct Preference Optimization (DPO) has become a standard for aligning models with human feedback, yet its reliance on hard, pairwise preferences makes it brittle to annotator noise and distribution shift. We propose Anchored Direct Preference Optimization (ADPO), a generalized framework that learns from soft, listwise supervision by anchoring policy updates to a reference model. Our key theoretical contribution is to show that this anchoring mechanism imposes an implicit trust region on the policy update, enforced by the softmax Fisher information metric. This provides a robust geometric interpretation for both fixed and dynamic anchor strategies. Our central empirical finding is a task-dependent tradeoff between anchor update strategies. Through controlled experiments across twelve scenarios and two MuJoCo environments, we demonstrate that (1) for online exploration in noisy environments, a dynamic anchor that tracks the learning policy is superior, improving performance by 5 to 11 percent over a fixed anchor; and (2) for offline distillation, a fixed anchor pointing to the teacher policy is dramatically more effective, achieving returns of 206.7 on HalfCheetah-v5 (387 percent of teacher) and 65.4 on Hopper-v5 (61 percent of teacher), while reducing KL divergence to the teacher by up to 5000 times compared with standard knowledge distillation. These findings offer clear, practical guidance for selecting anchor strategies and establish ADPO as a robust, unified framework for preference learning. Larger models further amplify ADPO's benefits (0.718 vs. 0.416 at hidden dimension 256), suggesting that anchoring acts as an effective trust-region regularizer. We release code and configurations to facilitate reproducibility.
♻ ☆ MultiMatch: Multihead Consistency Regularization Matching for Semi-Supervised Text Classification EMNLP 2025
We introduce MultiMatch, a novel semi-supervised learning (SSL) algorithm combining the paradigms of co-training and consistency regularization with pseudo-labeling. At its core, MultiMatch features a pseudo-label weighting module designed for selecting and filtering pseudo-labels based on head agreement and model confidence, and weighting them according to the perceived classification difficulty. This novel module enhances and unifies three existing techniques -- heads agreement from Multihead Co-training, self-adaptive thresholds from FreeMatch, and Average Pseudo-Margins from MarginMatch -- resulting in a holistic approach that improves robustness and performance in SSL settings. Experimental results on benchmark datasets highlight the superior performance of MultiMatch, i.e., MultiMatch achieves state-of-the-art results on 8 out of 10 setups from 5 natural language processing datasets and ranks first according to the Friedman test among 21 methods. Furthermore, MultiMatch demonstrates exceptional robustness in highly imbalanced settings, outperforming the second-best approach by 3.26%, a critical advantage for real-world text classification tasks. Our code is available on GitHub.
comment: This is the camera-ready version of the paper, accepted for publication in the Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)
♻ ☆ The Narrative Continuity Test: A Conceptual Framework for Evaluating Identity Persistence in AI Systems
Artificial intelligence systems based on large language models (LLMs) can now generate coherent text, music, and images, yet they operate without a persistent state: each inference reconstructs context from scratch. This paper introduces the Narrative Continuity Test (NCT) -- a conceptual framework for evaluating identity persistence and diachronic coherence in AI systems. Unlike capability benchmarks that assess task performance, the NCT examines whether an LLM remains the same interlocutor across time and interaction gaps. The framework defines five necessary axes -- Situated Memory, Goal Persistence, Autonomous Self-Correction, Stylistic & Semantic Stability, and Persona/Role Continuity -- and explains why current architectures systematically fail to support them. Case analyses (Character.\,AI, Grok, Replit, Air Canada) show predictable continuity failures under stateless inference. The NCT reframes AI evaluation from performance to persistence, outlining conceptual requirements for future benchmarks and architectural designs that could sustain long-term identity and goal coherence in generative models.
comment: 33 pages, 127 references v2: Minor editorial revision: redundant phrasing reduced, punctuation and formatting improved; no conceptual or data changes
♻ ☆ H-NeiFi: Non-Invasive and Consensus-Efficient Multi-Agent Opinion Guidance
The openness of social media enables the free exchange of opinions, but it also presents challenges in guiding opinion evolution towards global consensus. Existing methods often directly modify user views or enforce cross-group connections. These intrusive interventions undermine user autonomy, provoke psychological resistance, and reduce the efficiency of global consensus. Additionally, due to the lack of a long-term perspective, promoting local consensus often exacerbates divisions at the macro level. To address these issues, we propose the hierarchical, non-intrusive opinion guidance framework, H-NeiFi. It first establishes a two-layer dynamic model based on social roles, considering the behavioral characteristics of both experts and non-experts. Additionally, we introduce a non-intrusive neighbor filtering method that adaptively controls user communication channels. Using multi-agent reinforcement learning (MARL), we optimize information propagation paths through a long-term reward function, avoiding direct interference with user interactions. Experiments show that H-NeiFi increases consensus speed by 22.0% to 30.7% and maintains global convergence even in the absence of experts. This approach enables natural and efficient consensus guidance by protecting user interaction autonomy, offering a new paradigm for social network governance.
♻ ☆ Exploring the Hidden Capacity of LLMs for One-Step Text Generation EMNLP2025
A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts - up to thousands of tokens - via autoregressive generation from just one trained input embedding. In this work, we explore whether autoregressive decoding is essential for such reconstruction. We show that frozen LLMs can generate hundreds of accurate tokens in just one token-parallel forward pass, when provided with only two learned embeddings. This reveals a surprising and underexplored multi-token generation capability of autoregressive LLMs. We examine these embeddings and characterize the information they encode. We also empirically show that, although these representations are not unique for a given text, they form connected and local regions in embedding space - suggesting the potential to train a practical encoder. The existence of such representations hints that multi-token generation may be natively accessible in off-the-shelf LLMs via a learned input encoder, eliminating heavy retraining and helping to overcome the fundamental bottleneck of autoregressive decoding while reusing already-trained models.
comment: accepted to EMNLP2025 main
♻ ☆ VideoExplorer: Think With Videos For Agentic Long-Video Understanding
Long-video understanding~(LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of ``thinking with video'', which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository(https://github.com/yhy-2000/VideoDeepResearch).
♻ ☆ Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning
Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks. Additionally, few methods have developed automated pipelines to assist with the complex responsibilities of teachers leveraging MLLMs, largely due to model hallucination and instability, which lead to unreliable implementation. To address this gap, we introduce PBLBench, a novel benchmark designed to evaluate complex reasoning grounded in domain-specific knowledge and long-context understanding, thereby challenging models with tasks that closely resemble those handled by human experts. To establish reliable ground truth, we adopt the Analytic Hierarchy Process (AHP), utilizing expert-driven pairwise comparisons to derive structured and weighted evaluation criteria. We assess the performance of 15 leading MLLMs/LLMs using PBLBench and demonstrate that even the most advanced models achieve only 59% rank accuracy, underscoring the significant challenges presented by this benchmark. We believe PBLBench will serve as a catalyst for the development of more capable AI agents, ultimately aiming to alleviate teacher workload and enhance educational productivity.
♻ ☆ A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation
Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent \textit{multi-step iterations} and \textit{complex backbone networks} lead to prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, \textbf{Diffusion Caching} offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process. By enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching evolves from \textit{static reuse} to \textit{dynamic prediction}. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time and efficient generative AI, injecting new vitality into both theory and practice of \textit{Efficient Generative Intelligence}.
comment: 22 pages,2 figures
♻ ☆ KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems NeurIPS2025
Multi-agent large language model (LLM) systems are increasingly adopted for complex language processing tasks that require communication and coordination among agents. However, these systems often suffer substantial overhead from repeated reprocessing of overlapping contexts across agents. In typical pipelines, once an agent receives a message from its predecessor, the full context-including prior turns-must be reprocessed from scratch, leading to inefficient processing. While key-value (KV) caching is an effective solution for avoiding redundant computation in single-agent settings where prefixes remain unchanged, it cannot be directly reused in multi-agent scenarios due to diverging prefixes introduced by agent-specific context extensions. We identify that the core challenge lies in the offset variance of KV-caches across agents. To address this, we propose KVCOMM, a training-free framework that enables efficient prefilling in multi-agent inference by reusing KV-caches and aligning cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples-termed anchors-that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves over 70% reuse rate across diverse multi-agent workloads, including retrieval-augmented generation, math reasoning, and collaborative coding tasks, all without quality degradation. Particularly, when each fully-connected agent receives 1K input tokens with 512 prefix tokens and 512 output tokens under a five-agent setting, KVCOMM achieves up to 7.8x speedup compared to the standard prefill pipeline, reducing TTFT from ~430 ms to ~55 ms.
comment: Accepted for publication in NeurIPS2025. Code is available at \url{https://github.com/FastMAS/KVCOMM}
♻ ☆ DMol: A Highly Efficient and Chemical Motif-Preserving Molecule Generation Platform
We introduce a new graph diffusion model for small molecule generation, DMol, which outperforms the state-of-the-art DiGress model in terms of validity by roughly 1.5% across all benchmarking datasets while reducing the number of diffusion steps by at least 10-fold, and the running time to roughly one half. The performance improvements are a result of a careful change in the objective function and a graph noise scheduling approach which, at each diffusion step, allows one to only change a subset of nodes of varying size in the molecule graph. Another relevant property of the method is that it can be easily combined with junction-tree-like graph representations that arise by compressing a collection of relevant ring structures into supernodes. Unlike classical junction-tree techniques that involve VAEs and require complicated reconstruction steps, compressed DMol directly performs graph diffusion on a graph that compresses only a carefully selected set of frequent carbon rings into supernodes, which results in straightforward sample generation. This compressed DMol method offers additional validity improvements over generic DMol of roughly 2%, increases the novelty of the method, and further improves the running time due to reductions in the graph size.
♻ ☆ Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in Large Language Models NeurIPS 2025
Pre-training large language models (LLMs) on vast text corpora enhances natural language processing capabilities but risks encoding social biases, particularly gender bias. While parameter-modification methods like fine-tuning mitigate bias, they are resource-intensive, unsuitable for closed-source models, and lack adaptability to evolving societal norms. Instruction-based approaches offer flexibility but often compromise task performance. To address these limitations, we propose $\textbf{FaIRMaker}$, an automated and model-independent framework that employs an $\textbf{auto-search and refinement}$ paradigm to adaptively generate Fairwords, which act as instructions integrated into input queries to reduce gender bias and enhance response quality. Extensive experiments demonstrate that FaIRMaker automatically searches for and dynamically refines Fairwords, effectively mitigating gender bias while preserving task integrity and ensuring compatibility with both API-based and open-source LLMs.
comment: Accepted to NeurIPS 2025
♻ ☆ IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages? AACL 2025
Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately $\sim$47K sentences. Surprisingly, our probing analysis of surface, syntactic, and semantic properties reveals that while almost all multilingual models demonstrate consistent encoding performance for English, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages. We make our code and dataset publicly available [https://github.com/aforakhilesh/IndicBertology].
comment: 25 pages, 11 figures, Accepted at IJCNLP-AACL 2025 Findings
♻ ☆ Chain of Retrieval: Multi-Aspect Iterative Search Expansion and Post-Order Search Aggregation for Full Paper Retrieval
Scientific paper retrieval, particularly framed as document-to-document retrieval, aims to identify relevant papers in response to a long-form query paper, rather than a short query string. Previous approaches to this task have focused exclusively on abstracts, embedding them into dense vectors as surrogates for full documents and calculating similarity between them. Yet, abstracts offer only sparse and high-level summaries, and such methods primarily optimize one-to-one similarity, overlooking the dynamic relations that emerge among relevant papers during the retrieval process. To address this, we propose Chain of Retrieval(COR), a novel iterative framework for full-paper retrieval. Specifically, CoR decomposes each query paper into multiple aspect-specific views, matches them against segmented candidate papers, and iteratively expands the search by promoting top-ranked results as new queries, thereby forming a tree-structured retrieval process. The resulting retrieval tree is then aggregated in a post-order manner: descendants are first combined at the query level, then recursively merged with their parent nodes, to capture hierarchical relations across iterations. To validate this, we present SCIFULLBENCH, a large-scale benchmark providing both complete and segmented contexts of full papers for queries and candidates, and results show that CoR significantly outperforms existing retrieval baselines. Our code and dataset is available at https://github.com/psw0021/Chain-of-Retrieval.git.
♻ ☆ MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools
Large Language Models (LLMs) increasingly rely on external tools to perform complex, realistic tasks, yet their ability to utilize the rapidly expanding Model Contextual Protocol (MCP) ecosystem remains limited. Existing MCP research covers few servers, depends on costly manual curation, and lacks training support, hindering progress toward real-world deployment. To overcome these limitations, we introduce MCP-Flow, an automated web-agent-driven pipeline for large-scale server discovery, data synthesis, and model training. MCP-Flow collects and filters data from 1166 servers and 11536 tools, producing 68733 high-quality instruction-function call pairs and 6439 trajectories, far exceeding prior work in scale and diversity. Extensive experiments demonstrate MCP-Flow's effectiveness in driving superior MCP tool selection, function-call generation, and enhanced agentic task performance. MCP-Flow thus provides a scalable foundation for advancing LLM agents' proficiency in real-world MCP environments. MCP-Flow is publicly available at \href{https://github.com/wwh0411/MCP-Flow}{https://github.com/wwh0411/MCP-Flow}.
comment: Preprint, Under Review
♻ ☆ Jarvis: Towards Personalized AI Assistant via Personal KV-Cache Retrieval
The rapid development of Vision-language models (VLMs) enables open-ended perception and reasoning. Recent works have started to investigate how to adapt general-purpose VLMs into personalized assistants. Even commercial models such as ChatGPT now support model personalization by incorporating user-specific information. However, existing methods either learn a set of concept tokens or train a VLM to utilize user-specific information. However, both pipelines struggle to generate accurate answers as personalized assistants. We introduce Jarvis, an innovative framework for a personalized AI assistant through personal KV-Cache retrieval, which stores user-specific information in the KV-Caches of both textual and visual tokens. The textual tokens are created by summarizing user information into metadata, while the visual tokens are produced by extracting distinct image patches from the user's images. When answering a question, Jarvis first retrieves related KV-Caches from personal storage and uses them to ensure accuracy in responses. We also introduce a fine-grained benchmark built with the same distinct image patch mining pipeline, emphasizing accurate question answering based on fine-grained user-specific information. Jarvis is capable of providing more accurate responses, particularly when they depend on specific local details. Jarvis achieves state-of-the-art results in both visual question answering and text-only tasks across multiple datasets, indicating a practical path toward personalized AI assistants. The code and dataset will be released.
comment: 19 pages, 7 figures
♻ ☆ Abstraction Alignment: Comparing Model-Learned and Human-Encoded Conceptual Relationships
While interpretability methods identify a model's learned concepts, they overlook the relationships between concepts that make up its abstractions and inform its ability to generalize to new data. To assess whether models' have learned human-aligned abstractions, we introduce abstraction alignment, a methodology to compare model behavior against formal human knowledge. Abstraction alignment externalizes domain-specific human knowledge as an abstraction graph, a set of pertinent concepts spanning levels of abstraction. Using the abstraction graph as a ground truth, abstraction alignment measures the alignment of a model's behavior by determining how much of its uncertainty is accounted for by the human abstractions. By aggregating abstraction alignment across entire datasets, users can test alignment hypotheses, such as which human concepts the model has learned and where misalignments recur. In evaluations with experts, abstraction alignment differentiates seemingly similar errors, improves the verbosity of existing model-quality metrics, and uncovers improvements to current human abstractions.
comment: 20 pages, 7 figures, published in CHI 2025
♻ ☆ MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) struggle frequently with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM with this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence gathered during the interactive exploration. Without any fine-tuning, our MindJourney achieves over an average 7.7% performance boost on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Meanwhile, our method also improves upon the test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of our method that utilizes world models for test-time scaling.
comment: Project Page: https://umass-embodied-agi.github.io/MindJourney
♻ ☆ CoP: Agentic Red-teaming for Large Language Models using Composition of Principles
Recent advances in Large Language Models (LLMs) have spurred transformative applications in various domains, ranging from open-source to proprietary LLMs. However, jailbreak attacks, which aim to break safety alignment and user compliance by tricking the target LLMs into answering harmful and risky responses, are becoming an urgent concern. The practice of red-teaming for LLMs is to proactively explore potential risks and error-prone instances before the release of frontier AI technology. This paper proposes an agentic workflow to automate and scale the red-teaming process of LLMs through the Composition-of-Principles (CoP) framework, where human users provide a set of red-teaming principles as instructions to an AI agent to automatically orchestrate effective red-teaming strategies and generate jailbreak prompts. Distinct from existing red-teaming methods, our CoP framework provides a unified and extensible framework to encompass and orchestrate human-provided red-teaming principles to enable the automated discovery of new red-teaming strategies. When tested against leading LLMs, CoP reveals unprecedented safety risks by finding novel jailbreak prompts and improving the best-known single-turn attack success rate by up to 19.0 times.
♻ ☆ PrunedLoRA: Robust Gradient-Based structured pruning for Low-rank Adaptation in Fine-tuning
Low-rank adaptation (LoRA) has become a widely used paradigm for parameter-efficient fine-tuning of large language models, yet its representational capacity often lags behind full fine-tuning. Within the context of LoRA, a key open question is how to obtain expressive low-rank adapters from over-parameterized spaces. We propose \textit{PrunedLoRA}, a new framework that leverages structured pruning to obtain highly representative low-rank adapters from an over-parameterized initialization. Unlike prior approaches that impose a fixed low-rank budget, PrunedLoRA dynamically prunes less important components during fine-tuning and prevents their reactivation, enabling flexible and adaptive rank allocation. For structured pruning, by minimizing the pruning error for overall loss, we provide fine-grained pruning and recovery updates in a gradient-based pruning strategy with grounded interpretation. We provide the first theoretical analysis of the robustness of structured pruning and provably show that under the impact of weight perturbation, gradient-based pruning is more robust than activation-based pruning with respect to overall loss. Empirically, PrunedLoRA consistently outperforms LoRA and its variants across supervised fine-tuning tasks in mathematical reasoning, code generation, and natural language understanding, and it also demonstrates advantages over existing structured pruning methods across diverse sparsity levels.
♻ ☆ Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training NeurIPS 2025
As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets-such as cosine learning rate schedules-are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the "river" structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this behavior, we conduct a theoretical and empirical analysis of SF dynamics, revealing that it implicitly performs weight averaging without memory overhead. Guided by this analysis, we propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes, addressing key limitations of the original method. Together, these results establish SF as a practical, scalable, and theoretically grounded approach for language model training.
comment: Published at NeurIPS 2025
♻ ☆ Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime
Adam [Kingma and Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with $\ell_\infty$-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and we show that its bias can deviate from the full-batch behavior. To illustrate this, we construct a class of structured datasets where incremental Adam provably converges to the $\ell_2$-max-margin classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch Adam. For general datasets, we develop a proxy algorithm that captures the limiting behavior of incremental Adam as $\beta_2 \to 1$ and we characterize its convergence direction via a data-dependent dual fixed-point formulation. Finally, we prove that, unlike Adam, Signum [Bernstein et al., 2018] converges to the $\ell_\infty$-max-margin classifier for any batch size by taking $\beta$ close enough to 1. Overall, our results highlight that the implicit bias of Adam crucially depends on both the batching scheme and the dataset, while Signum remains invariant.
comment: 50 pages
♻ ☆ Survey Transfer Learning: Recycling Data with Silicon Responses SP
As researchers increasingly turn to large language models (LLMs) to generate synthetic survey data, less attention has been paid to alternative AI paradigms given environmental costs of LLMs. This paper introduces Survey Transfer Learning (STL), which develops transfer learning paradigms from computer science for survey research to recycle existing survey data and generate empirically grounded silicon responses. Inspired by political behavior theory, STL leverages shared demographic variables with high predictive power in a polarized American context to transfer knowledge across surveys. Using a neural network pre-trained on the Cooperative Election Study (CES) 2020, freezing early layers to preserve learned structure, and fine-tuning top layers on the American National Election Studies (ANES) 2020, STL generates silicon responses CES 2022 and in held-out ANES 2020 data with accuracy rates of up to 93 percent. Results show that STL outperforms LLMs, especially on sensitive measures such as racial resentment. While LLMs silicon samples are costly and opaque, STL generates empirically grounded silicon responses with high individual-level accuracy, potentially helping to mitigate key challenges in social science and the polling industry.
comment: Revised and expanded version (v2). 32 pages, 11 figures. Under review at Political Analysis. Presented at SPSA 2025 (Political Methodology Panel)
♻ ☆ Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Decoder-Only Transformers
The Query, Key, Value weight triplet is a building block of current attention mechanisms in state-of-the-art LLMs. We theoretically investigate whether this triplet can be reduced, proving under simplifying assumptions that the Query weights are redundant, thereby reducing the number of non-embedding/lm-head parameters by over 8%. We validate the theory on full-complexity GPT-3 small architectures (with layer normalization, skip connections, and weight decay) trained from scratch, demonstrating that the reduced model achieves comparable validation loss to standard baselines. These findings motivate the investigation of the Query weight redundancy at scale.
♻ ☆ Lattice Boltzmann Model for Learning Real-World Pixel Dynamicity NeurIPS 2025
This work proposes the Lattice Boltzmann Model (LBM) to learn real-world pixel dynamicity for visual tracking. LBM decomposes visual representations into dynamic pixel lattices and solves pixel motion states through collision-streaming processes. Specifically, the high-dimensional distribution of the target pixels is acquired through a multilayer predict-update network to estimate the pixel positions and visibility. The predict stage formulates lattice collisions among the spatial neighborhood of target pixels and develops lattice streaming within the temporal visual context. The update stage rectifies the pixel distributions with online visual representations. Compared with existing methods, LBM demonstrates practical applicability in an online and real-time manner, which can efficiently adapt to real-world visual tracking tasks. Comprehensive evaluations of real-world point tracking benchmarks such as TAP-Vid and RoboTAP validate LBM's efficiency. A general evaluation of large-scale open-world object tracking benchmarks such as TAO, BFT, and OVT-B further demonstrates LBM's real-world practicality.
comment: NeurIPS 2025. Project page: https://george-zhuang.github.io/lbm/
♻ ☆ Targeted Attack Improves Protection against Unauthorized Diffusion Customization ICLR 2025
Diffusion models build a new milestone for image generation yet raising public concerns, for they can be fine-tuned on unauthorized images for customization. Protection based on adversarial attacks rises to encounter this unauthorized diffusion customization, by adding protective watermarks to images and poisoning diffusion models. However, current protection, leveraging untargeted attacks, does not appear to be effective enough. In this paper, we propose a simple yet effective improvement for the protection against unauthorized diffusion customization by introducing targeted attacks. We show that by carefully selecting the target, targeted attacks significantly outperform untargeted attacks in poisoning diffusion models and degrading the customization image quality. Extensive experiments validate the superiority of our method on two mainstream customization methods of diffusion models, compared to existing protections. To explain the surprising success of targeted attacks, we delve into the mechanism of attack-based protections and propose a hypothesis based on our observation, which enhances the comprehension of attack-based protections. To the best of our knowledge, we are the first to both reveal the vulnerability of diffusion models to targeted attacks and leverage targeted attacks to enhance protection against unauthorized diffusion customization. Our code is available on GitHub: https://github.com/psyker-team/mist-v2.
comment: ICLR 2025 (Spotlight)
Machine Learning 71
♻ ☆ Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models
Diffusion Large Language Models (DLLMs) have emerged as a new paradigm of language modeling beyond autoregressive next-token prediction. Thanks to their bidirectional attention mechanism, DLLMs are more capable of capturing the connection of context, and thus show unique advantages in challenges like the famous "reversal curse" or learning under data-constrained scenarios. In addition, taking advantage of their inherent modeling foundations, DLLMs have the great potential of efficient inference with parallel decoding algorithms, which enable multi-token prediction per step. However, the high generation quality often requires the number of decoding steps equal to the sequence length, which performs a one-token-per-step decoding, and existing parallel decoding algorithms, which yield suboptimal decoding paths, bring inference speedup at the cost of non-negligible performance degradation. To overcome this challenge, we introduce Free Draft-and-Verification (FreeDave), a novel fast decoding algorithm tailored for DLLMs that achieves lossless parallel decoding without any model modification or extra modules. Specifically, we propose an algorithm of parallel-decoded candidate generation and verification, which is theoretically guaranteed to use the fewest model forward calls to reproduce the same sequence generated by static decoding when enough computation and memory budget is provided. By extensive evaluations on math reasoning and code generation benchmarks across different DLLMs, FreeDave is proven to boost the inference throughput up to $3.78\times$ without performance degradation.
♻ ☆ Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks
Large language models remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. Defending against novel jailbreaks represents a critical challenge in AI safety. Adversarial training -- designed to make models robust against worst-case perturbations -- has been the dominant paradigm for adversarial robustness. However, due to optimization challenges and difficulties in defining realistic threat models, adversarial training methods often fail on newly developed jailbreaks in practice. This paper proposes a new paradigm for improving robustness against unseen jailbreaks, centered on the Adversarial D\'ej\`a Vu hypothesis: novel jailbreaks are not fundamentally new, but largely recombinations of adversarial skills from previous attacks. We study this hypothesis through a large-scale analysis of 32 attack papers published over two years. Using an automated pipeline, we extract and compress adversarial skills into a sparse dictionary of primitives, with LLMs generating human-readable descriptions. Our analysis reveals that unseen attacks can be effectively explained as sparse compositions of earlier skills, with explanatory power increasing monotonically as skill coverage grows. Guided by this insight, we introduce Adversarial Skill Compositional Training (ASCoT), which trains on diverse compositions of skill primitives rather than isolated attack instances. ASCoT substantially improves robustness to unseen attacks, including multi-turn jailbreaks, while maintaining low over-refusal rates. We also demonstrate that expanding adversarial skill coverage, not just data scale, is key to defending against novel attacks. \textcolor{red}{\textbf{Warning: This paper contains content that may be harmful or offensive in nature.
♻ ☆ CausalARC: Abstract Reasoning with Causal World Models
On-the-fly reasoning often requires adaptation to novel problems under limited data and distribution shift. This work introduces CausalARC: an experimental testbed for AI reasoning in low-data and out-of-distribution regimes, modeled after the Abstraction and Reasoning Corpus (ARC). Each CausalARC reasoning task is sampled from a fully specified causal world model, formally expressed as a structural causal model. Principled data augmentations provide observational, interventional, and counterfactual feedback about the world model in the form of few-shot, in-context learning demonstrations. As a proof-of-concept, we illustrate the use of CausalARC for four language model evaluation settings: (1) abstract reasoning with test-time training, (2) counterfactual reasoning with in-context learning, (3) program synthesis, and (4) causal discovery with logical reasoning. Within- and between-model performance varied heavily across tasks, indicating room for significant improvement in language model reasoning.
comment: Peer-reviewed workshop paper
♻ ☆ LLM Strategic Reasoning: Agentic Study through Behavioral Game Theory NeurIPS 2025
Strategic decision-making involves interactive reasoning where agents adapt their choices in response to others, yet existing evaluations of large language models (LLMs) often emphasize Nash Equilibrium (NE) approximation, overlooking the mechanisms driving their strategic choices. To bridge this gap, we introduce an evaluation framework grounded in behavioral game theory, disentangling reasoning capability from contextual effects. Testing 22 state-of-the-art LLMs, we find that GPT-o3-mini, GPT-o1, and DeepSeek-R1 dominate most games yet also demonstrate that the model scale alone does not determine performance. In terms of prompting enhancement, Chain-of-Thought (CoT) prompting is not universally effective, as it increases strategic reasoning only for models at certain levels while providing limited gains elsewhere. Additionally, we investigate the impact of encoded demographic features on the models, observing that certain assignments impact the decision-making pattern. For instance, GPT-4o shows stronger strategic reasoning with female traits than males, while Gemma assigns higher reasoning levels to heterosexual identities compared to other sexual orientations, indicating inherent biases. These findings underscore the need for ethical standards and contextual alignment to balance improved reasoning with fairness.
comment: Accepted by NeurIPS 2025
♻ ☆ Exploring the Synergy of Quantitative Factors and Newsflow Representations from Large Language Models for Stock Return Prediction
In quantitative investing, return prediction supports various tasks, including stock selection, portfolio optimization, and risk management. Quantitative factors, such as valuation, quality, and growth, capture various characteristics of stocks. Unstructured financial data, like news and transcripts, has attracted growing attention, driven by recent advances in large language models (LLMs). This paper examines effective methods for leveraging multimodal factors and newsflow in return prediction and stock selection. First, we introduce a fusion learning framework to learn a unified representation from factors and newsflow representations generated by an LLM. Within this framework, we compare three representative methods: representation combination, representation summation, and attentive representations. Next, building on empirical observations from fusion learning, we explore the mixture model that adaptively combines predictions made by single modalities and their fusion. To mitigate the training instability observed in the mixture model, we introduce a decoupled training approach with theoretical insights. Finally, our experiments on real investment universes yield several insights into effective multimodal modeling of factors and news for stock return prediction and selection.
♻ ☆ Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems? AACL
The rapid escalation from elementary school-level to frontier problems of the difficulty for LLM benchmarks in recent years have weaved a miracle for researchers that we are only inches away from surpassing human intelligence. However, is the LLMs' remarkable reasoning ability indeed comes from true intelligence by human standards, or are they simply reciting solutions witnessed during training at an Internet level? To study this problem, we propose RoR-Bench, a novel, multi-modal benchmark for detecting LLM's recitation behavior when asked simple reasoning problems but with conditions subtly shifted, and conduct empirical analysis on our benchmark. Surprisingly, we found existing cutting-edge LLMs unanimously exhibits extremely severe recitation behavior; by changing one phrase in the condition, top models such as OpenAI-o1 and DeepSeek-R1 can suffer 60 percent performance loss on elementary school-level arithmetic and reasoning problems. Such findings are a wake-up call to the LLM community that compels us to re-evaluate the true intelligence level of cutting-edge LLMs.
comment: 24 pages, 3 figures, 13 tables. The paper is accepted at AACL-IJCNLP 2025 (main track), and the latest version adds modifications in camera-ready
♻ ☆ Nirvana AI Governance: How AI Policymaking Is Committing Three Old Fallacies
This research applies Harold Demsetz's concept of the nirvana approach to the realm of AI governance and debunks three common fallacies in various AI policy proposals--"the grass is always greener on the other side," "free lunch," and "the people could be different." Through this, I expose fundamental flaws in the current AI regulatory proposal. First, some commentators intuitively believe that people are more reliable than machines and that government works better in risk control than companies' self-regulation, but they do not fully compare the differences between the status quo and the proposed replacements. Second, when proposing some regulatory tools, some policymakers and researchers do not realize and even gloss over the fact that harms and costs are also inherent in their proposals. Third, some policy proposals are initiated based on a false comparison between the AI-driven world, where AI does lead to some risks, and an entirely idealized world, where no risk exists at all. However, the appropriate approach is to compare the world where AI causes risks to the real world where risks are everywhere, but people can live well with these risks. The prevalence of these fallacies in AI governance underscores a broader issue: the tendency to idealize potential solutions without fully considering their real-world implications. This idealization can lead to regulatory proposals that are not only impractical but potentially harmful to innovation and societal progress.
comment: 9 pages
♻ ☆ Topic Analysis with Side Information: A Neural-Augmented LDA Approach
Traditional topic models such as Latent Dirichlet Allocation (LDA) have been widely used to uncover latent structures in text corpora, but they often struggle to integrate auxiliary information such as metadata, user attributes, or document labels. These limitations restrict their expressiveness, personalization, and interpretability. To address this, we propose nnLDA, a neural-augmented probabilistic topic model that dynamically incorporates side information through a neural prior mechanism. nnLDA models each document as a mixture of latent topics, where the prior over topic proportions is generated by a neural network conditioned on auxiliary features. This design allows the model to capture complex nonlinear interactions between side information and topic distributions that static Dirichlet priors cannot represent. We develop a stochastic variational Expectation-Maximization algorithm to jointly optimize the neural and probabilistic components. Across multiple benchmark datasets, nnLDA consistently outperforms LDA and Dirichlet-Multinomial Regression in topic coherence, perplexity, and downstream classification. These results highlight the benefits of combining neural representation learning with probabilistic topic modeling in settings where side information is available.
♻ ☆ Tricks and Plug-ins for Gradient Boosting with Transformers
Transformer architectures dominate modern NLP but often demand heavy computational resources and intricate hyperparameter tuning. To mitigate these challenges, we propose a novel framework, BoostTransformer, that augments transformers with boosting principles through subgrid token selection and importance-weighted sampling. Our method incorporates a least square boosting objective directly into the transformer pipeline, enabling more efficient training and improved performance. Across multiple fine-grained text classification benchmarks, BoostTransformer demonstrates both faster convergence and higher accuracy, surpassing standard transformers while minimizing architectural search overhead.
comment: Update the title of the pdf file only. The old version has a different title to the arxiv abstract. arXiv admin note: substantial text overlap with arXiv:2203.00761
♻ ☆ Tricks and Plug-ins for Gradient Boosting in Image Classification
Convolutional Neural Networks (CNNs) have achieved remarkable success across a wide range of machine learning tasks by leveraging hierarchical feature learning through deep architectures. However, the large number of layers and millions of parameters often make CNNs computationally expensive to train, requiring extensive time and manual tuning to discover optimal architectures. In this paper, we introduce a novel framework for boosting CNN performance that integrates dynamic feature selection with the principles of BoostCNN. Our approach incorporates two key strategies: subgrid selection and importance sampling, to guide training toward informative regions of the feature space. We further develop a family of algorithms that embed boosting weights directly into the network training process using a least squares loss formulation. This integration not only alleviates the burden of manual architecture design but also enhances accuracy and efficiency. Experimental results across several fine-grained classification benchmarks demonstrate that our boosted CNN variants consistently outperform conventional CNNs in both predictive performance and training speed.
comment: 10 pages, 5 figures. Experimental results reported on CIFAR-10, SVHN, and ImageNetSub datasets. arXiv admin note: substantial text overlap with arXiv:2203.00761
♻ ☆ ERA-Solver: Error-Robust Adams Solver for Fast Sampling of Diffusion Probabilistic Models
Though denoising diffusion probabilistic models (DDPMs) have achieved remarkable generation results, the low sampling efficiency of DDPMs still limits further applications. Since DDPMs can be formulated as diffusion ordinary differential equations (ODEs), various fast sampling methods can be derived from solving diffusion ODEs. However, we notice that previous fast sampling methods with fixed analytical form are not able to robust with the various error patterns in the noise estimated from pretrained diffusion models. In this work, we construct an error-robust Adams solver (ERA-Solver), which utilizes the implicit Adams numerical method that consists of a predictor and a corrector. Different from the traditional predictor based on explicit Adams methods, we leverage a Lagrange interpolation function as the predictor, which is further enhanced with an error-robust strategy to adaptively select the Lagrange bases with lower errors in the estimated noise. The proposed solver can be directly applied to any pretrained diffusion models, without extra training. Experiments on Cifar10, CelebA, LSUN-Church, and ImageNet 64 x 64 (conditional) datasets demonstrate that our proposed ERA-Solver achieves 3.54, 5.06, 5.02, and 5.11 Frechet Inception Distance (FID) for image generation, with only 10 network evaluations.
♻ ☆ AI-Generated Video Detection via Perceptual Straightening NeurIPS 2025
The rapid advancement of generative AI enables highly realistic synthetic videos, posing significant challenges for content authentication and raising urgent concerns about misuse. Existing detection methods often struggle with generalization and capturing subtle temporal inconsistencies. We propose ReStraV(Representation Straightening Video), a novel approach to distinguish natural from AI-generated videos. Inspired by the "perceptual straightening" hypothesis -- which suggests real-world video trajectories become more straight in neural representation domain -- we analyze deviations from this expected geometric property. Using a pre-trained self-supervised vision transformer (DINOv2), we quantify the temporal curvature and stepwise distance in the model's representation domain. We aggregate statistics of these measures for each video and train a classifier. Our analysis shows that AI-generated videos exhibit significantly different curvature and distance patterns compared to real videos. A lightweight classifier achieves state-of-the-art detection performance (e.g., 97.17% accuracy and 98.63% AUROC on the VidProM benchmark), substantially outperforming existing image- and video-based methods. ReStraV is computationally efficient, it is offering a low-cost and effective detection solution. This work provides new insights into using neural representation geometry for AI-generated video detection.
comment: Advances in Neural Information Processing Systems 38 (NeurIPS 2025)
♻ ☆ STACKFEED: Structured Textual Actor-Critic Knowledge Base Editing with FeedBack
Large Language Models (LLMs) often generate incorrect or outdated information, especially in low-resource settings or when dealing with private data. To address this, Retrieval-Augmented Generation (RAG) uses external knowledge bases (KBs), but these can also suffer from inaccuracies. We introduce STACKFEED, a novel Structured Textual Actor-Critic Knowledge base editing with FEEDback approach that iteratively refines the KB based on expert feedback using a multi-actor, centralized critic reinforcement learning framework. STACKFEED defines a ReACT actor agent on each document to perform structured edits based on document specific targeted instructions. Experimental results showcase that STACKFEED significantly improves KB quality and performance of the RAG system. We evaluate STACKFEED on low-resource programming problems, modified python packaged and factual question-answering tasks.
♻ ☆ A Study in Dataset Distillation for Image Super-Resolution
Dataset distillation aims to compress large datasets into compact yet highly informative subsets that preserve the training behavior of the original data. While this concept has gained traction in classification, its potential for image Super-Resolution (SR) remains largely untapped. In this work, we conduct the first systematic study of dataset distillation for SR, evaluating both pixel- and latent-space formulations. We show that a distilled dataset, occupying only 8.88% of the original size, can train SR models that retain nearly the same reconstruction fidelity as those trained on full datasets. Furthermore, we analyze how initialization strategies and distillation objectives affect efficiency, convergence, and visual quality. Our findings highlight the feasibility of SR dataset distillation and establish foundational insights for memory- and compute-efficient generative restoration models.
♻ ☆ Optimal Execution with Reinforcement Learning
This study investigates the development of an optimal execution strategy through reinforcement learning, aiming to determine the most effective approach for traders to buy and sell inventory within a finite time horizon. Our proposed model leverages input features derived from the current state of the limit order book and operates at a high frequency to maximize control. To simulate this environment and overcome the limitations associated with relying on historical data, we utilize the multi-agent market simulator ABIDES, which provides a diverse range of depth levels within the limit order book. We present a custom MDP formulation followed by the results of our methodology and benchmark the performance against standard execution strategies. Results show that the reinforcement learning agent outperforms standard strategies and offers a practical foundation for real-world trading applications.
comment: 8 pages
♻ ☆ Knowledge-guided Continual Learning for Behavioral Analytics Systems
User behavior on online platforms is evolving, reflecting real-world changes in how people post, whether it's helpful messages or hate speech. Models that learn to capture this content can experience a decrease in performance over time due to data drift, which can lead to ineffective behavioral analytics systems. However, fine-tuning such a model over time with new data can be detrimental due to catastrophic forgetting. Replay-based approaches in continual learning offer a simple yet efficient method to update such models, minimizing forgetting by maintaining a buffer of important training instances from past learned tasks. However, the main limitation of this approach is the fixed size of the buffer. External knowledge bases can be utilized to overcome this limitation through data augmentation. We propose a novel augmentation-based approach to incorporate external knowledge in the replay-based continual learning framework. We evaluate several strategies with three datasets from prior studies related to deviant behavior classification to assess the integration of external knowledge in continual learning and demonstrate that augmentation helps outperform baseline replay-based approaches.
comment: This is a preprint of the accepted paper at IEEE CogMI 2025 - The 7th IEEE International Conference on Cognitive Machine Intelligence
♻ ☆ Learning Repetition-Invariant Representations for Polymer Informatics NeurIPS 2025
Polymers are large macromolecules composed of repeating structural units known as monomers and are widely applied in fields such as energy storage, construction, medicine, and aerospace. However, existing graph neural network methods, though effective for small molecules, only model the single unit of polymers and fail to produce consistent vector representations for the true polymer structure with varying numbers of units. To address this challenge, we introduce Graph Repetition Invariance (GRIN), a novel method to learn polymer representations that are invariant to the number of repeating units in their graph representations. GRIN integrates a graph-based maximum spanning tree alignment with repeat-unit augmentation to ensure structural consistency. We provide theoretical guarantees for repetition-invariance from both model and data perspectives, demonstrating that three repeating units are the minimal augmentation required for optimal invariant representation learning. GRIN outperforms state-of-the-art baselines on both homopolymer and copolymer benchmarks, learning stable, repetition-invariant representations that generalize effectively to polymer chains of unseen sizes.
comment: Accepted to NeurIPS 2025
♻ ☆ SLIP: Securing LLMs IP Using Weights Decomposition
Large language models (LLMs) have recently seen widespread adoption in both academia and industry. As these models grow, they become valuable intellectual property (IP), reflecting substantial investments by their owners. The high cost of cloud-based deployment has spurred interest in running models on edge devices, but this risks exposing parameters to theft and unauthorized use. Existing approaches to protect model IP on the edge trade off practicality, accuracy, or deployment requirements. We introduce SLIP, a hybrid inference algorithm designed to protect edge-deployed models from theft. SLIP is, to our knowledge, the first hybrid protocol that is both practical for real-world applications and provably secure, while incurring zero accuracy degradation and minimal latency overhead. It partitions the model across two computing resources: one secure but expensive, and one cost-effective but vulnerable. Using matrix decomposition, the secure resource retains the most sensitive portion of the model's IP while performing only a small fraction of the computation; the vulnerable resource executes the remainder. The protocol includes security guarantees that prevent attackers from using the partition to infer the protected information. Finally, we present experimental results that demonstrate the robustness and effectiveness of our method, positioning it as a compelling solution for protecting LLMs.
♻ ☆ Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought NeurIPS 2025
Large Language Models (LLMs) have demonstrated remarkable performance in many applications, including challenging reasoning problems via chain-of-thoughts (CoTs) techniques that generate ``thinking tokens'' before answering the questions. While existing theoretical works demonstrate that CoTs with discrete tokens boost the capability of LLMs, recent work on continuous CoTs lacks a theoretical understanding of why it outperforms discrete counterparts in various reasoning tasks such as directed graph reachability, a fundamental graph reasoning problem that includes many practical domain applications as special cases. In this paper, we prove that a two-layer transformer with $D$ steps of continuous CoTs can solve the directed graph reachability problem, where $D$ is the diameter of the graph, while the best known result of constant-depth transformers with discrete CoTs requires $O(n^2)$ decoding steps where $n$ is the number of vertices ($D
comment: 26 pages, 7 figures, NeurIPS 2025
♻ ☆ Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency
Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, but they remain susceptible to hallucinations--generating content that appears plausible but contains factual inaccuracies. We present Finch-Zk, a black-box framework that leverages fine-grained cross-model consistency to detect and mitigate hallucinations in LLM outputs without requiring external knowledge sources. Finch-Zk introduces two key innovations: 1) a cross-model consistency checking strategy that reveals fine-grained inaccuracies by comparing responses generated by diverse models from semantically-equivalent prompts, and 2) a targeted mitigation technique that applies precise corrections to problematic segments while preserving accurate content. Experiments on the FELM dataset show Finch-Zk improves hallucination detection F1 scores by 6-39\% compared to existing approaches. For mitigation, Finch-Zk achieves up to 9 absolute percentage points improvement in answer accuracy on the GPQA-diamond dataset when applied to state-of-the-art models like Llama 4 Maverick and Claude 4 Sonnet. Extensive evaluation on multiple datasets demonstrates that Finch-Zk provides a practical, deployment-ready safeguard for enhancing factual reliability in production LLM systems.
♻ ☆ Adversarial Distilled Retrieval-Augmented Guarding Model for Online Malicious Intent Detection
With the deployment of Large Language Models (LLMs) in interactive applications, online malicious intent detection has become increasingly critical. However, existing approaches fall short of handling diverse and complex user queries in real time. To address these challenges, we introduce ADRAG (Adversarial Distilled Retrieval-Augmented Guard), a two-stage framework for robust and efficient online malicious intent detection. In the training stage, a high-capacity teacher model is trained on adversarially perturbed, retrieval-augmented inputs to learn robust decision boundaries over diverse and complex user queries. In the inference stage, a distillation scheduler transfers the teacher's knowledge into a compact student model, with a continually updated knowledge base collected online. At deployment, the compact student model leverages top-K similar safety exemplars retrieved from the online-updated knowledge base to enable both online and real-time malicious query detection. Evaluations across ten safety benchmarks demonstrate that ADRAG, with a 149M-parameter model, achieves 98.5% of WildGuard-7B's performance, surpasses GPT-4 by 3.3% and Llama-Guard-3-8B by 9.5% on out-of-distribution detection, while simultaneously delivering up to 5.6x lower latency at 300 queries per second (QPS) in real-time applications.
♻ ☆ AdFair-CLIP: Adversarial Fair Contrastive Language-Image Pre-training for Chest X-rays MICCAI 2025
Contrastive Language-Image Pre-training (CLIP) models have demonstrated superior performance across various visual tasks including medical image classification. However, fairness concerns, including demographic biases, have received limited attention for CLIP models. This oversight leads to critical issues, particularly those related to race and gender, resulting in disparities in diagnostic outcomes and reduced reliability for underrepresented groups. To address these challenges, we introduce AdFair-CLIP, a novel framework employing adversarial feature intervention to suppress sensitive attributes, thereby mitigating spurious correlations and improving prediction fairness. We conduct comprehensive experiments on chest X-ray (CXR) datasets, and show that AdFair-CLIP significantly enhances both fairness and diagnostic accuracy, while maintaining robust generalization in zero-shot and few-shot scenarios. These results establish new benchmarks for fairness-aware learning in CLIP-based medical diagnostic models, particularly for CXR analysis.
comment: This preprint has been accepted by MICCAI 2025
♻ ☆ Schrödinger Bridge Matching for Tree-Structured Costs and Entropic Wasserstein Barycentres NeurIPS 2025
Recent advances in flow-based generative modelling have provided scalable methods for computing the Schr\"odinger Bridge (SB) between distributions, a dynamic form of entropy-regularised Optimal Transport (OT) for the quadratic cost. The successful Iterative Markovian Fitting (IMF) procedure solves the SB problem via sequential bridge-matching steps, presenting an elegant and practical approach with many favourable properties over the more traditional Iterative Proportional Fitting (IPF) procedure. Beyond the standard setting, optimal transport can be generalised to the multi-marginal case in which the objective is to minimise a cost defined over several marginal distributions. Of particular importance are costs defined over a tree structure, from which Wasserstein barycentres can be recovered as a special case. In this work, we extend the IMF procedure to solve for the tree-structured SB problem. Our resulting algorithm inherits the many advantages of IMF over IPF approaches in the tree-based setting. In the case of Wasserstein barycentres, our approach can be viewed as extending the widely used fixed-point approach to use flow-based entropic OT solvers, while requiring only simple bridge-matching steps at each iteration.
comment: NeurIPS 2025
♻ ☆ SplashNet: Split-and-Share Encoders for Accurate and Efficient Typing with Surface Electromyography
Surface electromyography (sEMG) at the wrists could enable natural, keyboard-free text entry, yet the state-of-the-art emg2qwerty baseline still misrecognizes $51.8\%$ of characters in the zero-shot setting on unseen users and $7.0\%$ after user-specific fine-tuning. We trace many of these errors to mismatched cross-user signal statistics, fragile reliance on high-order feature dependencies, and the absence of architectural inductive biases aligned with the bilateral nature of typing. To address these issues, we introduce three simple modifications: (i) Rolling Time Normalization, which adaptively aligns input distributions across users; (ii) Aggressive Channel Masking, which encourages reliance on low-order feature combinations more likely to generalize across users; and (iii) a Split-and-Share encoder that processes each hand independently with weight-shared streams to reflect the bilateral symmetry of the neuromuscular system. Combined with a five-fold reduction in spectral resolution ($33\!\rightarrow\!6$ frequency bands), these components yield a compact Split-and-Share model, SplashNet-mini, which uses only $\tfrac14$ the parameters and $0.6\times$ the FLOPs of the baseline while reducing character-error rate (CER) to $36.4\%$ zero-shot and $5.9\%$ after fine-tuning. An upscaled variant, SplashNet ($\tfrac12$ the parameters, $1.15\times$ the FLOPs of the baseline), further lowers error to $35.7\%$ and $5.5\%$, representing relative improvements of $31\%$ and $21\%$ in the zero-shot and fine-tuned settings, respectively. SplashNet therefore establishes a new state of the art without requiring additional data.
♻ ☆ Application of Langevin Dynamics to Advance the Quantum Natural Gradient Optimization Algorithm
A Quantum Natural Gradient (QNG) algorithm for optimization of variational quantum circuits has been proposed recently. In this study, we employ the Langevin equation with a QNG stochastic force to demonstrate that its discrete-time solution gives a generalized form of the above-specified algorithm, which we call Momentum-QNG. Similar to other optimization algorithms with the momentum term, such as the Stochastic Gradient Descent with momentum, RMSProp with momentum and Adam, Momentum-QNG is more effective to escape local minima and plateaus in the variational parameter space and, therefore, demonstrates an improved performance compared to the basic QNG. In this paper we benchmark Momentum-QNG together with the basic QNG, Adam and Momentum optimizers and explore its convergence behaviour. Among the benchmarking problems studied, the best result is obtained for the quantum Sherrington-Kirkpatrick model in the strong spin glass regime. Our open-source code is available at https://github.com/borbysh/Momentum-QNG
comment: 11 pages, 3 figures
♻ ☆ CARMA: Collocation-Aware Resource Manager
GPUs running deep learning (DL) workloads are frequently underutilized. Collocating multiple DL training tasks on the same GPU can improve utilization but introduces two key risks: (1) out-of-memory (OOM) crashes for newly scheduled tasks, and (2) severe performance interference among co-running tasks, which can negate any throughput gains. These issues reduce system robustness, quality of service, and energy efficiency. We present CARMA, a task-level, collocation-aware resource management system for the server-scale. CARMA addresses collocation challenges via (1) fine-grained monitoring and bookkeeping of GPUs and a collocation risk analysis that filters out the high-risk GPUs; (2) task placement policies that cap GPU utilization to avoid OOMs and limit interference; (3) integration of GPU memory need estimators for DL tasks to minimize OOMs during collocation; and (4) a lightweight recovery method that relaunches jobs crashed due to OOMs. Our evaluation on a DL training workload derived from real-world traces shows that CARMA uses GPUs more efficiently by making more informed collocation decisions: for the best-performing collocation policy, CARMA increases GPU streaming multiprocessor (SM) utilization by 54%, the parallelism achieved per SM by 61%, and memory use by 62%. This results in a $\sim$35% and $\sim$15% reduction in the end-to-end execution time (makespan) and GPU energy consumption, respectively, for this workload.
♻ ☆ Audio Driven Real-Time Facial Animation for Social Telepresence SIGGRAPH
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency, designed for social interactions in virtual reality for anyone. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time, which are then decoded as photorealistic 3D facial avatars. Leveraging the generative capabilities of diffusion models, we capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance (<15ms GPU time). Our novel architecture minimizes latency through two key innovations: an online transformer that eliminates dependency on future inputs and a distillation pipeline that accelerates iterative denoising into a single step. We further address critical design challenges in live scenarios for processing continuous audio signals frame-by-frame while maintaining consistent animation quality. The versatility of our framework extends to multimodal applications, including semantic modalities such as emotion conditions and multimodal sensors with head-mounted eye cameras on VR headsets. Experimental results demonstrate significant improvements in facial animation accuracy over existing offline state-of-the-art baselines, achieving 100 to 1000 times faster inference speed. We validate our approach through live VR demonstrations and across various scenarios such as multilingual speeches.
comment: SIGGRAPH Asia 2025. Project page: https://jiyewise.github.io/projects/AudioRTA
♻ ☆ Is Risk-Sensitive Reinforcement Learning Properly Resolved?
Due to the nature of risk management in learning applicable policies, risk-sensitive reinforcement learning (RSRL) has been realized as an important direction. RSRL is usually achieved by learning risk-sensitive objectives characterized by various risk measures, under the framework of distributional reinforcement learning. However, it remains unclear if the distributional Bellman operator properly optimizes the RSRL objective in the sense of risk measures. In this paper, we prove that the existing RSRL methods do not achieve unbiased optimization and cannot guarantee optimality or even improvements regarding risk measures over accumulated return distributions. To remedy this issue, we further propose a novel algorithm, namely Trajectory Q-Learning (TQL), for RSRL problems with provable policy improvement towards the optimal policy. Based on our new learning architecture, we are free to introduce a general and practical implementation for different risk measures to learn disparate risk-sensitive policies. In the experiments, we verify the learnability of our algorithm and show how our method effectively achieves better performances toward risk-sensitive objectives.
♻ ☆ On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study
Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This benefit is not due to length alone-injecting arbitrary redundancy into reasoning traces fails to help and can even hurt performance. Instead, we find that generalization correlates with the model's confidence in next-token prediction, suggesting that long, coherent, and locally incremental traces make the training signal easier to optimize.
♻ ☆ Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models ICML 2025
A core data-centric learning challenge is the identification of training samples that are detrimental to model performance. Influence functions serve as a prominent tool for this task and offer a robust framework for assessing training data influence on model predictions. Despite their widespread use, their high computational cost associated with calculating the inverse of the Hessian matrix pose constraints, particularly when analyzing large-sized deep models. In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection. This transformation not only presents a straightforward and Hessian-free formulation but also provides insights into the role of the gradient in sample impact. Through systematic empirical evaluations, we first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets. We then demonstrate its effectiveness in detecting mislabeled samples in vision models and selecting data samples for improving performance of natural language processing transformer models. We also extend its use to influential sample identification for fine-tuning Large Language Models.
comment: Accepted to ICML 2025 (Oral)
♻ ☆ Synthetic Data Generation with Lorenzetti for Time Series Anomaly Detection in High-Energy Physics Calorimeters
Anomaly detection in multivariate time series is crucial to ensure the quality of data coming from a physics experiment. Accurately identifying the moments when unexpected errors or defects occur is essential, yet challenging due to scarce labels, unknown anomaly types, and complex correlations across dimensions. To address the scarcity and unreliability of labelled data, we use the Lorenzetti Simulator to generate synthetic events with injected calorimeter anomalies. We then assess the sensitivity of several time series anomaly detection methods, including transformer-based and other deep learning models. The approach employed here is generic and applicable to different detector designs and defects.
comment: 4 pages, 2 figures, Submission to SciPost proceedings for EuCAIFCon 2025
♻ ☆ RL Fine-Tuning Heals OOD Forgetting in SFT
The two-stage fine-tuning paradigm of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has empirically shown better reasoning performance than one-stage SFT for the post-training of Large Language Models (LLMs). However, the evolution and mechanism behind the synergy of SFT and RL are still under-explored and inconclusive. In our study, we find the well-known claim "SFT memorizes, RL generalizes" is over-simplified, and discover that: (1) OOD performance peaks at the early stage of SFT and then declines (OOD forgetting), the best SFT checkpoint cannot be captured by training/test loss; (2) the subsequent RL stage does not generate fundamentally better OOD capability, instead it plays an \textbf{OOD restoration} role, recovering the lost reasoning ability during SFT; (3) The recovery ability has boundaries, \ie{} \textbf{if SFT trains for too short or too long, RL cannot recover the lost OOD ability;} (4) To uncover the underlying mechanisms behind the forgetting and restoration process, we employ SVD analysis on parameter matrices, manually edit them, and observe their impacts on model performance. Unlike the common belief that the shift of model capacity mainly results from the changes of singular values, we find that they are actually quite stable throughout fine-tuning. Instead, the OOD behavior strongly correlates with the \textbf{rotation of singular vectors}. Our findings re-identify the roles of SFT and RL in the two-stage fine-tuning and discover the rotation of singular vectors as the key mechanism. %reversing the rotations induced by SFT, which shows recovery from forgetting, whereas imposing the SFT parameter directions onto a RL-tuned model results in performance degradation. Code is available at https://github.com/xiaodanguoguo/RL_Heals_SFT
comment: 24 pages, 18 figures
♻ ☆ Transferring Linear Features Across Language Models With Model Stitching
In this work, we demonstrate that affine mappings between residual streams of language models is a cheap way to effectively transfer represented features between models. We apply this technique to transfer the weights of Sparse Autoencoders (SAEs) between models of different sizes to compare their representations. We find that small and large models learn similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring to a larger model at a FLOPs savings. In particular, using a small-to-large transferred SAE as initialization can lead to 50% cheaper training runs when training SAEs on larger models. Next, we show that transferred probes and steering vectors can effectively recover ground truth performance. Finally, we dive deeper into feature-level transferability, finding that semantic and structural features transfer noticeably differently while specific classes of functional features have their roles faithfully mapped. Overall, our findings illustrate similarities and differences in the linear representation spaces of small and large models and demonstrate a method for improving the training efficiency of SAEs.
♻ ☆ A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift
Transformer-based architectures have recently advanced the image reconstruction quality of super-resolution (SR) models. Yet, their scalability remains limited by quadratic attention costs and coarse patch embeddings that weaken pixel-level fidelity. We propose TaylorIR, a plug-and-play framework that enforces 1x1 patch embeddings for true pixel-wise reasoning and replaces conventional self-attention with TaylorShift, a Taylor-series-based attention mechanism enabling full token interactions with near-linear complexity. Across multiple SR benchmarks, TaylorIR delivers state-of-the-art performance while reducing memory consumption by up to 60%, effectively bridging the gap between fine-grained detail restoration and efficient transformer scaling.
♻ ☆ Knolling Bot: Teaching Robots the Human Notion of Tidiness NeurIPS 2025
For robots to truly collaborate and assist humans, they must understand not only logic and instructions, but also the subtle emotions, aesthetics, and feelings that define our humanity. Human art and aesthetics are among the most elusive concepts-often difficult even for people to articulate-and without grasping these fundamentals, robots will be unable to help in many spheres of daily life. Consider the long-promised robotic butler: automating domestic chores demands more than motion planning. It requires an internal model of cleanliness and tidiness-a challenge largely unexplored by AI. To bridge this gap, we propose an approach that equips domestic robots to perform simple tidying tasks via knolling, the practice of arranging scattered items into neat, space-efficient layouts. Unlike the uniformity of industrial settings, household environments feature diverse objects and highly subjective notions of tidiness. Drawing inspiration from NLP, we treat knolling as a sequential prediction problem and employ a transformer based model to forecast each object's placement. Our method learns a generalizable concept of tidiness, generates diverse solutions adaptable to varying object sets, and incorporates human preferences for personalized arrangements. This work represents a step forward in building robots that internalize human aesthetic sense and can genuinely co-create in our living spaces.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Creative AI Track
♻ ☆ Multivariate Gaussian Topic Modelling: A novel approach to discover topics with greater semantic coherence
An important aspect of text mining involves information retrieval in form of discovery of semantic themes (topics) from documents using topic modelling. While generative topic models like Latent Dirichlet Allocation (LDA) or Latent Semantic Analysis (LSA) elegantly model topics as probability distributions and are useful in identifying latent topics from large document corpora with minimal supervision, they suffer from difficulty in topic interpretability and reduced performance in shorter texts. Here we propose a novel Multivariate Gaussian Topic Model (MGTM). In this approach topics are presented as Multivariate Gaussian Distributions and documents as Gaussian Mixture Models. Applying EM algorithm on a document corpus, the various constituent Multivariate Gaussian distributions corresponding to the latent topics and their respective parameters are identified. Analysis of the parameters of each distribution helps identify the respective topic keywords, and from these key-words topic annotations are carried out. This approach is applied on 20 newsgroups dataset to demonstrate the interpretability benefits vis-`a-vis 4 other benchmark models. The effectiveness of this model in capturing the semantic theme of the topics with high interpretability is examined by calculating the topic coherence and comparing the coherence values with benchmark models. This model achieves a highest mean topic coherence (0.7) and median topic coherence (0.76) vis-`a-vis the benchmark models, demonstrating high effectiveness in identifying interpretable, semantically coherent topics.
comment: 22 pages
♻ ☆ Kimi Linear: An Expressive, Efficient Attention Architecture
We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
comment: Kimi Linear tech report
♻ ☆ Lorica: A Synergistic Fine-Tuning Framework for Advancing Personalized Adversarial Robustness CCS
The growing use of large pre-trained models in edge computing has made model inference on mobile clients both feasible and popular. Yet these devices remain vulnerable to adversarial attacks, threatening model robustness and security. Federated adversarial training (FAT) offers a promising solution by enhancing robustness while preserving client privacy. However, FAT often yields a generalized global model that struggles with heterogeneous client data, leading to limited personalization and significant communication overhead. In this paper, we propose \textit{Lorica}, a personalized synergistic adversarial training framework that delivers customized defense models through a two-phase process. In Phase 1, \textit{Lorica} applies LoRA-FA for local adversarial fine-tuning, enabling personalized robustness while reducing communication by uploading only LoRA-FA parameters. In Phase 2, a forward-gating selection strategy improves benign accuracy, further refining the personalized model. This yields tailored defense models that effectively balance robustness and accuracy. Extensive experiments on benchmark datasets demonstrate that \textit{Lorica} can achieve up to 68$\times$ improvements in communication efficiency compared to state-of-the-art algorithms, while achieving up to 29.9\% and 52.2\% enhancements in adversarial robustness and benign accuracy, respectively.
comment: Accepted by the ACM Conference on Computer and Communications Security (CCS) 2025
♻ ☆ THFlow: A Temporally Hierarchical Flow Matching Framework for 3D Peptide Design
Deep generative models provide a promising approach to de novo 3D peptide design. Most of them jointly model the distributions of peptide's position, orientation, and conformation, attempting to simultaneously converge to the target pocket. However, in the early stage of docking, optimizing conformation-only modalities such as rotation and torsion can be physically meaningless, as the peptide is initialized far from the protein pocket and no interaction field is present. We define this problem as the multimodal temporal inconsistency problem and claim it is a key factor contributing to low binding affinity in generated peptides. To address this challenge, we propose THFlow, a novel flow matching-based multimodal generative model that explicitly models the temporal hierarchy between peptide position and conformation. It employs a polynomial based conditional flow to accelerate positional convergence early on, and later aligns it with rotation and torsion for coordinated conformation refinement under the emerging interaction field. Additionally, we incorporate interaction-related features, such as polarity, to further enhance the model's understanding of peptide-protein binding. Extensive experiments demonstrate that THFlow outperforms existing methods in generating peptides with superior stability, affinity, and diversity, offering an effective and accurate solution for advancing peptide-based therapeutic development.
♻ ☆ Keep It Real: Challenges in Attacking Compression-Based Adversarial Purification NeurIPS 2025
Previous work has suggested that preprocessing images through lossy compression can defend against adversarial perturbations, but comprehensive attack evaluations have been lacking. In this paper, we construct strong white-box and adaptive attacks against various compression models and identify a critical challenge for attackers: high realism in reconstructed images significantly increases attack difficulty. Through rigorous evaluation across multiple attack scenarios, we demonstrate that compression models capable of producing realistic, high-fidelity reconstructions are substantially more resistant to our attacks. In contrast, low-realism compression models can be broken. Our analysis reveals that this is not due to gradient masking. Rather, realistic reconstructions maintaining distributional alignment with natural images seem to offer inherent robustness. This work highlights a significant obstacle for future adversarial attacks and suggests that developing more effective techniques to overcome realism represents an essential challenge for comprehensive security evaluation.
comment: Accepted to the Reliable ML from Unreliable Data workshop at NeurIPS 2025 (ReliableML@NeurIPS)
♻ ☆ SViM3D: Stable Video Material Diffusion for Single Image 3D Generation ICCV 2025
We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. Recently, video diffusion models have been successfully used to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or needs to be estimated in additional steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially varying PBR parameters and surface normals jointly with each generated view based on explicit camera control. This unique setup allows for relighting and generating a 3D asset using our model as neural prior. We introduce various mechanisms to this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games and other visual media.
comment: Accepted by International Conference on Computer Vision (ICCV 2025). Project page: http://svim3d.aengelhardt.com
♻ ☆ ADPO: Anchored Direct Preference Optimization
Direct Preference Optimization (DPO) has become a standard for aligning models with human feedback, yet its reliance on hard, pairwise preferences makes it brittle to annotator noise and distribution shift. We propose Anchored Direct Preference Optimization (ADPO), a generalized framework that learns from soft, listwise supervision by anchoring policy updates to a reference model. Our key theoretical contribution is to show that this anchoring mechanism imposes an implicit trust region on the policy update, enforced by the softmax Fisher information metric. This provides a robust geometric interpretation for both fixed and dynamic anchor strategies. Our central empirical finding is a task-dependent tradeoff between anchor update strategies. Through controlled experiments across twelve scenarios and two MuJoCo environments, we demonstrate that (1) for online exploration in noisy environments, a dynamic anchor that tracks the learning policy is superior, improving performance by 5 to 11 percent over a fixed anchor; and (2) for offline distillation, a fixed anchor pointing to the teacher policy is dramatically more effective, achieving returns of 206.7 on HalfCheetah-v5 (387 percent of teacher) and 65.4 on Hopper-v5 (61 percent of teacher), while reducing KL divergence to the teacher by up to 5000 times compared with standard knowledge distillation. These findings offer clear, practical guidance for selecting anchor strategies and establish ADPO as a robust, unified framework for preference learning. Larger models further amplify ADPO's benefits (0.718 vs. 0.416 at hidden dimension 256), suggesting that anchoring acts as an effective trust-region regularizer. We release code and configurations to facilitate reproducibility.
♻ ☆ MultiMatch: Multihead Consistency Regularization Matching for Semi-Supervised Text Classification EMNLP 2025
We introduce MultiMatch, a novel semi-supervised learning (SSL) algorithm combining the paradigms of co-training and consistency regularization with pseudo-labeling. At its core, MultiMatch features a pseudo-label weighting module designed for selecting and filtering pseudo-labels based on head agreement and model confidence, and weighting them according to the perceived classification difficulty. This novel module enhances and unifies three existing techniques -- heads agreement from Multihead Co-training, self-adaptive thresholds from FreeMatch, and Average Pseudo-Margins from MarginMatch -- resulting in a holistic approach that improves robustness and performance in SSL settings. Experimental results on benchmark datasets highlight the superior performance of MultiMatch, i.e., MultiMatch achieves state-of-the-art results on 8 out of 10 setups from 5 natural language processing datasets and ranks first according to the Friedman test among 21 methods. Furthermore, MultiMatch demonstrates exceptional robustness in highly imbalanced settings, outperforming the second-best approach by 3.26%, a critical advantage for real-world text classification tasks. Our code is available on GitHub.
comment: This is the camera-ready version of the paper, accepted for publication in the Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)
♻ ☆ Exploring Kolmogorov-Arnold Networks for Interpretable Time Series Classification
Time series classification is a relevant step supporting decision-making processes in various domains, and deep neural models have shown promising performance in this respect. Despite significant advancements in deep learning, the theoretical understanding of how and why complex architectures function remains limited, prompting the need for more interpretable models. Recently, the Kolmogorov-Arnold Networks (KANs) have been proposed as a more interpretable alternative to deep learning. While KAN-related research is significantly rising, to date, the study of KAN architectures for time series classification has been limited. In this paper, we aim to conduct a comprehensive and robust exploration of the KAN architecture for time series classification utilising 117 datasets from UCR benchmark archive, from multiple different domains. More specifically, we investigate a) the transferability of reference architectures designed for regression to classification tasks, b) identifying the hyperparameter and implementation configurations for an architecture that best generalizes across 117 datasets, c) the associated complexity trade-offs and d) evaluate KANs interpretability. Our results demonstrate that (1) the Efficient KAN outperforms MLPs in both performance and training times, showcasing its suitability for classification tasks. (2) Efficient KAN exhibits greater stability than the original KAN across grid sizes, depths, and layer configurations, especially when lower learning rates are employed. (3) KAN achieves competitive accuracy compared to state-of-the-art models such as HIVE-COTE2 and InceptionTime, while maintaining smaller architectures and faster training times, highlighting its favorable balance of performance and transparency. (4) The interpretability of the KAN model, as confirmed by SHAP analysis, reinforces its capacity for transparent decision-making.
♻ ☆ Exploring the Hidden Capacity of LLMs for One-Step Text Generation EMNLP2025
A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts - up to thousands of tokens - via autoregressive generation from just one trained input embedding. In this work, we explore whether autoregressive decoding is essential for such reconstruction. We show that frozen LLMs can generate hundreds of accurate tokens in just one token-parallel forward pass, when provided with only two learned embeddings. This reveals a surprising and underexplored multi-token generation capability of autoregressive LLMs. We examine these embeddings and characterize the information they encode. We also empirically show that, although these representations are not unique for a given text, they form connected and local regions in embedding space - suggesting the potential to train a practical encoder. The existence of such representations hints that multi-token generation may be natively accessible in off-the-shelf LLMs via a learned input encoder, eliminating heavy retraining and helping to overcome the fundamental bottleneck of autoregressive decoding while reusing already-trained models.
comment: accepted to EMNLP2025 main
♻ ☆ R+R: Revisiting Static Feature-Based Android Malware Detection using Machine Learning
Static feature-based Android malware detection using machine learning (ML) remains critical due to its scalability and efficiency. However, existing approaches often overlook security-critical reproducibility concerns, such as dataset duplication, inadequate hyperparameter tuning, and variance from random initialization. This can significantly compromise the practical effectiveness of these systems. In this paper, we systematically investigate these challenges by proposing a more rigorous methodology for model selection and evaluation. Using two widely used datasets, Drebin and APIGraph, we evaluate six ML models of varying complexity under both offline and continuous active learning settings. Our analysis demonstrates that, contrary to popular belief, well-tuned, simpler models, particularly tree-based methods like XGBoost, consistently outperform more complex neural networks, especially when duplicates are removed. To promote transparency and reproducibility, we open-source our codebase, which is extensible for integrating new models and datasets, facilitating reproducible security research.
♻ ☆ Chronic Diseases Prediction using Machine Learning and Deep Learning Methods
Chronic diseases, such as cardiovascular disease, diabetes, chronic kidney disease, and thyroid disorders, are the leading causes of premature mortality worldwide. Early detection and intervention are crucial for improving patient outcomes, yet traditional diagnostic methods often fail due to the complex nature of these conditions. This study explores the application of machine learning (ML) and deep learning (DL) techniques to predict chronic disease and thyroid disorders. We used a variety of models, including Logistic Regression (LR), Random Forest (RF), Gradient Boosted Trees (GBT), Neural Networks (NN), Decision Trees (DT) and Native Bayes (NB), to analyze and predict disease outcomes. Our methodology involved comprehensive data pre-processing, including handling missing values, categorical encoding, and feature aggregation, followed by model training and evaluation. Performance metrics such ad precision, recall, accuracy, F1-score, and Area Under the Curve (AUC) were used to assess the effectiveness of each model. The results demonstrated that ensemble methods like Random Forest and Gradient Boosted Trees consistently outperformed. Neutral Networks also showed superior performance, particularly in capturing complex data patterns. The findings highlight the potential of ML and DL in revolutionizing chronic disease prediction, enabling early diagnosis and personalized treatment strategies. However, challenges such as data quality, model interpretability, and the need for advanced computational techniques in healthcare to improve patient outcomes and reduce the burden of chronic diseases. This study was conducted as part of Big Data class project under the supervision of our professors Mr. Abderrahmane EZ-ZAHOUT and Mr. Abdessamad ESSAIDI.
comment: The authors withdraw this paper following our supervisor's recommendation. Major modifications to the methodology and analysis are required, and the current version does not reflect the final validated work. A revised version will be submitted after completing the improvements
♻ ☆ A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation
Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent \textit{multi-step iterations} and \textit{complex backbone networks} lead to prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, \textbf{Diffusion Caching} offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process. By enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching evolves from \textit{static reuse} to \textit{dynamic prediction}. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time and efficient generative AI, injecting new vitality into both theory and practice of \textit{Efficient Generative Intelligence}.
comment: 22 pages,2 figures
♻ ☆ Neural Network Based Framework for Passive Intermodulation Cancellation in MIMO Systems
Passive intermodulation (PIM) has emerged as a critical source of self-interference in modern MIMO-OFDM systems, especially under the stringent requirements of 5G and beyond. Conventional cancellation methods often rely on complex nonlinear models with limited scalability and high computational cost. In this work, we propose a lightweight deep learning framework for PIM cancellation that leverages depthwise separable convolutions and dilated convolutions to efficiently capture nonlinear dependencies across antennas and subcarriers. To further enhance convergence, we adopt a cyclic learning rate schedule and gradient clipping. In a controlled MIMO experimental setup, the method effectively suppresses third-order passive intermodulation (PIM) distortion, achieving up to 29dB of average power error (APE) with only 11k trainable parameters. These results highlight the potential of compact neural architectures for scalable interference mitigation in future wireless communication systems.
♻ ☆ DMol: A Highly Efficient and Chemical Motif-Preserving Molecule Generation Platform
We introduce a new graph diffusion model for small molecule generation, DMol, which outperforms the state-of-the-art DiGress model in terms of validity by roughly 1.5% across all benchmarking datasets while reducing the number of diffusion steps by at least 10-fold, and the running time to roughly one half. The performance improvements are a result of a careful change in the objective function and a graph noise scheduling approach which, at each diffusion step, allows one to only change a subset of nodes of varying size in the molecule graph. Another relevant property of the method is that it can be easily combined with junction-tree-like graph representations that arise by compressing a collection of relevant ring structures into supernodes. Unlike classical junction-tree techniques that involve VAEs and require complicated reconstruction steps, compressed DMol directly performs graph diffusion on a graph that compresses only a carefully selected set of frequent carbon rings into supernodes, which results in straightforward sample generation. This compressed DMol method offers additional validity improvements over generic DMol of roughly 2%, increases the novelty of the method, and further improves the running time due to reductions in the graph size.
♻ ☆ IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages? AACL 2025
Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately $\sim$47K sentences. Surprisingly, our probing analysis of surface, syntactic, and semantic properties reveals that while almost all multilingual models demonstrate consistent encoding performance for English, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages. We make our code and dataset publicly available [https://github.com/aforakhilesh/IndicBertology].
comment: 25 pages, 11 figures, Accepted at IJCNLP-AACL 2025 Findings
♻ ☆ Chain of Retrieval: Multi-Aspect Iterative Search Expansion and Post-Order Search Aggregation for Full Paper Retrieval
Scientific paper retrieval, particularly framed as document-to-document retrieval, aims to identify relevant papers in response to a long-form query paper, rather than a short query string. Previous approaches to this task have focused exclusively on abstracts, embedding them into dense vectors as surrogates for full documents and calculating similarity between them. Yet, abstracts offer only sparse and high-level summaries, and such methods primarily optimize one-to-one similarity, overlooking the dynamic relations that emerge among relevant papers during the retrieval process. To address this, we propose Chain of Retrieval(COR), a novel iterative framework for full-paper retrieval. Specifically, CoR decomposes each query paper into multiple aspect-specific views, matches them against segmented candidate papers, and iteratively expands the search by promoting top-ranked results as new queries, thereby forming a tree-structured retrieval process. The resulting retrieval tree is then aggregated in a post-order manner: descendants are first combined at the query level, then recursively merged with their parent nodes, to capture hierarchical relations across iterations. To validate this, we present SCIFULLBENCH, a large-scale benchmark providing both complete and segmented contexts of full papers for queries and candidates, and results show that CoR significantly outperforms existing retrieval baselines. Our code and dataset is available at https://github.com/psw0021/Chain-of-Retrieval.git.
♻ ☆ Dequantified Diffusion-Schr{ö}dinger Bridge for Density Ratio Estimation
Density ratio estimation is fundamental to tasks involving $f$-divergences, yet existing methods often fail under significantly different distributions or inadequately overlapping supports -- the density-chasm and the support-chasm problems. Additionally, prior approaches yield divergent time scores near boundaries, leading to instability. We design $\textbf{D}^3\textbf{RE}$, a unified framework for \textbf{robust}, \textbf{stable} and \textbf{efficient} density ratio estimation. We propose the dequantified diffusion bridge interpolant (DDBI), which expands support coverage and stabilizes time scores via diffusion bridges and Gaussian dequantization. Building on DDBI, the proposed dequantified Schr{\"o}dinger bridge interpolant (DSBI) incorporates optimal transport to solve the Schr{\"o}dinger bridge problem, enhancing accuracy and efficiency. Our method offers uniform approximation and bounded time scores in theory, and outperforms baselines empirically in mutual information and density estimation tasks.
♻ ☆ PolyG: Adaptive Graph Traversal for Diverse GraphRAG Questions
GraphRAG enhances large language models (LLMs) to generate quality answers for user questions by retrieving related facts from external knowledge graphs. However, current GraphRAG methods are primarily evaluated on and overly tailored for knowledge graph question answering (KGQA) benchmarks, which are biased towards a few specific question patterns and do not reflect the diversity of real-world questions. To better evaluate GraphRAG methods, we propose a complete four-class taxonomy to categorize the basic patterns of knowledge graph questions and use it to create PolyBench, a new GraphRAG benchmark encompassing a comprehensive set of graph questions. With the new benchmark, we find that existing GraphRAG methods fall short in effectiveness (i.e., quality of the generated answers) and/or efficiency (i.e., response time or token usage) because they adopt either a fixed graph traversal strategy or free-form exploration by LLMs for fact retrieval. However, different question patterns require distinct graph traversal strategies and context formation. To facilitate better retrieval, we propose PolyG, an adaptive GraphRAG approach by decomposing and categorizing the questions according to our proposed question taxonomy. Built on top of a unified interface and execution engine, PolyG dynamically prompts an LLM to generate a graph database query to retrieve the context for each decomposed basic question. Compared with SOTA GraphRAG methods, PolyG achieves a higher win rate in generation quality and has a low response latency and token cost. Our code and benchmark are open-source at https://github.com/Liu-rj/PolyG.
♻ ☆ MMbeddings: Parameter-Efficient, Low-Overfitting Probabilistic Embeddings Inspired by Nonlinear Mixed Models
We present MMbeddings, a probabilistic embedding approach that reinterprets categorical embeddings through the lens of nonlinear mixed models, effectively bridging classical statistical theory with modern deep learning. By treating embeddings as latent random effects within a variational autoencoder framework, our method substantially decreases the number of parameters -- from the conventional embedding approach of cardinality $\times$ embedding dimension, which quickly becomes infeasible with large cardinalities, to a significantly smaller, cardinality-independent number determined primarily by the encoder architecture. This reduction dramatically mitigates overfitting and computational burden in high-cardinality settings. Extensive experiments on simulated and real datasets, encompassing collaborative filtering and tabular regression tasks using varied architectures, demonstrate that MMbeddings consistently outperforms traditional embeddings, underscoring its potential across diverse machine learning applications.
♻ ☆ Abstraction Alignment: Comparing Model-Learned and Human-Encoded Conceptual Relationships
While interpretability methods identify a model's learned concepts, they overlook the relationships between concepts that make up its abstractions and inform its ability to generalize to new data. To assess whether models' have learned human-aligned abstractions, we introduce abstraction alignment, a methodology to compare model behavior against formal human knowledge. Abstraction alignment externalizes domain-specific human knowledge as an abstraction graph, a set of pertinent concepts spanning levels of abstraction. Using the abstraction graph as a ground truth, abstraction alignment measures the alignment of a model's behavior by determining how much of its uncertainty is accounted for by the human abstractions. By aggregating abstraction alignment across entire datasets, users can test alignment hypotheses, such as which human concepts the model has learned and where misalignments recur. In evaluations with experts, abstraction alignment differentiates seemingly similar errors, improves the verbosity of existing model-quality metrics, and uncovers improvements to current human abstractions.
comment: 20 pages, 7 figures, published in CHI 2025
♻ ☆ Stochastic Subspace Descent Accelerated via Bi-fidelity Line Search
Efficient optimization remains a fundamental challenge across numerous scientific and engineering domains, especially when objective function and gradient evaluations are computationally expensive. While zeroth-order optimization methods offer effective approaches when gradients are inaccessible, their practical performance can be limited by the high cost associated with function queries. This work introduces the bi-fidelity stochastic subspace descent (BF-SSD) algorithm, a novel zeroth-order optimization method designed to reduce this computational burden. BF-SSD leverages a bi-fidelity framework, constructing a surrogate model from a combination of computationally inexpensive low-fidelity (LF) and accurate high-fidelity (HF) function evaluations. This surrogate model facilitates an efficient backtracking line search for step size selection, for which we provide theoretical convergence guarantees under standard assumptions. We perform a comprehensive empirical evaluation of BF-SSD across four distinct problems: a synthetic optimization benchmark, dual-form kernel ridge regression, black-box adversarial attacks on machine learning models, and transformer-based black-box language model fine-tuning. Numerical results demonstrate that BF-SSD consistently achieves superior optimization performance while requiring significantly fewer HF function evaluations compared to relevant baseline methods. This study highlights the efficacy of integrating bi-fidelity strategies within zeroth-order optimization, positioning BF-SSD as a promising and computationally efficient approach for tackling large-scale, high-dimensional problems encountered in various real-world applications.
♻ ☆ Robustifying Learning-Augmented Caching Efficiently without Compromising 1-Consistency NeurIPS 2025
The online caching problem aims to minimize cache misses when serving a sequence of requests under a limited cache size. While naive learning-augmented caching algorithms achieve ideal $1$-consistency, they lack robustness guarantees. Existing robustification methods either sacrifice $1$-consistency or introduce excessive computational overhead. In this paper, we introduce Guard, a lightweight robustification framework that enhances the robustness of a broad class of learning-augmented caching algorithms to $2H_{k-1} + 2$, while preserving their $1$-consistency. Guard achieves the current best-known trade-off between consistency and robustness, with only O(1) additional per-request overhead, thereby maintaining the original time complexity of the base algorithm. Extensive experiments across multiple real-world datasets and prediction models validate the effectiveness of Guard in practice.
comment: Accepted to NeurIPS 2025. https://neurips.cc/virtual/2025/poster/116615
♻ ☆ PrunedLoRA: Robust Gradient-Based structured pruning for Low-rank Adaptation in Fine-tuning
Low-rank adaptation (LoRA) has become a widely used paradigm for parameter-efficient fine-tuning of large language models, yet its representational capacity often lags behind full fine-tuning. Within the context of LoRA, a key open question is how to obtain expressive low-rank adapters from over-parameterized spaces. We propose \textit{PrunedLoRA}, a new framework that leverages structured pruning to obtain highly representative low-rank adapters from an over-parameterized initialization. Unlike prior approaches that impose a fixed low-rank budget, PrunedLoRA dynamically prunes less important components during fine-tuning and prevents their reactivation, enabling flexible and adaptive rank allocation. For structured pruning, by minimizing the pruning error for overall loss, we provide fine-grained pruning and recovery updates in a gradient-based pruning strategy with grounded interpretation. We provide the first theoretical analysis of the robustness of structured pruning and provably show that under the impact of weight perturbation, gradient-based pruning is more robust than activation-based pruning with respect to overall loss. Empirically, PrunedLoRA consistently outperforms LoRA and its variants across supervised fine-tuning tasks in mathematical reasoning, code generation, and natural language understanding, and it also demonstrates advantages over existing structured pruning methods across diverse sparsity levels.
♻ ☆ TESGNN: Temporal Equivariant Scene Graph Neural Networks for Efficient and Robust Multi-View 3D Scene Understanding
Scene graphs have proven to be highly effective for various scene understanding tasks due to their compact and explicit representation of relational information. However, current methods often overlook the critical importance of preserving symmetry when generating scene graphs from 3D point clouds, which can lead to reduced accuracy and robustness, particularly when dealing with noisy, multi-view data. Furthermore, a major limitation of prior approaches is the lack of temporal modeling to capture time-dependent relationships among dynamically evolving entities in a scene. To address these challenges, we propose Temporal Equivariant Scene Graph Neural Network (TESGNN), consisting of two key components: (1) an Equivariant Scene Graph Neural Network (ESGNN), which extracts information from 3D point clouds to generate scene graph while preserving crucial symmetry properties, and (2) a Temporal Graph Matching Network, which fuses scene graphs generated by ESGNN across multiple time sequences into a unified global representation using an approximate graph-matching algorithm. Our combined architecture TESGNN shown to be effective compared to existing methods in scene graph generation, achieving higher accuracy and faster training convergence. Moreover, we show that leveraging the symmetry-preserving property produces a more stable and accurate global scene representation compared to existing approaches. Finally, it is computationally efficient and easily implementable using existing frameworks, making it well-suited for real-time applications in robotics and computer vision. This approach paves the way for more robust and scalable solutions to complex multi-view scene understanding challenges. Our source code is publicly available at: https://github.com/HySonLab/TESGraph
comment: arXiv admin note: text overlap with arXiv:2407.00609
♻ ☆ Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training NeurIPS 2025
As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets-such as cosine learning rate schedules-are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the "river" structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this behavior, we conduct a theoretical and empirical analysis of SF dynamics, revealing that it implicitly performs weight averaging without memory overhead. Guided by this analysis, we propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes, addressing key limitations of the original method. Together, these results establish SF as a practical, scalable, and theoretically grounded approach for language model training.
comment: Published at NeurIPS 2025
♻ ☆ Continual Learning with Query-Only Attention
Continual learning involves learning from a stream of data without repetition of data points, a scenario that is inherently complex due to distributional shift across tasks. We propose a query-only attention mechanism that discards keys and values, yet preserves the core inductive bias of transformer architectures. In continual learning scenarios, this simplified mechanism significantly mitigates both loss of plasticity and catastrophic forgetting, outperforming baselines such as selective re-initialization. We establish a conceptual link between query-only attention, full transformer attention, and model agnostic meta-learning, framing them as instances of meta-learning. We further provide intuition for why query-based models and attention networks help preserve plasticity in continual settings. Finally, through preliminary Hessian spectrum analysis, we observe that models maintaining higher curvature rank across tasks tend to retain plasticity. Our findings suggest that full attention may not be essential for capturing the benefits of meta-learning in continual learning.
♻ ☆ Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime
Adam [Kingma and Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with $\ell_\infty$-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and we show that its bias can deviate from the full-batch behavior. To illustrate this, we construct a class of structured datasets where incremental Adam provably converges to the $\ell_2$-max-margin classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch Adam. For general datasets, we develop a proxy algorithm that captures the limiting behavior of incremental Adam as $\beta_2 \to 1$ and we characterize its convergence direction via a data-dependent dual fixed-point formulation. Finally, we prove that, unlike Adam, Signum [Bernstein et al., 2018] converges to the $\ell_\infty$-max-margin classifier for any batch size by taking $\beta$ close enough to 1. Overall, our results highlight that the implicit bias of Adam crucially depends on both the batching scheme and the dataset, while Signum remains invariant.
comment: 50 pages
♻ ☆ Evaluation and Optimization of Leave-one-out Cross-validation for the Lasso
I develop an algorithm to produce the piecewise quadratic that computes leave-one-out cross-validation for the lasso as a function of its hyperparameter. The algorithm can be used to find exact hyperparameters that optimize leave-one-out cross-validation either globally or locally, and its practicality is demonstrated on real-world data sets. I also show how the algorithm can be modified to compute approximate leave-one-out cross-validation, making it suitable for larger data sets.
comment: 20 pages, 4 figures, 7 tables
♻ ☆ Split Gibbs Discrete Diffusion Posterior Sampling NeurIPS 2025
We study the problem of posterior sampling in discrete-state spaces using discrete diffusion models. While posterior sampling methods for continuous diffusion models have achieved remarkable progress, analogous methods for discrete diffusion models remain challenging. In this work, we introduce a principled plug-and-play discrete diffusion posterior sampling algorithm based on split Gibbs sampling, which we call SGDD. Our algorithm enables reward-guided generation and solving inverse problems in discrete-state spaces. We demonstrate the convergence of SGDD to the target posterior distribution and verify this through controlled experiments on synthetic benchmarks. Our method enjoys state-of-the-art posterior sampling performance on a range of benchmarks for discrete data, including DNA sequence design, discrete image inverse problems, and music infilling, achieving more than 30% improved performance compared to existing baselines. Our code is available at https://github.com/chuwd19/Split-Gibbs-Discrete-Diffusion-Posterior-Sampling.
comment: Accepted to NeurIPS 2025
♻ ☆ Can Classic GNNs Be Strong Baselines for Graph-level Tasks? Simple Architectures Meet Excellence ICML 2025
Message-passing Graph Neural Networks (GNNs) are often criticized for their limited expressiveness, issues like over-smoothing and over-squashing, and challenges in capturing long-range dependencies. Conversely, Graph Transformers (GTs) are regarded as superior due to their employment of global attention mechanisms, which potentially mitigate these challenges. Literature frequently suggests that GTs outperform GNNs in graph-level tasks, especially for graph classification and regression on small molecular graphs. In this study, we explore the untapped potential of GNNs through an enhanced framework, GNN+, which integrates six widely used techniques: edge feature integration, normalization, dropout, residual connections, feed-forward networks, and positional encoding, to effectively tackle graph-level tasks. We conduct a systematic re-evaluation of three classic GNNs (GCN, GIN, and GatedGCN) enhanced by the GNN+ framework across 14 well-known graph-level datasets. Our results reveal that, contrary to prevailing beliefs, these classic GNNs consistently match or surpass the performance of GTs, securing top-three rankings across all datasets and achieving first place in eight. Furthermore, they demonstrate greater efficiency, running several times faster than GTs on many datasets. This highlights the potential of simple GNN architectures, challenging the notion that complex mechanisms in GTs are essential for superior graph-level performance. Our source code is available at https://github.com/LUOyk1999/GNNPlus.
comment: ICML 2025
♻ ☆ Is Grokking a Computational Glass Relaxation?
Understanding neural network's (NN) generalizability remains a central question in deep learning research. The special phenomenon of grokking, where NNs abruptly generalize long after the training performance reaches a near-perfect level, offers a unique window to investigate the underlying mechanisms of NNs' generalizability. Here we propose an interpretation for grokking by framing it as a computational glass relaxation: viewing NNs as a physical system where parameters are the degrees of freedom and train loss is the system energy, we find memorization process resembles a rapid cooling of liquid into non-equilibrium glassy state at low temperature and the later generalization is like a slow relaxation towards a more stable configuration. This mapping enables us to sample NNs' Boltzmann entropy (states of density) landscape as a function of training loss and test accuracy. Our experiments in transformers on arithmetic tasks suggests that there is NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition. We identify a high-entropy advantage under grokking, an extension of prior work linking entropy to generalizability but much more significant. Inspired by grokking's far-from-equilibrium nature, we develop a toy optimizer WanD based on Wang-landau molecular dynamics, which can eliminate grokking without any constraints and find high-norm generalizing solutions. This provides strictly-defined counterexamples to theory attributing grokking solely to weight norm evolution towards the Goldilocks zone and also suggests new potential ways for optimizer design.
♻ ☆ Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Decoder-Only Transformers
The Query, Key, Value weight triplet is a building block of current attention mechanisms in state-of-the-art LLMs. We theoretically investigate whether this triplet can be reduced, proving under simplifying assumptions that the Query weights are redundant, thereby reducing the number of non-embedding/lm-head parameters by over 8%. We validate the theory on full-complexity GPT-3 small architectures (with layer normalization, skip connections, and weight decay) trained from scratch, demonstrating that the reduced model achieves comparable validation loss to standard baselines. These findings motivate the investigation of the Query weight redundancy at scale.
♻ ☆ InputDSA: Demixing then Comparing Recurrent and Externally Driven Dynamics
In control problems and basic scientific modeling, it is important to compare observations with dynamical simulations. For example, comparing two neural systems can shed light on the nature of emergent computations in the brain and deep neural networks. Recently, Ostrow et al. (2023) introduced Dynamical Similarity Analysis (DSA), a method to measure the similarity of two systems based on their recurrent dynamics rather than geometry or topology. However, DSA does not consider how inputs affect the dynamics, meaning that two similar systems, if driven differently, may be classified as different. Because real-world dynamical systems are rarely autonomous, it is important to account for the effects of input drive. To this end, we introduce a novel metric for comparing both intrinsic (recurrent) and input-driven dynamics, called InputDSA (iDSA). InputDSA extends the DSA framework by estimating and comparing both input and intrinsic dynamic operators using a variant of Dynamic Mode Decomposition with control (DMDc) based on subspace identification. We demonstrate that InputDSA can successfully compare partially observed, input-driven systems from noisy data. We show that when the true inputs are unknown, surrogate inputs can be substituted without a major deterioration in similarity estimates. We apply InputDSA on Recurrent Neural Networks (RNNs) trained with Deep Reinforcement Learning, identifying that high-performing networks are dynamically similar to one another, while low-performing networks are more diverse. Lastly, we apply InputDSA to neural data recorded from rats performing a cognitive task, demonstrating that it identifies a transition from input-driven evidence accumulation to intrinsically-driven decision-making. Our work demonstrates that InputDSA is a robust and efficient method for comparing intrinsic dynamics and the effect of external input on dynamical systems.
comment: 36 pages, 14 figures
♻ ☆ Penalized Empirical Likelihood for Doubly Robust Causal Inference under Contamination in High Dimensions
We propose a doubly robust estimator for the average treatment effect in high dimensional low sample size observational studies, where contamination and model misspecification pose serious inferential challenges. The estimator combines bounded influence estimating equations for outcome modeling with covariate balancing propensity scores for treatment assignment, embedded within a penalized empirical likelihood framework using nonconvex regularization. It satisfies the oracle property by jointly achieving consistency under partial model correct ness, selection consistency, robustness to contamination, and asymptotic normality. For uncertainty quantification, we derive a finite sample confidence interval using cumulant generating functions and influence function corrections, avoiding reliance on asymptotic approximations. Simulation studies and applications to gene expression datasets (Golub and Khan) demonstrate superior performance in bias, error metrics, and interval calibration, highlighting the method robustness and inferential validity in HDLSS regimes. One notable aspect is that even in the absence of contamination, the proposed estimator and its confidence interval remain efficient compared to those of competing models.
♻ ☆ Beyond PCA: Manifold Dimension Estimation via Local Graph Structure
Local principal component analysis (Local PCA) has proven to be an effective tool for estimating the intrinsic dimension of a manifold. More recently, curvature-adjusted PCA (CA-PCA) has improved upon this approach by explicitly accounting for the curvature of the underlying manifold, rather than assuming local flatness. Building on these insights, we propose a general framework for manifold dimension estimation that captures the manifold's local graph structure by integrating PCA with regression-based techniques. Within this framework, we introduce two representative estimators: quadratic embedding (QE) and total least squares (TLS). Experiments on both synthetic and real-world datasets demonstrate that these methods perform competitively with, and often outperform, state-of-the-art alternatives.
Graphics 6
☆ Applying Medical Imaging Tractography Techniques to Painterly Rendering of Images
Doctors and researchers routinely use diffusion tensor imaging (DTI) and tractography to visualize the fibrous structure of tissues in the human body. This paper explores the connection of these techniques to the painterly rendering of images. Using a tractography algorithm the presented method can place brush strokes that mimic the painting process of human artists, analogously to how fibres are tracked in DTI. The analogue to the diffusion tensor for image orientation is the structural tensor, which can provide better local orientation information than the gradient alone. I demonstrate this technique in portraits and general images, and discuss the parallels between fibre tracking and brush stroke placement, and frame it in the language of tractography. This work presents an exploratory investigation into the cross-domain application of diffusion tensor imaging techniques to painterly rendering of images. All the code is available at https://github.com/tito21/st-python
comment: Exploratory investigation applying medical imaging tractography techniques to painterly image rendering. Code available at https://github.com/tito21/st-python
☆ Image-based ground distance detection for crop-residue-covered soil
Conservation agriculture features a soil surface covered with crop residues, which brings benefits of improving soil health and saving water. However, one significant challenge in conservation agriculture lies in precisely controlling the seeding depth on the soil covered with crop residues. This is constrained by the lack of ground distance information, since current distance measurement techniques, like laser, ultrasonic, or mechanical displacement sensors, are incapable of differentiating whether the distance information comes from the residue or the soil. This paper presents an image-based method to get the ground distance information for the crop-residues-covered soil. This method is performed with 3D camera and RGB camera, obtaining depth image and color image at the same time. The color image is used to distinguish the different areas of residues and soil and finally generates a mask image. The mask image is applied to the depth image so that only the soil area depth information can be used to calculate the ground distance, and residue areas can be recognized and excluded from ground distance detection. Experimentation shows that this distance measurement method is feasible for real-time implementation, and the measurement error is within plus or minus 3mm. It can be applied in conservation agriculture machinery for precision depth seeding, as well as other depth-control-demanding applications like transplant or tillage.
comment: under review at Computers and Electronics in Agriculture
☆ Oitijjo-3D: Generative AI Framework for Rapid 3D Heritage Reconstruction from Street View Imagery CEC
Cultural heritage restoration in Bangladesh faces a dual challenge of limited resources and scarce technical expertise. Traditional 3D digitization methods, such as photogrammetry or LiDAR scanning, require expensive hardware, expert operators, and extensive on-site access, which are often infeasible in developing contexts. As a result, many of Bangladesh's architectural treasures, from the Paharpur Buddhist Monastery to Ahsan Manzil, remain vulnerable to decay and inaccessible in digital form. This paper introduces Oitijjo-3D, a cost-free generative AI framework that democratizes 3D cultural preservation. By using publicly available Google Street View imagery, Oitijjo-3D reconstructs faithful 3D models of heritage structures through a two-stage pipeline - multimodal visual reasoning with Gemini 2.5 Flash Image for structure-texture synthesis, and neural image-to-3D generation through Hexagen for geometry recovery. The system produces photorealistic, metrically coherent reconstructions in seconds, achieving significant speedups compared to conventional Structure-from-Motion pipelines, without requiring any specialized hardware or expert supervision. Experiments on landmarks such as Ahsan Manzil, Choto Sona Mosque, and Paharpur demonstrate that Oitijjo-3D preserves both visual and structural fidelity while drastically lowering economic and technical barriers. By turning open imagery into digital heritage, this work reframes preservation as a community-driven, AI-assisted act of cultural continuity for resource-limited nations.
comment: 6 Pages, 4 figures, 2 Tables, Submitted to ICECTE 2026
♻ ☆ Audio Driven Real-Time Facial Animation for Social Telepresence SIGGRAPH
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency, designed for social interactions in virtual reality for anyone. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time, which are then decoded as photorealistic 3D facial avatars. Leveraging the generative capabilities of diffusion models, we capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance (<15ms GPU time). Our novel architecture minimizes latency through two key innovations: an online transformer that eliminates dependency on future inputs and a distillation pipeline that accelerates iterative denoising into a single step. We further address critical design challenges in live scenarios for processing continuous audio signals frame-by-frame while maintaining consistent animation quality. The versatility of our framework extends to multimodal applications, including semantic modalities such as emotion conditions and multimodal sensors with head-mounted eye cameras on VR headsets. Experimental results demonstrate significant improvements in facial animation accuracy over existing offline state-of-the-art baselines, achieving 100 to 1000 times faster inference speed. We validate our approach through live VR demonstrations and across various scenarios such as multilingual speeches.
comment: SIGGRAPH Asia 2025. Project page: https://jiyewise.github.io/projects/AudioRTA
♻ ☆ Anti-Aliased 2D Gaussian Splatting NeurIPS 2025
2D Gaussian Splatting (2DGS) has recently emerged as a promising method for novel view synthesis and surface reconstruction, offering better view-consistency and geometric accuracy than volumetric 3DGS. However, 2DGS suffers from severe aliasing artifacts when rendering at different sampling rates than those used during training, limiting its practical applications in scenarios requiring camera zoom or varying fields of view. We identify that these artifacts stem from two key limitations: the lack of frequency constraints in the representation and an ineffective screen-space clamping approach. To address these issues, we present AA-2DGS, an anti-aliased formulation of 2D Gaussian Splatting that maintains its geometric benefits while significantly enhancing rendering quality across different scales. Our method introduces a world-space flat smoothing kernel that constrains the frequency content of 2D Gaussian primitives based on the maximal sampling frequency from training views, effectively eliminating high-frequency artifacts when zooming in. Additionally, we derive a novel object-space Mip filter by leveraging an affine approximation of the ray-splat intersection mapping, which allows us to efficiently apply proper anti-aliasing directly in the local space of each splat.
comment: NeurIPS 2025. Code will be available at https://github.com/maeyounes/AA-2DGS
♻ ☆ SViM3D: Stable Video Material Diffusion for Single Image 3D Generation ICCV 2025
We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. Recently, video diffusion models have been successfully used to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or needs to be estimated in additional steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially varying PBR parameters and surface normals jointly with each generated view based on explicit camera control. This unique setup allows for relighting and generating a 3D asset using our model as neural prior. We introduce various mechanisms to this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games and other visual media.
comment: Accepted by International Conference on Computer Vision (ICCV 2025). Project page: http://svim3d.aengelhardt.com
Robotics 31
☆ Whole-Body Proprioceptive Morphing: A Modular Soft Gripper for Robust Cross-Scale Grasping
Biological systems, such as the octopus, exhibit masterful cross-scale manipulation by adaptively reconfiguring their entire form, a capability that remains elusive in robotics. Conventional soft grippers, while compliant, are mostly constrained by a fixed global morphology, and prior shape-morphing efforts have been largely confined to localized deformations, failing to replicate this biological dexterity. Inspired by this natural exemplar, we introduce the paradigm of collaborative, whole-body proprioceptive morphing, realized in a modular soft gripper architecture. Our design is a distributed network of modular self-sensing pneumatic actuators that enables the gripper to intelligently reconfigure its entire topology, achieving multiple morphing states that are controllable to form diverse polygonal shapes. By integrating rich proprioceptive feedback from embedded sensors, our system can seamlessly transition from a precise pinch to a large envelope grasp. We experimentally demonstrate that this approach expands the grasping envelope and enhances generalization across diverse object geometries (standard and irregular) and scales (up to 10$\times$), while also unlocking novel manipulation modalities such as multi-object and internal hook grasping. This work presents a low-cost, easy-to-fabricate, and scalable framework that fuses distributed actuation with integrated sensing, offering a new pathway toward achieving biological levels of dexterity in robotic manipulation.
☆ Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Based on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, where action and vision tokens evolve asynchronously at different rates. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, while our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.
comment: 20 pages, 10 figures
☆ Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs
This paper presents a framework that leverages pre-trained foundation models for robotic manipulation without domain-specific training. The framework integrates off-the-shelf models, combining multimodal perception from foundation models with a general-purpose reasoning model capable of robust task sequencing. Scene graphs, dynamically maintained within the framework, provide spatial awareness and enable consistent reasoning about the environment. The framework is evaluated through a series of tabletop robotic manipulation experiments, and the results highlight its potential for building robotic manipulation systems directly on top of off-the-shelf foundation models.
☆ EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities
Implicit policies parameterized by generative models, such as Diffusion Policy, have become the standard for policy learning and Vision-Language-Action (VLA) models in robotics. However, these approaches often suffer from high computational cost, exposure bias, and unstable inference dynamics, which lead to divergence under distribution shifts. Energy-Based Models (EBMs) address these issues by learning energy landscapes end-to-end and modeling equilibrium dynamics, offering improved robustness and reduced exposure bias. Yet, policies parameterized by EBMs have historically struggled to scale effectively. Recent work on Energy-Based Transformers (EBTs) demonstrates the scalability of EBMs to high-dimensional spaces, but their potential for solving core challenges in physically embodied models remains underexplored. We introduce a new energy-based architecture, EBT-Policy, that solves core issues in robotic and real-world settings. Across simulated and real-world tasks, EBT-Policy consistently outperforms diffusion-based policies, while requiring less training and inference computation. Remarkably, on some tasks it converges within just two inference steps, a 50x reduction compared to Diffusion Policy's 100. Moreover, EBT-Policy exhibits emergent capabilities not seen in prior models, such as zero-shot recovery from failed action sequences using only behavior cloning and without explicit retry training. By leveraging its scalar energy for uncertainty-aware inference and dynamic compute allocation, EBT-Policy offers a promising path toward robust, generalizable robot behavior under distribution shifts.
comment: 9 pages, 6 figures, 4 tables
☆ Preliminary Prototyping of Avoidance Behaviors Triggered by a User's Physical Approach to a Robot
Human-robot interaction frequently involves physical proximity or contact. In human-human settings, people flexibly accept, reject, or tolerate such approaches depending on the relationship and context. We explore the design of a robot's rejective internal state and corresponding avoidance behaviors, such as withdrawing or pushing away, when a person approaches. We model the accumulation and decay of discomfort as a function of interpersonal distance, and implement tolerance (endurance) and limit-exceeding avoidance driven by the Dominance axis of the PAD affect model. The behaviors and their intensities are realized on an arm robot. Results illustrate a coherent pipeline from internal state parameters to graded endurance motions and, once a limit is crossed, to avoidance actions.
comment: Workshop on Socially Aware and Cooperative Intelligent Systems in HAI 2025
☆ Learning Soft Robotic Dynamics with Active Exploration
Soft robots offer unmatched adaptability and safety in unstructured environments, yet their compliant, high-dimensional, and nonlinear dynamics make modeling for control notoriously difficult. Existing data-driven approaches often fail to generalize, constrained by narrowly focused task demonstrations or inefficient random exploration. We introduce SoftAE, an uncertainty-aware active exploration framework that autonomously learns task-agnostic and generalizable dynamics models of soft robotic systems. SoftAE employs probabilistic ensemble models to estimate epistemic uncertainty and actively guides exploration toward underrepresented regions of the state-action space, achieving efficient coverage of diverse behaviors without task-specific supervision. We evaluate SoftAE on three simulated soft robotic platforms -- a continuum arm, an articulated fish in fluid, and a musculoskeletal leg with hybrid actuation -- and on a pneumatically actuated continuum soft arm in the real world. Compared with random exploration and task-specific model-based reinforcement learning, SoftAE produces more accurate dynamics models, enables superior zero-shot control on unseen tasks, and maintains robustness under sensing noise, actuation delays, and nonlinear material effects. These results demonstrate that uncertainty-driven active exploration can yield scalable, reusable dynamics models across diverse soft robotic morphologies, representing a step toward more autonomous, adaptable, and data-efficient control in compliant robots.
☆ Towards a Multi-Embodied Grasping Agent
Multi-embodiment grasping focuses on developing approaches that exhibit generalist behavior across diverse gripper designs. Existing methods often learn the kinematic structure of the robot implicitly and face challenges due to the difficulty of sourcing the required large-scale data. In this work, we present a data-efficient, flow-based, equivariant grasp synthesis architecture that can handle different gripper types with variable degrees of freedom and successfully exploit the underlying kinematic model, deducing all necessary information solely from the gripper and scene geometry. Unlike previous equivariant grasping methods, we translated all modules from the ground up to JAX and provide a model with batching capabilities over scenes, grippers, and grasps, resulting in smoother learning, improved performance and faster inference time. Our dataset encompasses grippers ranging from humanoid hands to parallel yaw grippers and includes 25,000 scenes and 20 million grasps.
comment: 9 pages, 3 figures
☆ Modified-Emergency Index (MEI): A Criticality Metric for Autonomous Driving in Lateral Conflict
Effective, reliable, and efficient evaluation of autonomous driving safety is essential to demonstrate its trustworthiness. Criticality metrics provide an objective means of assessing safety. However, as existing metrics primarily target longitudinal conflicts, accurately quantifying the risks of lateral conflicts - prevalent in urban settings - remains challenging. This paper proposes the Modified-Emergency Index (MEI), a metric designed to quantify evasive effort in lateral conflicts. Compared to the original Emergency Index (EI), MEI refines the estimation of the time available for evasive maneuvers, enabling more precise risk quantification. We validate MEI on a public lateral conflict dataset based on Argoverse-2, from which we extract over 1,500 high-quality AV conflict cases, including more than 500 critical events. MEI is then compared with the well-established ACT and the widely used PET metrics. Results show that MEI consistently outperforms them in accurately quantifying criticality and capturing risk evolution. Overall, these findings highlight MEI as a promising metric for evaluating urban conflicts and enhancing the safety assessment framework for autonomous driving. The open-source implementation is available at https://github.com/AutoChengh/MEI.
☆ A Modular and Scalable System Architecture for Heterogeneous UAV Swarms Using ROS 2 and PX4-Autopilot
In this paper a modular and scalable architecture for heterogeneous swarm-based Counter Unmanned Aerial Systems (C-UASs) built on PX4-Autopilot and Robot Operating System 2 (ROS 2) framework is presented. The proposed architecture emphasizes seamless integration of hardware components by introducing independent ROS 2 nodes for each component of a Unmanned Aerial Vehicle (UAV). Communication between swarm participants is abstracted in software, allowing the use of various technologies without architectural changes. Key functionalities are supported, e.g. leader following and formation flight to maneuver the swarm. The system also allows computer vision algorithms to be integrated for the detection and tracking of UAVs. Additionally, a ground station control is integrated for the coordination of swarm operations. Swarm-based Unmanned Aerial System (UAS) architecture is verified within a Gazebo simulation environment but also in real-world demonstrations.
☆ Vectorized Online POMDP Planning ICRA 2026
Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for planning under partial observability problems, capturing the stochastic effects of actions and the limited information available through noisy observations. POMDP solving could benefit tremendously from massive parallelization of today's hardware, but parallelizing POMDP solvers has been challenging. They rely on interleaving numerical optimization over actions with the estimation of their values, which creates dependencies and synchronization bottlenecks between parallel processes that can quickly offset the benefits of parallelization. In this paper, we propose Vectorized Online POMDP Planner (VOPP), a novel parallel online solver that leverages a recent POMDP formulation that analytically solves part of the optimization component, leaving only the estimation of expectations for numerical computation. VOPP represents all data structures related to planning as a collection of tensors and implements all planning steps as fully vectorized computations over this representation. The result is a massively parallel solver with no dependencies and synchronization bottlenecks between parallel computations. Experimental results indicate that VOPP is at least 20X more efficient in computing near-optimal solutions compared to an existing state-of-the-art parallel online solver.
comment: 8 pages, 3 figures. Submitted to ICRA 2026
☆ Hybrid Gripper Finger Enabling In-Grasp Friction Modulation Using Inflatable Silicone Pockets ICRA 2026
Grasping objects with diverse mechanical properties, such as heavy, slippery, or fragile items, remains a significant challenge in robotics. Conventional grippers often rely on applying high normal forces, which can cause damage to objects. To address this limitation, we present a hybrid gripper finger that combines a rigid structural shell with a soft, inflatable silicone pocket. The gripper finger can actively modulate its surface friction by controlling the internal air pressure of the silicone pocket. Results from fundamental experiments indicate that increasing the internal pressure results in a proportional increase in the effective coefficient of friction. This enables the gripper to stably lift heavy and slippery objects without increasing the gripping force and to handle fragile or deformable objects, such as eggs, fruits, and paper cups, with minimal damage by increasing friction rather than applying excessive force. The experimental results demonstrate that the hybrid gripper finger with adaptable friction provides a robust and safer alternative to relying solely on high normal forces, thereby enhancing the gripper flexibility in handling delicate, fragile, and diverse objects.
comment: Submitted to ICRA 2026
☆ MobiDock: Design and Control of A Modular Self Reconfigurable Bimanual Mobile Manipulator via Robotic Docking ICRA2026
Multi-robot systems, particularly mobile manipulators, face challenges in control coordination and dynamic stability when working together. To address this issue, this study proposes MobiDock, a modular self-reconfigurable mobile manipulator system that allows two independent robots to physically connect and form a unified mobile bimanual platform. This process helps transform a complex multi-robot control problem into the management of a simpler, single system. The system utilizes an autonomous docking strategy based on computer vision with AprilTag markers and a new threaded screw-lock mechanism. Experimental results show that the docked configuration demonstrates better performance in dynamic stability and operational efficiency compared to two independently cooperating robots. Specifically, the unified system has lower Root Mean Square (RMS) Acceleration and Jerk values, higher angular precision, and completes tasks significantly faster. These findings confirm that physical reconfiguration is a powerful design principle that simplifies cooperative control, improving stability and performance for complex tasks in real-world environments.
comment: ICRA2026 submited
☆ Confined Space Underwater Positioning Using Collaborative Robots
Positioning of underwater robots in confined and cluttered spaces remains a key challenge for field operations. Existing systems are mostly designed for large, open-water environments and struggle in industrial settings due to poor coverage, reliance on external infrastructure, and the need for feature-rich surroundings. Multipath effects from continuous sound reflections further degrade signal quality, reducing accuracy and reliability. Accurate and easily deployable positioning is essential for repeatable autonomous missions; however, this requirement has created a technological bottleneck limiting underwater robotic deployment. This paper presents the Collaborative Aquatic Positioning (CAP) system, which integrates collaborative robotics and sensor fusion to overcome these limitations. Inspired by the "mother-ship" concept, the surface vehicle acts as a mobile leader to assist in positioning a submerged robot, enabling localization even in GPS-denied and highly constrained environments. The system is validated in a large test tank through repeatable autonomous missions using CAP's position estimates for real-time trajectory control. Experimental results demonstrate a mean Euclidean distance (MED) error of 70 mm, achieved in real time without requiring fixed infrastructure, extensive calibration, or environmental features. CAP leverages advances in mobile robot sensing and leader-follower control to deliver a step change in accurate, practical, and infrastructure-free underwater localization.
comment: 31 pages including appendix, 24 figures
☆ WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond
3D Gaussian splatting (3DGS) and its subsequent variants have led to remarkable progress in simultaneous localization and mapping (SLAM). While most recent 3DGS-based SLAM works focus on small-scale indoor scenes, developing 3DGS-based SLAM methods for large-scale forest scenes holds great potential for many real-world applications, especially for wildfire emergency response and forest management. However, this line of research is impeded by the absence of a comprehensive and high-quality dataset, and collecting such a dataset over real-world scenes is costly and technically infeasible. To this end, we have built a large-scale, comprehensive, and high-quality synthetic dataset for SLAM in wildfire and forest environments. Leveraging the Unreal Engine 5 Electric Dreams Environment Sample Project, we developed a pipeline to easily collect aerial and ground views, including ground-truth camera poses and a range of additional data modalities from unmanned aerial vehicle. Our pipeline also provides flexible controls on environmental factors such as light, weather, and types and conditions of wildfire, supporting the need for various tasks covering forest mapping, wildfire emergency response, and beyond. The resulting pilot dataset, WildfireX-SLAM, contains 5.5k low-altitude RGB-D aerial images from a large-scale forest map with a total size of 16 km2. On top of WildfireX-SLAM, a thorough benchmark is also conducted, which not only reveals the unique challenges of 3DGS-based SLAM in the forest but also highlights potential improvements for future works. The dataset and code will be publicly available. Project page: https://zhicongsun.github.io/wildfirexslam.
comment: This paper has been accepted by MMM 2026
☆ Learning Generalizable Visuomotor Policy through Dynamics-Alignment
Behavior cloning methods for robot learning suffer from poor generalization due to limited data support beyond expert demonstrations. Recent approaches leveraging video prediction models have shown promising results by learning rich spatiotemporal representations from large-scale datasets. However, these models learn action-agnostic dynamics that cannot distinguish between different control inputs, limiting their utility for precise manipulation tasks and requiring large pretraining datasets. We propose a Dynamics-Aligned Flow Matching Policy (DAP) that integrates dynamics prediction into policy learning. Our method introduces a novel architecture where policy and dynamics models provide mutual corrective feedback during action generation, enabling self-correction and improved generalization. Empirical validation demonstrates generalization performance superior to baseline methods on real-world robotic manipulation tasks, showing particular robustness in OOD scenarios including visual distractions and lighting variations.
comment: 9 pages, 6 figures
♻ ☆ RObotic MAnipulation Network (ROMAN) -- Hybrid Hierarchical Learning for Solving Complex Sequential Tasks
Solving long sequential tasks poses a significant challenge in embodied artificial intelligence. Enabling a robotic system to perform diverse sequential tasks with a broad range of manipulation skills is an active area of research. In this work, we present a Hybrid Hierarchical Learning framework, the Robotic Manipulation Network (ROMAN), to address the challenge of solving multiple complex tasks over long time horizons in robotic manipulation. ROMAN achieves task versatility and robust failure recovery by integrating behavioural cloning, imitation learning, and reinforcement learning. It consists of a central manipulation network that coordinates an ensemble of various neural networks, each specialising in distinct re-combinable sub-tasks to generate their correct in-sequence actions for solving complex long-horizon manipulation tasks. Experimental results show that by orchestrating and activating these specialised manipulation experts, ROMAN generates correct sequential activations for accomplishing long sequences of sophisticated manipulation tasks and achieving adaptive behaviours beyond demonstrations, while exhibiting robustness to various sensory noises. These results demonstrate the significance and versatility of ROMAN's dynamic adaptability featuring autonomous failure recovery capabilities, and highlight its potential for various autonomous manipulation tasks that demand adaptive motor skills.
comment: To appear in Nature Machine Intelligence. Includes the main and supplementary manuscript. Total of 70 pages, with a total of 9 Figures and 17 Tables
♻ ☆ GenSwarm: Scalable Multi-Robot Code-Policy Generation and Deployment via Language Models
The development of control policies for multi-robot systems traditionally follows a complex and labor-intensive process, often lacking the flexibility to adapt to dynamic tasks. This has motivated research on methods to automatically create control policies. However, these methods require iterative processes of manually crafting and refining objective functions, thereby prolonging the development cycle. This work introduces \textit{GenSwarm}, an end-to-end system that leverages large language models to automatically generate and deploy control policies for multi-robot tasks based on simple user instructions in natural language. As a multi-language-agent system, GenSwarm achieves zero-shot learning, enabling rapid adaptation to altered or unseen tasks. The white-box nature of the code policies ensures strong reproducibility and interpretability. With its scalable software and hardware architectures, GenSwarm supports efficient policy deployment on both simulated and real-world multi-robot systems, realizing an instruction-to-execution end-to-end functionality that could prove valuable for robotics specialists and non-specialists alike.The code of the proposed GenSwarm system is available online: https://github.com/WindyLab/GenSwarm.
comment: This article has been accepted for publication in npj Robotics
♻ ☆ A Study on Human-Swarm Interaction: A Framework for Assessing Situation Awareness and Task Performance
This paper introduces a framework for human swarm interaction studies that measures situation awareness in dynamic environments. A tablet-based interface was developed for a user study by implementing the concepts introduced in the framework, where operators guided a robotic swarm in a single-target search task, marking hazardous cells unknown to the swarm. Both subjective and objective situation awareness measures were used, with task performance evaluated based on how close the robots were to the target. The framework enabled a structured investigation of the role of situation awareness in human swarm interaction, leading to key findings such as improved task performance across attempts, showing the interface was learnable, centroid active robot position proved to be a useful task performance metric for assessing situation awareness, perception and projection played a key role in task performance, highlighting their importance in interface design and objective situation awareness influenced both subjective situation awareness and task performance, emphasizing the need for interfaces that emphasise objective situation awareness. These findings validate our framework as a structured approach for integrating situation awareness concepts into human swarm interaction studies, offering a systematic way to assess situation awareness and task performance. The framework can be applied to other swarming studies to evaluate interface learnability, identify meaningful task performance metrics, and refine interface designs to enhance situation awareness, ultimately improving human swarm interaction in dynamic environments.
comment: 10 pages, 8 figures, 2 tables, 2 equations
♻ ☆ Uncertainty-Based Smooth Policy Regularisation for Reinforcement Learning with Few Demonstrations
In reinforcement learning with sparse rewards, demonstrations can accelerate learning, but determining when to imitate them remains challenging. We propose Smooth Policy Regularisation from Demonstrations (SPReD), a framework that addresses the fundamental question: when should an agent imitate a demonstration versus follow its own policy? SPReD uses ensemble methods to explicitly model Q-value distributions for both demonstration and policy actions, quantifying uncertainty for comparisons. We develop two complementary uncertainty-aware methods: a probabilistic approach estimating the likelihood of demonstration superiority, and an advantage-based approach scaling imitation by statistical significance. Unlike prevailing methods (e.g. Q-filter) that make binary imitation decisions, SPReD applies continuous, uncertainty-proportional regularisation weights, reducing gradient variance during training. Despite its computational simplicity, SPReD achieves remarkable gains in experiments across eight robotics tasks, outperforming existing approaches by up to a factor of 14 in complex tasks while maintaining robustness to demonstration quality and quantity. Our code is available at https://github.com/YujieZhu7/SPReD.
♻ ☆ A Tactile Feedback Approach to Path Recovery after High-Speed Impacts for Collision-Resilient Drones
Aerial robots are a well-established solution for exploration, monitoring, and inspection, thanks to their superior maneuverability and agility. However, in many environments, they risk crashing and sustaining damage after collisions. Traditional methods focus on avoiding obstacles entirely, but these approaches can be limiting, particularly in cluttered spaces or on weight-and compute-constrained platforms such as drones. This paper presents a novel approach to enhance drone robustness and autonomy by developing a path recovery and adjustment method for a high-speed collision-resilient aerial robot equipped with lightweight, distributed tactile sensors. The proposed system explicitly models collisions using pre-collision velocities, rates and tactile feedback to predict post-collision dynamics, improving state estimation accuracy. Additionally, we introduce a computationally efficient vector-field-based path representation that guarantees convergence to a user-specified path, while naturally avoiding known obstacles. Post-collision, contact point locations are incorporated into the vector field as a repulsive potential, enabling the drone to avoid obstacles while naturally returning to its path. The effectiveness of this method is validated through Monte Carlo simulations and demonstrated on a physical prototype, showing successful path following, collision recovery, and adjustment at speeds up to 3.7 m/s.
♻ ☆ SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
With the integration of large language models (LLMs), embodied agents have strong capabilities to understand and plan complicated natural language instructions. However, a foreseeable issue is that those embodied agents can also flawlessly execute some hazardous tasks, potentially causing damages in the real world. Existing benchmarks predominantly overlook critical safety risks, focusing solely on planning performance, while a few evaluate LLMs' safety awareness only on non-interactive image-text data. To address this gap, we present SafeAgentBench -- the first comprehensive benchmark for safety-aware task planning of embodied LLM agents in interactive simulation environments, covering both explicit and implicit hazards. SafeAgentBench includes: (1) an executable, diverse, and high-quality dataset of 750 tasks, rigorously curated to cover 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 9 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives. Experimental results show that, although agents based on different design frameworks exhibit substantial differences in task success rates, their overall safety awareness remains weak. The most safety-conscious baseline achieves only a 10% rejection rate for detailed hazardous tasks. Moreover, simply replacing the LLM driving the agent does not lead to notable improvements in safety awareness. Dataset and codes are available in https://github.com/shengyin1224/SafeAgentBench and https://huggingface.co/datasets/safeagentbench/SafeAgentBench.
comment: 28 pages, 19 tables, 15 figures
♻ ☆ From Canada to Japan: How 10,000 km Affect User Perception in Robot Teleoperation
Robot teleoperation (RTo) has emerged as a viable alternative to local control, particularly when human intervention is still necessary. This research aims to study the distance effect on user perception in RTo, exploring the potential of teleoperated robots for older adult care. We propose an evaluation of non-expert users' perception of long-distance RTo, examining how their perception changes before and after interaction, as well as comparing it to that of locally operated robots. We have designed a specific protocol consisting of multiple questionnaires, along with a dedicated software architecture using the Robotics Operating System (ROS) and Unity. The results revealed no statistically significant differences between the local and remote robot conditions, suggesting that robots may be a viable alternative to traditional local control.
comment: Author preprint - Accepted for Humanoids 2025
♻ ☆ A Practical-Driven Framework for Transitioning Drive-by-Wire to Autonomous Driving Systems: A Case Study with a Chrysler Pacifica Hybrid Vehicle
Transitioning from a Drive-by-Wire (DBW) system to a fully autonomous driving system (ADS) involves multiple stages of development and demands robust positioning and sensing capabilities. This paper presents a practice-driven framework for facilitating the DBW-to-ADS transition using a 2022 Chrysler Pacifica Hybrid Minivan equipped with cameras, LiDAR, GNSS, and onboard computing hardware configured with the Robot Operating System (ROS) and Autoware.AI. The implementation showcases offline autonomous operations utilizing pre-recorded LiDAR and camera data, point clouds, and vector maps, enabling effective localization and path planning within a structured test environment. The study addresses key challenges encountered during the transition, particularly those related to wireless-network-assisted sensing and positioning. It offers practical solutions for overcoming software incompatibility constraints, sensor synchronization issues, and limitations in real-time perception. Furthermore, the integration of sensing, data fusion, and automation is emphasized as a critical factor in supporting autonomous driving systems in map generation, simulation, and training. Overall, the transition process outlined in this work aims to provide actionable strategies for researchers pursuing DBW-to-ADS conversion. It offers direction for incorporating real-time perception, GNSS-LiDAR-camera integration, and fully ADS-equipped autonomous vehicle operations, thus contributing to the advancement of robust autonomous vehicle technologies.
comment: This updated version includes further implementation details and experimental validation. Accepted for presentation at The 22nd International Conference on Automation Technology (AUTOMATION 2025), Taipei, Taiwan, November 2025
♻ ☆ Panoramic Out-of-Distribution Segmentation for Autonomous Driving
Panoramic imaging enables capturing 360{\deg} images with an ultra-wide Field-of-View (FoV) for dense omnidirectional perception, which is critical to applications, such as autonomous driving and augmented reality, etc. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to background clutter and pixel distortions. To address these issues, we introduce a new task, Panoramic Out-of-distribution Segmentation (PanOoS), with the aim of achieving comprehensive and safe scene understanding. Furthermore, we propose the first solution, POS, which adapts to the characteristics of panoramic images through text-guided prompt distribution learning. Specifically, POS integrates a disentanglement strategy designed to materialize the cross-domain generalization capability of CLIP. The proposed Prompt-based Restoration Attention (PRA) optimizes semantic decoding by prompt guidance and self-adaptive correction, while Bilevel Prompt Distribution Learning (BPDL) refines the manifold of per-pixel mask embeddings via semantic prototype supervision. Besides, to compensate for the scarcity of PanOoS datasets, we establish two benchmarks: DenseOoS, which features diverse outliers in complex environments, and QuadOoS, captured by a quadruped robot with a panoramic annular lens system. Extensive experiments demonstrate superior performance of POS, with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS, outperforming state-of-the-art pinhole-OoS methods. Moreover, POS achieves leading closed-set segmentation capabilities and advances the development of panoramic understanding. Code and datasets will be available at https://github.com/MengfeiD/PanOoS.
comment: Code and datasets will be available at https://github.com/MengfeiD/PanOoS
♻ ☆ Sim2Real Diffusion: Leveraging Foundation Vision Language Models for Adaptive Automated Driving
Simulation-based design, optimization, and validation of autonomous vehicles have proven to be crucial for their improvement over the years. Nevertheless, the ultimate measure of effectiveness is their successful transition from simulation to reality (sim2real). However, existing sim2real transfer methods struggle to address the autonomy-oriented requirements of balancing: (i) conditioned domain adaptation, (ii) robust performance with limited examples, (iii) modularity in handling multiple domain representations, and (iv) real-time performance. To alleviate these pain points, we present a unified framework for learning cross-domain adaptive representations through conditional latent diffusion for sim2real transferable automated driving. Our framework offers options to leverage: (i) alternate foundation models, (ii) a few-shot fine-tuning pipeline, and (iii) textual as well as image prompts for mapping across given source and target domains. It is also capable of generating diverse high-quality samples when diffusing across parameter spaces such as times of day, weather conditions, seasons, and operational design domains. We systematically analyze the presented framework and report our findings in terms of performance benchmarks and ablation studies. Additionally, we demonstrate its serviceability for autonomous driving using behavioral cloning case studies. Our experiments indicate that the proposed framework is capable of bridging the perceptual sim2real gap by over 40%.
comment: Accepted in IEEE Robotics and Automation Letters (RA-L)
♻ ☆ Faster Model Predictive Control via Self-Supervised Initialization Learning
Model Predictive Control (MPC) is widely used in robot control by optimizing a sequence of control outputs over a finite-horizon. Computational approaches for MPC include deterministic methods (e.g., iLQR and COBYLA), as well as sampling-based methods (e.g., MPPI and CEM). However, complex system dynamics and non-convex or non-differentiable cost terms often lead to prohibitive optimization times that limit real-world deployment. Prior efforts to accelerate MPC have limitations on: (i) reusing previous solutions fails under sharp state changes and (ii) pure imitation learning does not target compute efficiency directly and suffers from suboptimality in the training data. To address these, We propose a warm-start framework that learns a policy to generate high-quality initial guesses for MPC solver. The policy is first trained via behavior cloning from expert MPC rollouts and then fine-tuned online with reinforcement learning to directly minimize MPC optimization time. We empirically validate that our approach improves both deterministic and sampling-based MPC methods, achieving up to 21.6% faster optimization and 34.1% more tracking accuracy for deterministic MPC in Formula 1 track path-tracking domain, and improving safety by 100%, path efficiency by 12.8%, and steering smoothness by 7.2% for sampling-based MPC in obstacle-rich navigation domain. These results demonstrate that our framework not only accelerates MPC but also improves overall control performance. Furthermore, it can be applied to a broader range of control algorithms that benefit from good initial guesses.
♻ ☆ LPAC: Learnable Perception-Action-Communication Loops with Applications to Coverage Control
Coverage control is the problem of navigating a robot swarm to collaboratively monitor features or a phenomenon of interest not known a priori. The problem is challenging in decentralized settings with robots that have limited communication and sensing capabilities. We propose a learnable Perception-Action-Communication (LPAC) architecture for the problem, wherein a convolutional neural network (CNN) processes localized perception; a graph neural network (GNN) facilitates robot communications; finally, a shallow multi-layer perceptron (MLP) computes robot actions. The GNN enables collaboration in the robot swarm by computing what information to communicate with nearby robots and how to incorporate received information. Evaluations show that the LPAC models -- trained using imitation learning -- outperform standard decentralized and centralized coverage control algorithms. The learned policy generalizes to environments different from the training dataset, transfers to larger environments with more robots, and is robust to noisy position estimates. The results indicate the suitability of LPAC architectures for decentralized navigation in robot swarms to achieve collaborative behavior.
comment: 20 Pages, 20 figures,
♻ ☆ ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning NeurIPS 2025
We propose ReinFlow, a simple yet effective online reinforcement learning (RL) framework that fine-tunes a family of flow matching policies for continuous robotic control. Derived from rigorous RL theory, ReinFlow injects learnable noise into a flow policy's deterministic path, converting the flow into a discrete-time Markov Process for exact and straightforward likelihood computation. This conversion facilitates exploration and ensures training stability, enabling ReinFlow to fine-tune diverse flow model variants, including Rectified Flow [35] and Shortcut Models [19], particularly at very few or even one denoising step. We benchmark ReinFlow in representative locomotion and manipulation tasks, including long-horizon planning with visual input and sparse reward. The episode reward of Rectified Flow policies obtained an average net growth of 135.36% after fine-tuning in challenging legged locomotion tasks while saving denoising steps and 82.63% of wall time compared to state-of-the-art diffusion RL fine-tuning method DPPO [43]. The success rate of the Shortcut Model policies in state and visual manipulation tasks achieved an average net increase of 40.34% after fine-tuning with ReinFlow at four or even one denoising step, whose performance is comparable to fine-tuned DDIM policies while saving computation time for an average of 23.20%. Project webpage: https://reinflow.github.io/
comment: NeurIPS 2025
♻ ☆ Curvature-Aware Calibration of Tactile Sensors for Accurate Force Estimation on Non-Planar Surfaces
Flexible tactile sensors are increasingly used in real-world applications such as robotic grippers, prosthetic hands, wearable gloves, and assistive devices, where they need to conform to curved and irregular surfaces. However, most existing tactile sensors are calibrated only on flat substrates, and their accuracy and consistency degrade once mounted on curved geometries. This limitation restricts their reliability in practical use. To address this challenge, we develop a calibration model for a widely used resistive tactile sensor design that enables accurate force estimation on one-dimensional curved surfaces. We then train a neural network (a multilayer perceptron) to predict local curvature from baseline sensor outputs recorded under no applied load, achieving an R2 score of 0.91. The proposed approach is validated on five daily objects with varying curvatures under forces from 2 N to 8 N. Results show that the curvature-aware calibration maintains consistent force accuracy across all surfaces, while flat-surface calibration underestimates force as curvature increases. Our results demonstrate that curvature-aware modeling improves the accuracy, consistency, and reliability of flexible tactile sensors, enabling dependable performance across real-world applications.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ ReactEMG: Zero-Shot, Low-Latency Intent Detection via sEMG
Surface electromyography (sEMG) signals show promise for effective human-computer interfaces, particularly in rehabilitation and prosthetics. However, challenges remain in developing systems that respond quickly and reliably to user intent, across different subjects and without requiring time-consuming calibration. In this work, we propose a framework for EMG-based intent detection that addresses these challenges. Unlike traditional gesture recognition models that wait until a gesture is completed before classifying it, our approach uses a segmentation strategy to assign intent labels at every timestep as the gesture unfolds. We introduce a novel masked modeling strategy that aligns muscle activations with their corresponding user intents, enabling rapid onset detection and stable tracking of ongoing gestures. In evaluations against baseline methods, considering both accuracy and stability for device control, our approach surpasses state-of-the-art performance in zero-shot transfer conditions, demonstrating its potential for wearable robotics and next-generation prosthetic systems. Our project page is available at: https://reactemg.github.io
♻ ☆ Vision-Based Online Key Point Estimation of Deformable Robots
The precise control of soft and continuum robots requires knowledge of their shape, which has, in contrast to classical rigid robots, infinite degrees of freedom. To partially reconstruct the shape, proprioceptive techniques use built-in sensors resulting in inaccurate results and increased fabrication complexity. Exteroceptive methods so far rely on expensive tracking systems with reflective markers placed on all components, which are infeasible for deformable robots interacting with the environment due to marker occlusion and damage. Here, a regression approach is presented for 3D key point estimation using a convolutional neural network. The proposed approach takes advantage of data-driven supervised learning and is capable of online marker-less estimation during inference. Two images of a robotic system are taken simultaneously at 25 Hz from two different perspectives, and are fed to the network, which returns for each pair the parameterized key point or PCC shape representations. The proposed approach outperforms marker-less state-of-the-art methods by a maximum of 4.5% in estimation accuracy while at the same time being more robust and requiring no prior knowledge of the shape. Online evaluations on two types of soft robotic arms and a soft robotic fish demonstrate our method's accuracy and versatility on highly deformable systems.
Computer Vision and Pattern Recognition 114
☆ LifWavNet: Lifting Wavelet-based Network for Non-contact ECG Reconstruction from Radar
Non-contact electrocardiogram (ECG) reconstruction from radar signals offers a promising approach for unobtrusive cardiac monitoring. We present LifWavNet, a lifting wavelet network based on a multi-resolution analysis and synthesis (MRAS) model for radar-to-ECG reconstruction. Unlike prior models that use fixed wavelet approaches, LifWavNet employs learnable lifting wavelets with lifting and inverse lifting units to adaptively capture radar signal features and synthesize physiologically meaningful ECG waveforms. To improve reconstruction fidelity, we introduce a multi-resolution short-time Fourier transform (STFT) loss, that enforces consistency with the ground-truth ECG in both temporal and spectral domains. Evaluations on two public datasets demonstrate that LifWavNet outperforms state-of-the-art methods in ECG reconstruction and downstream vital sign estimation (heart rate and heart rate variability). Furthermore, intermediate feature visualization highlights the interpretability of multi-resolution decomposition and synthesis in radar-to-ECG reconstruction. These results establish LifWavNet as a robust framework for radar-based non-contact ECG measurement.
☆ Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Distribution Matching Distillation (DMD) distills score-based generative models into efficient one-step generators, without requiring a one-to-one correspondence with the sampling trajectories of their teachers. However, limited model capacity causes one-step distilled models underperform on complex generative tasks, e.g., synthesizing intricate object motions in text-to-video generation. Directly extending DMD to multi-step distillation increases memory usage and computational depth, leading to instability and reduced efficiency. While prior works propose stochastic gradient truncation as a potential solution, we observe that it substantially reduces the generation diversity of multi-step distilled models, bringing it down to the level of their one-step counterparts. To address these limitations, we propose Phased DMD, a multi-step distillation framework that bridges the idea of phase-wise distillation with Mixture-of-Experts (MoE), reducing learning difficulty while enhancing model capacity. Phased DMD is built upon two key ideas: progressive distribution matching and score matching within subintervals. First, our model divides the SNR range into subintervals, progressively refining the model to higher SNR levels, to better capture complex distributions. Next, to ensure the training objective within each subinterval is accurate, we have conducted rigorous mathematical derivations. We validate Phased DMD by distilling state-of-the-art image and video generation models, including Qwen-Image (20B parameters) and Wan2.2 (28B parameters). Experimental results demonstrate that Phased DMD preserves output diversity better than DMD while retaining key generative capabilities. We will release our code and models.
☆ PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
Recent advances in vision-language models (VLMs) have enabled impressive multimodal reasoning, yet most medical applications remain limited to 2D imaging. In this work, we extend VLMs to 3D positron emission tomography and computed tomography (PET/CT), a domain characterized by large volumetric data, small and dispersed lesions, and lengthy radiology reports. We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams, extracted via a hybrid rule-based and large language model (LLM) pipeline. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. PETAR bridges global contextual reasoning with fine-grained lesion awareness, producing clinically coherent and localized findings. Comprehensive automated and human evaluations demonstrate that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.
☆ Dark-Field X-Ray Imaging Significantly Improves Deep-Learning based Detection of Synthetic Early-Stage Lung Tumors in Preclinical Models
Low-dose computed tomography (LDCT) is the current standard for lung cancer screening, yet its adoption and accessibility remain limited. Many regions lack LDCT infrastructure, and even among those screened, early-stage cancer detection often yield false positives, as shown in the National Lung Screening Trial (NLST) with a sensitivity of 93.8 percent and a false-positive rate of 26.6 percent. We aim to investigate whether X-ray dark-field imaging (DFI) radiograph, a technique sensitive to small-angle scatter from alveolar microstructure and less susceptible to organ shadowing, can significantly improve early-stage lung tumor detection when coupled with deep-learning segmentation. Using paired attenuation (ATTN) and DFI radiograph images of euthanized mouse lungs, we generated realistic synthetic tumors with irregular boundaries and intensity profiles consistent with physical lung contrast. A U-Net segmentation network was trained on small patches using either ATTN, DFI, or a combination of ATTN and DFI channels.Results show that the DFI-only model achieved a true-positive detection rate of 83.7 percent, compared with 51 percent for ATTN-only, while maintaining comparable specificity (90.5 versus 92.9 percent). The combined ATTN and DFI input achieved 79.6 percent sensitivity and 97.6 percent specificity. In conclusion, DFI substantially improves early-tumor detectability in comparison to standard attenuation radiography and shows potential as an accessible, low-cost, low-dose alternative for pre-clinical or limited-resource screening where LDCT is unavailable.
☆ Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes
Person re-identification (ReID) in surveillance is challenged by occlusion, viewpoint distortion, and poor image quality. Most existing methods rely on complex modules or perform well only on clear frontal images. We propose Sh-ViT (Shuffling Vision Transformer), a lightweight and robust model for occluded person ReID. Built on ViT-Base, Sh-ViT introduces three components: First, a Shuffle module in the final Transformer layer to break spatial correlations and enhance robustness to occlusion and blur; Second, scenario-adapted augmentation (geometric transforms, erasing, blur, and color adjustment) to simulate surveillance conditions; Third, DeiT-based knowledge distillation to improve learning with limited labels.To support real-world evaluation, we construct the MyTT dataset, containing over 10,000 pedestrians and 30,000+ images from base station inspections, with frequent equipment occlusion and camera variations. Experiments show that Sh-ViT achieves 83.2% Rank-1 and 80.1% mAP on MyTT, outperforming CNN and ViT baselines, and 94.6% Rank-1 and 87.5% mAP on Market1501, surpassing state-of-the-art methods.In summary, Sh-ViT improves robustness to occlusion and blur without external modules, offering a practical solution for surveillance-based personnel monitoring.
comment: 12 pages,conference
☆ Deep learning denoising unlocks quantitative insights in operando materials microscopy
Operando microscopy provides direct insight into the dynamic chemical and physical processes that govern functional materials, yet measurement noise limits the effective resolution and undermines quantitative analysis. Here, we present a general framework for integrating unsupervised deep learning-based denoising into quantitative microscopy workflows across modalities and length scales. Using simulated data, we demonstrate that deep denoising preserves physical fidelity, introduces minimal bias, and reduces uncertainty in model learning with partial differential equation (PDE)-constrained optimization. Applied to experiments, denoising reveals nanoscale chemical and structural heterogeneity in scanning transmission X-ray microscopy (STXM) of lithium iron phosphate (LFP), enables automated particle segmentation and phase classification in optical microscopy of graphite electrodes, and reduces noise-induced variability by nearly 80% in neutron radiography to resolve heterogeneous lithium transport. Collectively, these results establish deep denoising as a powerful, modality-agnostic enhancement that advances quantitative operando imaging and extends the reach of previously noise-limited techniques.
☆ Imbalanced Classification through the Lens of Spurious Correlations
Class imbalance poses a fundamental challenge in machine learning, frequently leading to unreliable classification performance. While prior methods focus on data- or loss-reweighting schemes, we view imbalance as a data condition that amplifies Clever Hans (CH) effects by underspecification of minority classes. In a counterfactual explanations-based approach, we propose to leverage Explainable AI to jointly identify and eliminate CH effects emerging under imbalance. Our method achieves competitive classification performance on three datasets and demonstrates how CH effects emerge under imbalance, a perspective largely overlooked by existing approaches.
☆ Gaussian Combined Distance: A Generic Metric for Object Detection
In object detection, a well-defined similarity metric can significantly enhance model performance. Currently, the IoU-based similarity metric is the most commonly preferred choice for detectors. However, detectors using IoU as a similarity metric often perform poorly when detecting small objects because of their sensitivity to minor positional deviations. To address this issue, recent studies have proposed the Wasserstein Distance as an alternative to IoU for measuring the similarity of Gaussian-distributed bounding boxes. However, we have observed that the Wasserstein Distance lacks scale invariance, which negatively impacts the model's generalization capability. Additionally, when used as a loss function, its independent optimization of the center attributes leads to slow model convergence and unsatisfactory detection precision. To address these challenges, we introduce the Gaussian Combined Distance (GCD). Through analytical examination of GCD and its gradient, we demonstrate that GCD not only possesses scale invariance but also facilitates joint optimization, which enhances model localization performance. Extensive experiments on the AI-TOD-v2 dataset for tiny object detection show that GCD, as a bounding box regression loss function and label assignment metric, achieves state-of-the-art performance across various detectors. We further validated the generalizability of GCD on the MS-COCO-2017 and Visdrone-2019 datasets, where it outperforms the Wasserstein Distance across diverse scales of datasets. Code is available at https://github.com/MArKkwanGuan/mmdet-GCD.
comment: This paper is accepted by the GRSL in 2025
☆ NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception NeurIPS 2025
Collaborative perception improves task performance by expanding the perception range through information sharing among agents. . Immutable heterogeneity poses a significant challenge in collaborative perception, as participating agents may employ different and fixed perception models. This leads to domain gaps in the intermediate features shared among agents, consequently degrading collaborative performance. Aligning the features of all agents to a common representation can eliminate domain gaps with low training cost. However, in existing methods, the common representation is designated as the representation of a specific agent, making it difficult for agents with significant domain discrepancies from this specific agent to achieve proper alignment. This paper proposes NegoCollab, a heterogeneous collaboration method based on the negotiated common representation. It introduces a negotiator during training to derive the common representation from the local representations of each modality's agent, effectively reducing the inherent domain gap with the various local representations. In NegoCollab, the mutual transformation of features between the local representation space and the common representation space is achieved by a pair of sender and receiver. To better align local representations to the common representation containing multimodal information, we introduce structural alignment loss and pragmatic alignment loss in addition to the distribution alignment loss to supervise the training. This enables the knowledge in the common representation to be fully distilled into the sender.
comment: 19 pages, Accepted by NeurIPS 2025
☆ VessShape: Few-shot 2D blood vessel segmentation by leveraging shape priors from synthetic images
Semantic segmentation of blood vessels is an important task in medical image analysis, but its progress is often hindered by the scarcity of large annotated datasets and the poor generalization of models across different imaging modalities. A key aspect is the tendency of Convolutional Neural Networks (CNNs) to learn texture-based features, which limits their performance when applied to new domains with different visual characteristics. We hypothesize that leveraging geometric priors of vessel shapes, such as their tubular and branching nature, can lead to more robust and data-efficient models. To investigate this, we introduce VessShape, a methodology for generating large-scale 2D synthetic datasets designed to instill a shape bias in segmentation models. VessShape images contain procedurally generated tubular geometries combined with a wide variety of foreground and background textures, encouraging models to learn shape cues rather than textures. We demonstrate that a model pre-trained on VessShape images achieves strong few-shot segmentation performance on two real-world datasets from different domains, requiring only four to ten samples for fine-tuning. Furthermore, the model exhibits notable zero-shot capabilities, effectively segmenting vessels in unseen domains without any target-specific training. Our results indicate that pre-training with a strong shape bias can be an effective strategy to overcome data scarcity and improve model generalization in blood vessel segmentation.
☆ Sketch-to-Layout: Sketch-Guided Multimodal Layout Generation ICCV 2025
Graphic layout generation is a growing research area focusing on generating aesthetically pleasing layouts ranging from poster designs to documents. While recent research has explored ways to incorporate user constraints to guide the layout generation, these constraints often require complex specifications which reduce usability. We introduce an innovative approach exploiting user-provided sketches as intuitive constraints and we demonstrate empirically the effectiveness of this new guidance method, establishing the sketch-to-layout problem as a promising research direction, which is currently under-explored. To tackle the sketch-to-layout problem, we propose a multimodal transformer-based solution using the sketch and the content assets as inputs to produce high quality layouts. Since collecting sketch training data from human annotators to train our model is very costly, we introduce a novel and efficient method to synthetically generate training sketches at scale. We train and evaluate our model on three publicly available datasets: PubLayNet, DocLayNet and SlidesVQA, demonstrating that it outperforms state-of-the-art constraint-based methods, while offering a more intuitive design experience. In order to facilitate future sketch-to-layout research, we release O(200k) synthetically-generated sketches for the public datasets above. The datasets are available at https://github.com/google-deepmind/sketch_to_layout.
comment: 15 pages, 18 figures, GitHub link: https://github.com/google-deepmind/sketch_to_layout, accept at ICCV 2025 Workshop (HiGen)
☆ Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning
Multimodal large language models (MLLMs) have advanced embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs. However, such vision driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy. We introduce BEAT, the first framework to inject such visual backdoors into MLLM-based embodied agents using objects in the environments as triggers. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation. Across various embodied agent benchmarks and MLLMs, BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance, and generalizes reliably to out-of-distribution trigger placements. Notably, compared to naive SFT, CTL boosts backdoor activation accuracy up to 39% under limited backdoor data. These findings expose a critical yet unexplored security risk in MLLM-based embodied agents, underscoring the need for robust defenses before real-world deployment.
☆ Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Based on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, where action and vision tokens evolve asynchronously at different rates. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, while our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.
comment: 20 pages, 10 figures
☆ Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
comment: preprint
☆ Who Made This? Fake Detection and Source Attribution with Diffusion Features
The rapid progress of generative diffusion models has enabled the creation of synthetic images that are increasingly difficult to distinguish from real ones, raising concerns about authenticity, copyright, and misinformation. Existing supervised detectors often struggle to generalize across unseen generators, requiring extensive labeled data and frequent retraining. We introduce FRIDA (Fake-image Recognition and source Identification via Diffusion-features Analysis), a lightweight framework that leverages internal activations from a pre-trained diffusion model for deepfake detection and source generator attribution. A k-nearest-neighbor classifier applied to diffusion features achieves state-of-the-art cross-generator performance without fine-tuning, while a compact neural model enables accurate source attribution. These results show that diffusion representations inherently encode generator-specific patterns, providing a simple and interpretable foundation for synthetic image forensics.
☆ ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning
Neural networks have changed the way machines interpret the world. At their core, they learn by following gradients, adjusting their parameters step by step until they identify the most discriminant patterns in the data. This process gives them their strength, yet it also opens the door to a hidden flaw. The very gradients that help a model learn can also be used to produce small, imperceptible tweaks that cause the model to completely alter its decision. Such tweaks are called adversarial attacks. These attacks exploit this vulnerability by adding tiny, imperceptible changes to images that, while leaving them identical to the human eye, cause the model to make wrong predictions. In this work, we propose Adversarially-trained Contrastive Hard-mining for Optimized Robustness (ANCHOR), a framework that leverages the power of supervised contrastive learning with explicit hard positive mining to enable the model to learn representations for images such that the embeddings for the images, their augmentations, and their perturbed versions cluster together in the embedding space along with those for other images of the same class while being separated from images of other classes. This alignment helps the model focus on stable, meaningful patterns rather than fragile gradient cues. On CIFAR-10, our approach achieves impressive results for both clean and robust accuracy under PGD-20 (epsilon = 0.031), outperforming standard adversarial training methods. Our results indicate that combining adversarial guidance with hard-mined contrastive supervision helps models learn more structured and robust representations, narrowing the gap between accuracy and robustness.
comment: 11 pages, 1 figure
☆ Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it particularly well-for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
☆ Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.
☆ MapSAM2: Adapting SAM2 for Automatic Segmentation of Historical Map Images and Time Series
Historical maps are unique and valuable archives that document geographic features across different time periods. However, automated analysis of historical map images remains a significant challenge due to their wide stylistic variability and the scarcity of annotated training data. Constructing linked spatio-temporal datasets from historical map time series is even more time-consuming and labor-intensive, as it requires synthesizing information from multiple maps. Such datasets are essential for applications such as dating buildings, analyzing the development of road networks and settlements, studying environmental changes etc. We present MapSAM2, a unified framework for automatically segmenting both historical map images and time series. Built on a visual foundation model, MapSAM2 adapts to diverse segmentation tasks with few-shot fine-tuning. Our key innovation is to treat both historical map images and time series as videos. For images, we process a set of tiles as a video, enabling the memory attention mechanism to incorporate contextual cues from similar tiles, leading to improved geometric accuracy, particularly for areal features. For time series, we introduce the annotated Siegfried Building Time Series Dataset and, to reduce annotation costs, propose generating pseudo time series from single-year maps by simulating common temporal transformations. Experimental results show that MapSAM2 learns temporal associations effectively and can accurately segment and link buildings in time series under limited supervision or using pseudo videos. We will release both our dataset and code to support future research.
☆ Deep Neural Watermarking for Robust Copyright Protection in 3D Point Clouds
The protection of intellectual property has become critical due to the rapid growth of three-dimensional content in digital media. Unlike traditional images or videos, 3D point clouds present unique challenges for copyright enforcement, as they are especially vulnerable to a range of geometric and non-geometric attacks that can easily degrade or remove conventional watermark signals. In this paper, we address these challenges by proposing a robust deep neural watermarking framework for 3D point cloud copyright protection and ownership verification. Our approach embeds binary watermarks into the singular values of 3D point cloud blocks using spectral decomposition, i.e. Singular Value Decomposition (SVD), and leverages the extraction capabilities of Deep Learning using PointNet++ neural network architecture. The network is trained to reliably extract watermarks even after the data undergoes various attacks such as rotation, scaling, noise, cropping and signal distortions. We validated our method using the publicly available ModelNet40 dataset, demonstrating that deep learning-based extraction significantly outperforms traditional SVD-based techniques under challenging conditions. Our experimental evaluation demonstrates that the deep learning-based extraction approach significantly outperforms existing SVD-based methods with deep learning achieving bitwise accuracy up to 0.83 and Intersection over Union (IoU) of 0.80, compared to SVD achieving a bitwise accuracy of 0.58 and IoU of 0.26 for the Crop (70%) attack, which is the most severe geometric distortion in our experiment. This demonstrates our method's ability to achieve superior watermark recovery and maintain high fidelity even under severe distortions.
☆ Context-Gated Cross-Modal Perception with Visual Mamba for PET-CT Lung Tumor Segmentation
Accurate lung tumor segmentation is vital for improving diagnosis and treatment planning, and effectively combining anatomical and functional information from PET and CT remains a major challenge. In this study, we propose vMambaX, a lightweight multimodal framework integrating PET and CT scan images through a Context-Gated Cross-Modal Perception Module (CGM). Built on the Visual Mamba architecture, vMambaX adaptively enhances inter-modality feature interaction, emphasizing informative regions while suppressing noise. Evaluated on the PCLT20K dataset, the model outperforms baseline models while maintaining lower computational complexity. These results highlight the effectiveness of adaptive cross-modal gating for multimodal tumor segmentation and demonstrate the potential of vMambaX as an efficient and scalable framework for advanced lung cancer analysis. The code is available at https://github.com/arco-group/vMambaX.
☆ NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding NeurIPS 2025
Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, etc. We study the underwater scene understanding methods, which aim to achieve automated underwater exploration. The underwater scene understanding task demands multi-task perceptions from multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders the progress of this research. To bridge this gap, we construct NautData, a dataset containing 1.45 M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of the underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into renowned baselines LLaVA-1.5 and Qwen2.5-VL and build our underwater LMM, NAUTILUS. Experiments conducted on the NautData and public underwater datasets demonstrate the effectiveness of the VFE module, consistently improving the performance of both baselines on the majority of supported tasks, thus ensuring the superiority of NAUTILUS in the underwater scene understanding area. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS.
comment: Accepted to NeurIPS 2025. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS
☆ Referee: Reference-aware Audiovisual Deepfake Detection
Since deepfakes generated by advanced generative models have rapidly posed serious threats, existing audiovisual deepfake detection approaches struggle to generalize to unseen forgeries. We propose a novel reference-aware audiovisual deepfake detection method, called Referee. Speaker-specific cues from only one-shot examples are leveraged to detect manipulations beyond spatiotemporal artifacts. By matching and aligning identity-related queries from reference and target content into cross-modal features, Referee jointly reasons about audiovisual synchrony and identity consistency. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art performance on cross-dataset and cross-language evaluation protocols. Experimental results highlight the importance of cross-modal identity verification for future deepfake detection. The code is available at https://github.com/ewha-mmai/referee.
comment: In Progress
☆ A Multi-tiered Human-in-the-loop Approach for Interactive School Mapping Using Earth Observation and Machine Learning
This paper presents a multi-tiered human-in-the-loop framework for interactive school mapping designed to improve the accuracy and completeness of educational facility records, particularly in developing regions where such data may be scarce and infrequently updated. The first tier involves a machine learning based analysis of population density, land cover, and existing infrastructure compared with known school locations. The first tier identifies potential gaps and "mislabelled" schools. In subsequent tiers, medium-resolution satellite imagery (Sentinel-2) is investigated to pinpoint regions with a high likelihood of school presence, followed by the application of very high-resolution (VHR) imagery and deep learning models to generate detailed candidate locations for schools within these prioritised areas. The medium-resolution approach was later removed due to insignificant improvements. The medium and VHR resolution models build upon global pre-trained steps to improve generalisation. A key component of the proposed approach is an interactive interface to allow human operators to iteratively review, validate, and refine the mapping results. Preliminary evaluations indicate that the multi-tiered strategy provides a scalable and cost-effective solution for educational infrastructure mapping to support planning and resource allocation.
☆ From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration
Scientific illustrations demand both high information density and post-editability. However, current generative models have two major limitations: Frist, image generation models output rasterized images lacking semantic structure, making it impossible to access, edit, or rearrange independent visual components in the images. Second, code-based generation methods (TikZ or SVG), although providing element-level control, force users into the cumbersome cycle of "writing-compiling-reviewing" and lack the intuitiveness of manipulation. Neither of these two approaches can well meet the needs for efficiency, intuitiveness, and iterative modification in scientific creation. To bridge this gap, we introduce VisPainter, a multi-agent framework for scientific illustration built upon the model context protocol. VisPainter orchestrates three specialized modules-a Manager, a Designer, and a Toolbox-to collaboratively produce diagrams compatible with standard vector graphics software. This modular, role-based design allows each element to be explicitly represented and manipulated, enabling true element-level control and any element can be added and modified later. To systematically evaluate the quality of scientific illustrations, we introduce VisBench, a benchmark with seven-dimensional evaluation metrics. It assesses high-information-density scientific illustrations from four aspects: content, layout, visual perception, and interaction cost. To this end, we conducted extensive ablation experiments to verify the rationality of our architecture and the reliability of our evaluation methods. Finally, we evaluated various vision-language models, presenting fair and credible model rankings along with detailed comparisons of their respective capabilities. Additionally, we isolated and quantified the impacts of role division, step control,and description on the quality of illustrations.
☆ CoMViT: An Efficient Vision Backbone for Supervised Classification in Medical Imaging MICCAI 2025
Vision Transformers (ViTs) have demonstrated strong potential in medical imaging; however, their high computational demands and tendency to overfit on small datasets limit their applicability in real-world clinical scenarios. In this paper, we present CoMViT, a compact and generalizable Vision Transformer architecture optimized for resource-constrained medical image analysis. CoMViT integrates a convolutional tokenizer, diagonal masking, dynamic temperature scaling, and pooling-based sequence aggregation to improve performance and generalization. Through systematic architectural optimization, CoMViT achieves robust performance across twelve MedMNIST datasets while maintaining a lightweight design with only ~4.5M parameters. It matches or outperforms deeper CNN and ViT variants, offering up to 5-20x parameter reduction without sacrificing accuracy. Qualitative Grad-CAM analyses show that CoMViT consistently attends to clinically relevant regions despite its compact size. These results highlight the potential of principled ViT redesign for developing efficient and interpretable models in low-resource medical imaging settings.
comment: Preprint (submitted manuscript). Accepted at the MICCAI 2025 MIRASOL Workshop; to appear in the Springer proceedings volume. This is the pre-review version (not the Version of Record). DOI will be added after publication. [Optional: 8 pages, 4 figures, 4 tables.]
☆ DeblurSDI: Blind Image Deblurring Using Self-diffusion
Blind image deconvolution is a challenging ill-posed inverse problem, where both the latent sharp image and the blur kernel are unknown. Traditional methods often rely on handcrafted priors, while modern deep learning approaches typically require extensive pre-training on large external datasets, limiting their adaptability to real-world scenarios. In this work, we propose DeblurSDI, a zero-shot, self-supervised framework based on self-diffusion (SDI) that requires no prior training. DeblurSDI formulates blind deconvolution as an iterative reverse self-diffusion process that starts from pure noise and progressively refines the solution. At each step, two randomly-initialized neural networks are optimized continuously to refine the sharp image and the blur kernel. The optimization is guided by an objective function combining data consistency with a sparsity-promoting L1-norm for the kernel. A key innovation is our noise scheduling mechanism, which stabilizes the optimization and provides remarkable robustness to variations in blur kernel size. These allow DeblurSDI to dynamically learn an instance-specific prior tailored to the input image. Extensive experiments demonstrate that DeblurSDI consistently achieves superior performance, recovering sharp images and accurate kernels even in highly degraded scenarios.
☆ Mitigating Semantic Collapse in Partially Relevant Video Retrieval NeurIPS 2025
Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all others as negatives, ignoring the rich semantic variation both within a single video and across different videos. Consequently, embeddings of both queries and their corresponding video-clip segments for distinct events within the same video collapse together, while embeddings of semantically similar queries and segments from different videos are driven apart. This limits retrieval performance when videos contain multiple, diverse events. This paper addresses the aforementioned problems, termed as semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships encoded by the foundation model across text queries. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. Subsequently, we introduce order-preserving token merging and adaptive CBVA to enhance alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.
comment: Accpeted to NeurIPS 2025. Code is available at https://github.com/admins97/MSC_PRVR
☆ Who Does Your Algorithm Fail? Investigating Age and Ethnic Bias in the MAMA-MIA Dataset NeurIPS
Deep learning models aim to improve diagnostic workflows, but fairness evaluation remains underexplored beyond classification, e.g., in image segmentation. Unaddressed segmentation bias can lead to disparities in the quality of care for certain populations, potentially compounded across clinical decision points and amplified through iterative model development. Here, we audit the fairness of the automated segmentation labels provided in the breast cancer tumor segmentation dataset MAMA-MIA. We evaluate automated segmentation quality across age, ethnicity, and data source. Our analysis reveals an intrinsic age-related bias against younger patients that continues to persist even after controlling for confounding factors, such as data source. We hypothesize that this bias may be linked to physiological factors, a known challenge for both radiologists and automated systems. Finally, we show how aggregating data from multiple data sources influences site-specific ethnic biases, underscoring the necessity of investigating data at a granular level.
comment: Medical Imaging Meets EurIPS (NeurIPS-endorsed workshop) - MedEurIPS
☆ A Hybrid Deep Learning and Forensic Approach for Robust Deepfake Detection ACSA
The rapid evolution of generative adversarial networks (GANs) and diffusion models has made synthetic media increasingly realistic, raising societal concerns around misinformation, identity fraud, and digital trust. Existing deepfake detection methods either rely on deep learning, which suffers from poor generalization and vulnerability to distortions, or forensic analysis, which is interpretable but limited against new manipulation techniques. This study proposes a hybrid framework that fuses forensic features, including noise residuals, JPEG compression traces, and frequency-domain descriptors, with deep learning representations from convolutional neural networks (CNNs) and vision transformers (ViTs). Evaluated on benchmark datasets (FaceForensics++, Celeb-DF v2, DFDC), the proposed model consistently outperformed single-method baselines and demonstrated superior performance compared to existing state-of-the-art hybrid approaches, achieving F1-scores of 0.96, 0.82, and 0.77, respectively. Robustness tests demonstrated stable performance under compression (F1 = 0.87 at QF = 50), adversarial perturbations (AUC = 0.84), and unseen manipulations (F1 = 0.79). Importantly, explainability analysis showed that Grad-CAM and forensic heatmaps overlapped with ground-truth manipulated regions in 82 percent of cases, enhancing transparency and user trust. These findings confirm that hybrid approaches provide a balanced solution, combining the adaptability of deep models with the interpretability of forensic cues, to develop resilient and trustworthy deepfake detection systems.
comment: 11 pages, 13 figures, 9 tables, Published with International Journal of Advanced Computer Science and Applications (IJACSA)
☆ Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds
Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
☆ Fine-Tuning Open Video Generators for Cinematic Scene Synthesis: A Small-Data Pipeline with LoRA and Wan2.1 I2V
We present a practical pipeline for fine-tuning open-source video diffusion transformers to synthesize cinematic scenes for television and film production from small datasets. The proposed two-stage process decouples visual style learning from motion generation. In the first stage, Low-Rank Adaptation (LoRA) modules are integrated into the cross-attention layers of the Wan2.1 I2V-14B model to adapt its visual representations using a compact dataset of short clips from Ay Yapim's historical television film El Turco. This enables efficient domain transfer within hours on a single GPU. In the second stage, the fine-tuned model produces stylistically consistent keyframes that preserve costume, lighting, and color grading, which are then temporally expanded into coherent 720p sequences through the model's video decoder. We further apply lightweight parallelization and sequence partitioning strategies to accelerate inference without quality degradation. Quantitative and qualitative evaluations using FVD, CLIP-SIM, and LPIPS metrics, supported by a small expert user study, demonstrate measurable improvements in cinematic fidelity and temporal stability over the base model. The complete training and inference pipeline is released to support reproducibility and adaptation across cinematic domains.
comment: video generation, image-to-video, dif- fusion transformer, LoRA, fine-tuning, cinematic scene synthesis, multi-GPU inference, fully sharded data parallelism, computational efficiency
☆ FPS: Feedforward-based Parameter Selection For Efficient Fine-Tuning
Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key strategy for adapting large-scale pre-trained models to downstream tasks, but existing approaches face notable limitations. Addition-based methods, such as Adapters [1], introduce inference latency and engineering complexity, while selection-based methods like Gradient-based Parameter Selection (GPS) [2] require a full backward pass, which results in the same peak memory usage as full fine-tuning. To address this dilemma, we propose Feedforward-based Parameter Selection (FPS), a gradient-free method that identifies an optimal parameter subset in a single forward pass. FPS ranks parameters by the product of their magnitudes and corresponding input activations, leveraging both pre-trained knowledge and downstream data. Evaluated on $24$ visual tasks from FGVC and VTAB-1k, FPS achieves performance comparable to state-of-the-art methods while reducing peak memory usage by nearly $9 \times$ and accelerating parameter selection by about $2 \times$, offering a genuinely memory-efficient and practical solution for fine-tuning large-scale pre-trained models.
☆ RzenEmbed: Towards Comprehensive Multimodal Retrieval
The rapid advancement of Multimodal Large Language Models (MLLMs) has extended CLIP-based frameworks to produce powerful, universal embeddings for retrieval tasks. However, existing methods primarily focus on natural images, offering limited support for other crucial visual modalities such as videos and visual documents. To bridge this gap, we introduce RzenEmbed, a unified framework to learn embeddings across a diverse set of modalities, including text, images, videos, and visual documents. We employ a novel two-stage training strategy to learn discriminative representations. The first stage focuses on foundational text and multimodal retrieval. In the second stage, we introduce an improved InfoNCE loss, incorporating two key enhancements. Firstly, a hardness-weighted mechanism guides the model to prioritize challenging samples by assigning them higher weights within each batch. Secondly, we implement an approach to mitigate the impact of false negatives and alleviate data noise. This strategy not only enhances the model's discriminative power but also improves its instruction-following capabilities. We further boost performance with learnable temperature parameter and model souping. RzenEmbed sets a new state-of-the-art on the MMEB benchmark. It not only achieves the best overall score but also outperforms all prior work on the challenging video and visual document retrieval tasks. Our models are available in https://huggingface.co/qihoo360/RzenEmbed.
☆ Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing
Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs), which involves very high computational complexity and training cost. To address this issue, we propose a new method, called \textbf{C}omplex \textbf{I}mage \textbf{E}diting via \textbf{L}LM \textbf{R}easoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions, eliminating the need for jointly fine-tuning the large language models and diffusion models. Specifically, we first construct a structured semantic representation of the input image using foundation models. Then, we introduce an iterative update mechanism that can progressively refine this representation, obtaining a fine-grained visual representation of the image scene. This allows us to perform complex and flexible image editing tasks. Extensive experiments on the SmartEdit Reasoning Scenario Set show that our method surpasses the previous state-of-the-art by 9.955 dB in PSNR, indicating its superior preservation of regions that should remain consistent. Due to the limited number of samples of public datasets of complex image editing with reasoning, we construct a benchmark named CIEBench, containing 86 image samples, together with a metric specifically for reasoning-based image editing. CIELR also outperforms previous methods on this benchmark. The code and dataset are available at \href{https://github.com/Jia-shao/Reasoning-Editing}{https://github.com/Jia-shao/Reasoning-Editing}.
☆ MeisenMeister: A Simple Two Stage Pipeline for Breast Cancer Classification on MRI MICCAI 2025
The ODELIA Breast MRI Challenge 2025 addresses a critical issue in breast cancer screening: improving early detection through more efficient and accurate interpretation of breast MRI scans. Even though methods for general-purpose whole-body lesion segmentation as well as multi-time-point analysis exist, breast cancer detection remains highly challenging, largely due to the limited availability of high-quality segmentation labels. Therefore, developing robust classification-based approaches is crucial for the future of early breast cancer detection, particularly in applications such as large-scale screening. In this write-up, we provide a comprehensive overview of our approach to the challenge. We begin by detailing the underlying concept and foundational assumptions that guided our work. We then describe the iterative development process, highlighting the key stages of experimentation, evaluation, and refinement that shaped the evolution of our solution. Finally, we present the reasoning and evidence that informed the design choices behind our final submission, with a focus on performance, robustness, and clinical relevance. We release our full implementation publicly at https://github.com/MIC-DKFZ/MeisenMeister
comment: Winning Solution of the MICCAI 2025 ODELIA Breast MRI Classification Challenge
☆ Generative Semantic Coding for Ultra-Low Bitrate Visual Communication and Analysis
We consider the problem of ultra-low bit rate visual communication for remote vision analysis, human interactions and control in challenging scenarios with very low communication bandwidth, such as deep space exploration, battlefield intelligence, and robot navigation in complex environments. In this paper, we ask the following important question: can we accurately reconstruct the visual scene using only a very small portion of the bit rate in existing coding methods while not sacrificing the accuracy of vision analysis and performance of human interactions? Existing text-to-image generation models offer a new approach for ultra-low bitrate image description. However, they can only achieve a semantic-level approximation of the visual scene, which is far insufficient for the purpose of visual communication and remote vision analysis and human interactions. To address this important issue, we propose to seamlessly integrate image generation with deep image compression, using joint text and coding latent to guide the rectified flow models for precise generation of the visual scene. The semantic text description and coding latent are both encoded and transmitted to the decoder at a very small bit rate. Experimental results demonstrate that our method can achieve the same image reconstruction quality and vision analysis accuracy as existing methods while using much less bandwidth. The code will be released upon paper acceptance.
☆ SAGS: Self-Adaptive Alias-Free Gaussian Splatting for Dynamic Surgical Endoscopic Reconstruction
Surgical reconstruction of dynamic tissues from endoscopic videos is a crucial technology in robot-assisted surgery. The development of Neural Radiance Fields (NeRFs) has greatly advanced deformable tissue reconstruction, achieving high-quality results from video and image sequences. However, reconstructing deformable endoscopic scenes remains challenging due to aliasing and artifacts caused by tissue movement, which can significantly degrade visualization quality. The introduction of 3D Gaussian Splatting (3DGS) has improved reconstruction efficiency by enabling a faster rendering pipeline. Nevertheless, existing 3DGS methods often prioritize rendering speed while neglecting these critical issues. To address these challenges, we propose SAGS, a self-adaptive alias-free Gaussian splatting framework. We introduce an attention-driven, dynamically weighted 4D deformation decoder, leveraging 3D smoothing filters and 2D Mip filters to mitigate artifacts in deformable tissue reconstruction and better capture the fine details of tissue movement. Experimental results on two public benchmarks, EndoNeRF and SCARED, demonstrate that our method achieves superior performance in all metrics of PSNR, SSIM, and LPIPS compared to the state of the art while also delivering better visualization quality.
☆ Overcoming Prompts Pool Confusion via Parameterized Prompt for Incremental Object Detection
Recent studies have demonstrated that incorporating trainable prompts into pretrained models enables effective incremental learning. However, the application of prompts in incremental object detection (IOD) remains underexplored. Existing prompts pool based approaches assume disjoint class sets across incremental tasks, which are unsuitable for IOD as they overlook the inherent co-occurrence phenomenon in detection images. In co-occurring scenarios, unlabeled objects from previous tasks may appear in current task images, leading to confusion in prompts pool. In this paper, we hold that prompt structures should exhibit adaptive consolidation properties across tasks, with constrained updates to prevent catastrophic forgetting. Motivated by this, we introduce Parameterized Prompts for Incremental Object Detection (P$^2$IOD). Leveraging neural networks global evolution properties, P$^2$IOD employs networks as the parameterized prompts to adaptively consolidate knowledge across tasks. To constrain prompts structure updates, P$^2$IOD further engages a parameterized prompts fusion strategy. Extensive experiments on PASCAL VOC2007 and MS COCO datasets demonstrate that P$^2$IOD's effectiveness in IOD and achieves the state-of-the-art performance among existing baselines.
☆ CASR-Net: An Image Processing-focused Deep Learning-based Coronary Artery Segmentation and Refinement Network for X-ray Coronary Angiogram
Early detection of coronary artery disease (CAD) is critical for reducing mortality and improving patient treatment planning. While angiographic image analysis from X-rays is a common and cost-effective method for identifying cardiac abnormalities, including stenotic coronary arteries, poor image quality can significantly impede clinical diagnosis. We present the Coronary Artery Segmentation and Refinement Network (CASR-Net), a three-stage pipeline comprising image preprocessing, segmentation, and refinement. A novel multichannel preprocessing strategy combining CLAHE and an improved Ben Graham method provides incremental gains, increasing Dice Score Coefficient (DSC) by 0.31-0.89% and Intersection over Union (IoU) by 0.40-1.16% compared with using the techniques individually. The core innovation is a segmentation network built on a UNet with a DenseNet121 encoder and a Self-organized Operational Neural Network (Self-ONN) based decoder, which preserves the continuity of narrow and stenotic vessel branches. A final contour refinement module further suppresses false positives. Evaluated with 5-fold cross-validation on a combination of two public datasets that contain both healthy and stenotic arteries, CASR-Net outperformed several state-of-the-art models, achieving an IoU of 61.43%, a DSC of 76.10%, and clDice of 79.36%. These results highlight a robust approach to automated coronary artery segmentation, offering a valuable tool to support clinicians in diagnosis and treatment planning.
☆ A fragile zero-watermarking method based on dual quaternion matrix decomposition
Medical images play a crucial role in assisting diagnosis, remote consultation, and academic research. However, during the transmission and sharing process, they face serious risks of copyright ownership and content tampering. Therefore, protecting medical images is of great importance. As an effective means of image copyright protection, zero-watermarking technology focuses on constructing watermarks without modifying the original carrier by extracting its stable features, which provides an ideal approach for protecting medical images. This paper aims to propose a fragile zero-watermarking model based on dual quaternion matrix decomposition, which utilizes the operational relationship between the standard part and the dual part of dual quaternions to correlate the original carrier image with the watermark image, and generates zero-watermarking information based on the characteristics of dual quaternion matrix decomposition, ultimately achieving copyright protection and content tampering detection for medical images.
comment: 18 pages, 6 figures, 3 tables
☆ Versatile and Efficient Medical Image Super-Resolution Via Frequency-Gated Mamba
Medical image super-resolution (SR) is essential for enhancing diagnostic accuracy while reducing acquisition cost and scanning time. However, modeling both long-range anatomical structures and fine-grained frequency details with low computational overhead remains challenging. We propose FGMamba, a novel frequency-aware gated state-space model that unifies global dependency modeling and fine-detail enhancement into a lightweight architecture. Our method introduces two key innovations: a Gated Attention-enhanced State-Space Module (GASM) that integrates efficient state-space modeling with dual-branch spatial and channel attention, and a Pyramid Frequency Fusion Module (PFFM) that captures high-frequency details across multiple resolutions via FFT-guided fusion. Extensive evaluations across five medical imaging modalities (Ultrasound, OCT, MRI, CT, and Endoscopic) demonstrate that FGMamba achieves superior PSNR/SSIM while maintaining a compact parameter footprint ($<$0.75M), outperforming CNN-based and Transformer-based SOTAs. Our results validate the effectiveness of frequency-aware state-space modeling for scalable and accurate medical image enhancement.
☆ Rethinking Robust Adversarial Concept Erasure in Diffusion Models
Concept erasure aims to selectively unlearning undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. As a novel paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, they will disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which grace leveraging semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.
☆ FOCUS: Efficient Keyframe Selection for Long Video Understanding
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.
☆ HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration
Autonomous Graphical User Interface (GUI) agents rely on accurate GUI grounding, which maps language instructions to on-screen coordinates, to execute user commands. However, current models, whether trained via supervised fine-tuning (SFT) or reinforcement fine-tuning (RFT), lack self-awareness of their capability boundaries, leading to overconfidence and unreliable predictions. We first systematically evaluate probabilistic and verbalized confidence in general and GUI-specific models, revealing a misalignment between confidence and actual accuracy, which is particularly critical in dynamic GUI automation tasks, where single errors can cause task failure. To address this, we propose HyperClick, a novel framework that enhances reliable GUI grounding through uncertainty calibration. HyperClick introduces a dual reward mechanism, combining a binary reward for correct actions with a truncated Gaussian-based spatial confidence modeling, calibrated using the Brier score. This approach jointly optimizes grounding accuracy and confidence reliability, fostering introspective self-criticism. Extensive experiments on seven challenge benchmarks show that HyperClick achieves state-of-the-art performance while providing well-calibrated confidence. By enabling explicit confidence calibration and introspective self-criticism, HyperClick reduces overconfidence and supports more reliable GUI automation.
☆ T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis
In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific characteristics, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T^3), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen-Shannon divergence between the two models' output distributions. T^3 dynamically preserves local precision when models agree and defers to generalist robustness under drift. To overcome the inference costs of sample-wise merging, we further propose a batch-wise extension, T^3_B, that computes a merging coefficient across a batch of samples, dramatically reducing computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruptions across four modalities. Empirically, T^3 sets new state-of-the-art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive MVLM deployment in clinical settings. Our code is available at https://github.com/Razaimam45/TCube.
comment: Main: 11 pages, Supplementary: 9 pages 10 tables, 10 figures
☆ RegionRAG: Region-level Retrieval-Augumented Generation for Visually-Rich Documents
Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods consider the entire document as the basic retrieval unit, introducing substantial irrelevant visual content in two ways: 1) Relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) Retrieving multiple documents to increase recall further introduces redundant and irrelevant documents. These redundant contexts distract the model's attention and further degrade the performance. To address this challenge, we propose \modelname, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy from both labeled data and unlabeled data to pinpoint relevant patches. During inference, we propose a dynamic pipeline that intelligently groups salient patches into complete semantic regions. By delegating the task of identifying relevant regions to the retriever, \modelname enables the generator to focus solely on concise visual content relevant to queries, improving both efficiency and accuracy. Experiments on six benchmarks demonstrate that RegionRAG achieves state-of-the-art performance. Improves retrieval accuracy by 10.02\% in R@1 on average and increases question answering accuracy by 3.56\% while using only 71.42\% visual tokens compared to prior methods. The code will be available at https://github.com/Aeryn666/RegionRAG.
☆ Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large-language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model's adaptability and effectiveness across various downstream tasks.
☆ C-LEAD: Contrastive Learning for Enhanced Adversarial Defense
Deep neural networks (DNNs) have achieved remarkable success in computer vision tasks such as image classification, segmentation, and object detection. However, they are vulnerable to adversarial attacks, which can cause incorrect predictions with small perturbations in input images. Addressing this issue is crucial for deploying robust deep-learning systems. This paper presents a novel approach that utilizes contrastive learning for adversarial defense, a previously unexplored area. Our method leverages the contrastive loss function to enhance the robustness of classification models by training them with both clean and adversarially perturbed images. By optimizing the model's parameters alongside the perturbations, our approach enables the network to learn robust representations that are less susceptible to adversarial attacks. Experimental results show significant improvements in the model's robustness against various types of adversarial perturbations. This suggests that contrastive loss helps extract more informative and resilient features, contributing to the field of adversarial robustness in deep learning.
☆ Trans-defense: Transformer-based Denoiser for Adversarial Defense with Spatial-Frequency Domain Representation
In recent times, deep neural networks (DNNs) have been successfully adopted for various applications. Despite their notable achievements, it has become evident that DNNs are vulnerable to sophisticated adversarial attacks, restricting their applications in security-critical systems. In this paper, we present two-phase training methods to tackle the attack: first, training the denoising network, and second, the deep classifier model. We propose a novel denoising strategy that integrates both spatial and frequency domain approaches to defend against adversarial attacks on images. Our analysis reveals that high-frequency components of attacked images are more severely corrupted compared to their lower-frequency counterparts. To address this, we leverage Discrete Wavelet Transform (DWT) for frequency analysis and develop a denoising network that combines spatial image features with wavelets through a transformer layer. Next, we retrain the classifier using the denoised images, which enhances the classifier's robustness against adversarial attacks. Experimental results across the MNIST, CIFAR-10, and Fashion-MNIST datasets reveal that the proposed method remarkably elevates classification accuracy, substantially exceeding the performance by utilizing a denoising network and adversarial training approaches. The code is available at https://github.com/Mayank94/Trans-Defense.
☆ Fusion of Heterogeneous Pathology Foundation Models for Whole Slide Image Analysis
Whole slide image (WSI) analysis has emerged as an increasingly essential technique in computational pathology. Recent advances in the pathological foundation models (FMs) have demonstrated significant advantages in deriving meaningful patch-level or slide-level feature representations from WSIs. However, current pathological FMs have exhibited substantial heterogeneity caused by diverse private training datasets and different network architectures. This heterogeneity introduces performance variability when we utilize the extracted features from different FMs in the downstream tasks. To fully explore the advantage of multiple FMs effectively, in this work, we propose a novel framework for the fusion of heterogeneous pathological FMs, called FuseCPath, yielding a model with a superior ensemble performance. The main contributions of our framework can be summarized as follows: (i) To guarantee the representativeness of the training patches, we propose a multi-view clustering-based method to filter out the discriminative patches via multiple FMs' embeddings. (ii) To effectively fuse the heterogeneous patch-level FMs, we devise a cluster-level re-embedding strategy to online capture patch-level local features. (iii) To effectively fuse the heterogeneous slide-level FMs, we devise a collaborative distillation strategy to explore the connections between slide-level FMs. Extensive experiments conducted on lung cancer, bladder cancer, and colorectal cancer datasets from The Cancer Genome Atlas (TCGA) have demonstrated that the proposed FuseCPath achieves state-of-the-art performance across multiple tasks on these public datasets.
comment: 22 pages, 9 figures
☆ Object-IR: Leveraging Object Consistency and Mesh Deformation for Self-Supervised Image Retargeting
Eliminating geometric distortion in semantically important regions remains an intractable challenge in image retargeting. This paper presents Object-IR, a self-supervised architecture that reformulates image retargeting as a learning-based mesh warping optimization problem, where the mesh deformation is guided by object appearance consistency and geometric-preserving constraints. Given an input image and a target aspect ratio, we initialize a uniform rigid mesh at the output resolution and use a convolutional neural network to predict the motion of each mesh grid and obtain the deformed mesh. The retargeted result is generated by warping the input image according to the rigid mesh in the input image and the deformed mesh in the output resolution. To mitigate geometric distortion, we design a comprehensive objective function incorporating a) object-consistent loss to ensure that the important semantic objects retain their appearance, b) geometric-preserving loss to constrain simple scale transform of the important meshes, and c) boundary loss to enforce a clean rectangular output. Notably, our self-supervised paradigm eliminates the need for manually annotated retargeting datasets by deriving supervision directly from the input's geometric and semantic properties. Extensive evaluations on the RetargetMe benchmark demonstrate that our Object-IR achieves state-of-the-art performance, outperforming existing methods in quantitative metrics and subjective visual quality assessments. The framework efficiently processes arbitrary input resolutions (average inference time: 0.009s for 1024x683 resolution) while maintaining real-time performance on consumer-grade GPUs. The source code will soon be available at https://github.com/tlliao/Object-IR.
comment: Publish in Pattern Recognition
☆ MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts
Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks. In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations. However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary data aspects and enhance both scalability and adaptability. Aiming to improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation. In addition, it integrates dense semantic features with globally aligned 3D backbone representations for high-fidelity surface normal prediction. MoRE is further optimized with tailored loss functions to ensure robust learning across diverse inputs and multiple geometric tasks. Extensive experiments demonstrate that MoRE achieves state-of-the-art performance across multiple benchmarks and supports effective downstream applications without extra computation.
comment: Project Page: https://g-1nonly.github.io/MoRE_Website/, Code: https://github.com/alibaba/Taobao3D
☆ Mask-to-Height: A YOLOv11-Based Architecture for Joint Building Instance Segmentation and Height Classification from Satellite Imagery
Accurate building instance segmentation and height classification are critical for urban planning, 3D city modeling, and infrastructure monitoring. This paper presents a detailed analysis of YOLOv11, the recent advancement in the YOLO series of deep learning models, focusing on its application to joint building extraction and discrete height classification from satellite imagery. YOLOv11 builds on the strengths of earlier YOLO models by introducing a more efficient architecture that better combines features at different scales, improves object localization accuracy, and enhances performance in complex urban scenes. Using the DFC2023 Track 2 dataset -- which includes over 125,000 annotated buildings across 12 cities -- we evaluate YOLOv11's performance using metrics such as precision, recall, F1 score, and mean average precision (mAP). Our findings demonstrate that YOLOv11 achieves strong instance segmentation performance with 60.4\% mAP@50 and 38.3\% mAP@50--95 while maintaining robust classification accuracy across five predefined height tiers. The model excels in handling occlusions, complex building shapes, and class imbalance, particularly for rare high-rise structures. Comparative analysis confirms that YOLOv11 outperforms earlier multitask frameworks in both detection accuracy and inference speed, making it well-suited for real-time, large-scale urban mapping. This research highlights YOLOv11's potential to advance semantic urban reconstruction through streamlined categorical height modeling, offering actionable insights for future developments in remote sensing and geospatial intelligence.
☆ Soft Task-Aware Routing of Experts for Equivariant Representation Learning NeurIPS 2025
Equivariant representation learning aims to capture variations induced by input transformations in the representation space, whereas invariant representation learning encodes semantic information by disregarding such transformations. Recent studies have shown that jointly learning both types of representations is often beneficial for downstream tasks, typically by employing separate projection heads. However, this design overlooks information shared between invariant and equivariant learning, which leads to redundant feature learning and inefficient use of model capacity. To address this, we introduce Soft Task-Aware Routing (STAR), a routing strategy for projection heads that models them as experts. STAR induces the experts to specialize in capturing either shared or task-specific information, thereby reducing redundant feature learning. We validate this effect by observing lower canonical correlations between invariant and equivariant embeddings. Experimental results show consistent improvements across diverse transfer learning tasks. The code is available at https://github.com/YonseiML/star.
comment: NeurIPS 2025
☆ SpecAware: A Spectral-Content Aware Foundation Model for Unifying Multi-Sensor Learning in Hyperspectral Remote Sensing Mapping
Hyperspectral imaging (HSI) is a vital tool for fine-grained land-use and land-cover (LULC) mapping. However, the inherent heterogeneity of HSI data has long posed a major barrier to developing generalized models via joint training. Although HSI foundation models have shown promise for different downstream tasks, the existing approaches typically overlook the critical guiding role of sensor meta-attributes, and struggle with multi-sensor training, limiting their transferability. To address these challenges, we propose SpecAware, which is a novel hyperspectral spectral-content aware foundation model for unifying multi-sensor learning for HSI mapping. We also constructed the Hyper-400K dataset to facilitate this research, which is a new large-scale, high-quality benchmark dataset with over 400k image patches from diverse airborne AVIRIS sensors. The core of SpecAware is a two-step hypernetwork-driven encoding process for HSI data. Firstly, we designed a meta-content aware module to generate a unique conditional input for each HSI patch, tailored to each spectral band of every sample by fusing the sensor meta-attributes and its own image content. Secondly, we designed the HyperEmbedding module, where a sample-conditioned hypernetwork dynamically generates a pair of matrix factors for channel-wise encoding, consisting of adaptive spatial pattern extraction and latent semantic feature re-projection. Thus, SpecAware gains the ability to perceive and interpret spatial-spectral features across diverse scenes and sensors. This, in turn, allows SpecAware to adaptively process a variable number of spectral channels, establishing a unified framework for joint pre-training. Extensive experiments on six datasets demonstrate that SpecAware can learn superior feature representations, excelling in land-cover semantic segmentation classification, change detection, and scene classification.
☆ Privacy-Aware Continual Self-Supervised Learning on Multi-Window Chest Computed Tomography for Domain-Shift Robustness
We propose a novel continual self-supervised learning (CSSL) framework for simultaneously learning diverse features from multi-window-obtained chest computed tomography (CT) images and ensuring data privacy. Achieving a robust and highly generalizable model in medical image diagnosis is challenging, mainly because of issues, such as the scarcity of large-scale, accurately annotated datasets and domain shifts inherent to dynamic healthcare environments. Specifically, in chest CT, these domain shifts often arise from differences in window settings, which are optimized for distinct clinical purposes. Previous CSSL frameworks often mitigated domain shift by reusing past data, a typically impractical approach owing to privacy constraints. Our approach addresses these challenges by effectively capturing the relationship between previously learned knowledge and new information across different training stages through continual pretraining on unlabeled images. Specifically, by incorporating a latent replay-based mechanism into CSSL, our method mitigates catastrophic forgetting due to domain shifts during continual pretraining while ensuring data privacy. Additionally, we introduce a feature distillation technique that integrates Wasserstein distance-based knowledge distillation (WKD) and batch-knowledge ensemble (BKE), enhancing the ability of the model to learn meaningful, domain-shift-robust representations. Finally, we validate our approach using chest CT images obtained across two different window settings, demonstrating superior performance compared with other approaches.
☆ GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation NeurIPS 2025
While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization. The structured reasoning component generates coherent Chain-of-Thought analyses combining progress estimation and decision reasoning, which inform both immediate action predictions and compact history summaries for future steps. Based on this framework, we train a GUI agent, \textbf{GUI-Rise}, through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO). This framework employs specialized rewards, including a history-aware objective, directly linking summary quality to subsequent action performance. Comprehensive evaluations on standard benchmarks demonstrate state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios. These findings validate our framework's ability to maintain robust reasoning and generalization across diverse GUI navigation tasks. Code is available at https://leon022.github.io/GUI-Rise.
comment: Published in NeurIPS 2025
☆ Multi-Modal Feature Fusion for Spatial Morphology Analysis of Traditional Villages via Hierarchical Graph Neural Networks
Villages areas hold significant importance in the study of human-land relationships. However, with the advancement of urbanization, the gradual disappearance of spatial characteristics and the homogenization of landscapes have emerged as prominent issues. Existing studies primarily adopt a single-disciplinary perspective to analyze villages spatial morphology and its influencing factors, relying heavily on qualitative analysis methods. These efforts are often constrained by the lack of digital infrastructure and insufficient data. To address the current research limitations, this paper proposes a Hierarchical Graph Neural Network (HGNN) model that integrates multi-source data to conduct an in-depth analysis of villages spatial morphology. The framework includes two types of nodes-input nodes and communication nodes-and two types of edges-static input edges and dynamic communication edges. By combining Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), the proposed model efficiently integrates multimodal features under a two-stage feature update mechanism. Additionally, based on existing principles for classifying villages spatial morphology, the paper introduces a relational pooling mechanism and implements a joint training strategy across 17 subtypes. Experimental results demonstrate that this method achieves significant performance improvements over existing approaches in multimodal fusion and classification tasks. Additionally, the proposed joint optimization of all sub-types lifts mean accuracy/F1 from 0.71/0.83 (independent models) to 0.82/0.90, driven by a 6% gain for parcel tasks. Our method provides scientific evidence for exploring villages spatial patterns and generative logic.
☆ Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions
As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Consequently, their capabilities in this crucial domain are mostly unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video, text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.
☆ Sparse Model Inversion: Efficient Inversion of Vision Transformers for Data-Free Applications
Model inversion, which aims to reconstruct the original training data from pre-trained discriminative models, is especially useful when the original training data is unavailable due to privacy, usage rights, or size constraints. However, existing dense inversion methods attempt to reconstruct the entire image area, making them extremely inefficient when inverting high-resolution images from large-scale Vision Transformers (ViTs). We further identify two underlying causes of this inefficiency: the redundant inversion of noisy backgrounds and the unintended inversion of spurious correlations--a phenomenon we term "hallucination" in model inversion. To address these limitations, we propose a novel sparse model inversion strategy, as a plug-and-play extension to speed up existing dense inversion methods with no need for modifying their original loss functions. Specifically, we selectively invert semantic foregrounds while stopping the inversion of noisy backgrounds and potential spurious correlations. Through both theoretical and empirical studies, we validate the efficacy of our approach in achieving significant inversion acceleration (up to 3.79 faster) while maintaining comparable or even enhanced downstream performance in data-free model quantization and data-free knowledge transfer. Code is available at https://github.com/Egg-Hu/SMI.
☆ Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization
Cross-view geo-localization (CVGL) between drone and satellite imagery remains challenging due to severe viewpoint gaps and the presence of hard negatives, which are visually similar but geographically mismatched samples. Existing mining or reweighting strategies often use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence. In this paper, we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy. At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. At the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a training-progress signal to attenuate noisy gradients during early optimization and progressively enhance hard-negative mining as training matures. Experiments on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness and robustness of the proposed DPHR, achieving consistent improvements over state-of-the-art methods.
comment: 5 pages, 3 figures
☆ SilhouetteTell: Practical Video Identification Leveraging Blurred Recordings of Video Subtitles
Video identification attacks pose a significant privacy threat that can reveal videos that victims watch, which may disclose their hobbies, religious beliefs, political leanings, sexual orientation, and health status. Also, video watching history can be used for user profiling or advertising and may result in cyberbullying, discrimination, or blackmail. Existing extensive video inference techniques usually depend on analyzing network traffic generated by streaming online videos. In this work, we observe that the content of a subtitle determines its silhouette displayed on the screen, and identifying each subtitle silhouette also derives the temporal difference between two consecutive subtitles. We then propose SilhouetteTell, a novel video identification attack that combines the spatial and time domain information into a spatiotemporal feature of subtitle silhouettes. SilhouetteTell explores the spatiotemporal correlation between recorded subtitle silhouettes of a video and its subtitle file. It can infer both online and offline videos. Comprehensive experiments on off-the-shelf smartphones confirm the high efficacy of SilhouetteTell for inferring video titles and clips under various settings, including from a distance of up to 40 meters.
comment: 16 pages, 29 figures. Accepted at 26th Privacy Enhancing Technologies Symposium (PETS 2026)
☆ H2-Cache: A Novel Hierarchical Dual-Stage Cache for High-Performance Acceleration of Generative Diffusion Models
Diffusion models have emerged as state-of-the-art in image generation, but their practical deployment is hindered by the significant computational cost of their iterative denoising process. While existing caching techniques can accelerate inference, they often create a challenging trade-off between speed and fidelity, suffering from quality degradation and high computational overhead. To address these limitations, we introduce H2-Cache, a novel hierarchical caching mechanism designed for modern generative diffusion model architectures. Our method is founded on the key insight that the denoising process can be functionally separated into a structure-defining stage and a detail-refining stage. H2-cache leverages this by employing a dual-threshold system, using independent thresholds to selectively cache each stage. To ensure the efficiency of our dual-check approach, we introduce pooled feature summarization (PFS), a lightweight technique for robust and fast similarity estimation. Extensive experiments on the Flux architecture demonstrate that H2-cache achieves significant acceleration (up to 5.08x) while maintaining image quality nearly identical to the baseline, quantitatively and qualitatively outperforming existing caching methods. Our work presents a robust and practical solution that effectively resolves the speed-quality dilemma, significantly lowering the barrier for the real-world application of high-fidelity diffusion models. Source code is available at https://github.com/Bluear7878/H2-cache-A-Hierarchical-Dual-Stage-Cache.
☆ DANCER: Dance ANimation via Condition Enhancement and Rendering with diffusion model
Recently, diffusion models have shown their impressive ability in visual generation tasks. Besides static images, more and more research attentions have been drawn to the generation of realistic videos. The video generation not only has a higher requirement for the quality, but also brings a challenge in ensuring the video continuity. Among all the video generation tasks, human-involved contents, such as human dancing, are even more difficult to generate due to the high degrees of freedom associated with human motions. In this paper, we propose a novel framework, named as DANCER (Dance ANimation via Condition Enhancement and Rendering with Diffusion Model), for realistic single-person dance synthesis based on the most recent stable video diffusion model. As the video generation is generally guided by a reference image and a video sequence, we introduce two important modules into our framework to fully benefit from the two inputs. More specifically, we design an Appearance Enhancement Module (AEM) to focus more on the details of the reference image during the generation, and extend the motion guidance through a Pose Rendering Module (PRM) to capture pose conditions from extra domains. To further improve the generation capability of our model, we also collect a large amount of video data from Internet, and generate a novel datasetTikTok-3K to enhance the model training. The effectiveness of the proposed model has been evaluated through extensive experiments on real-world datasets, where the performance of our model is superior to that of the state-of-the-art methods. All the data and codes will be released upon acceptance.
☆ M^3Detection: Multi-Frame Multi-Level Feature Fusion for Multi-Modal 3D Object Detection with Camera and 4D Imaging Radar
Recent advances in 4D imaging radar have enabled robust perception in adverse weather, while camera sensors provide dense semantic information. Fusing the these complementary modalities has great potential for cost-effective 3D perception. However, most existing camera-radar fusion methods are limited to single-frame inputs, capturing only a partial view of the scene. The incomplete scene information, compounded by image degradation and 4D radar sparsity, hinders overall detection performance. In contrast, multi-frame fusion offers richer spatiotemporal information but faces two challenges: achieving robust and effective object feature fusion across frames and modalities, and mitigating the computational cost of redundant feature extraction. Consequently, we propose M^3Detection, a unified multi-frame 3D object detection framework that performs multi-level feature fusion on multi-modal data from camera and 4D imaging radar. Our framework leverages intermediate features from the baseline detector and employs the tracker to produce reference trajectories, improving computational efficiency and providing richer information for second-stage. In the second stage, we design a global-level inter-object feature aggregation module guided by radar information to align global features across candidate proposals and a local-level inter-grid feature aggregation module that expands local features along the reference trajectories to enhance fine-grained object representation. The aggregated features are then processed by a trajectory-level multi-frame spatiotemporal reasoning module to encode cross-frame interactions and enhance temporal representation. Extensive experiments on the VoD and TJ4DRadSet datasets demonstrate that M^3Detection achieves state-of-the-art 3D detection performance, validating its effectiveness in multi-frame detection with camera-4D imaging radar fusion.
comment: 16 pages, 9 figures
☆ Generating Accurate and Detailed Captions for High-Resolution Images
Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution images to these dimensions may result in the loss of visual details and the omission of important objects. To address this limitation, we propose a novel pipeline that integrates vision-language models, large language models (LLMs), and object detection systems to enhance caption quality. Our proposed pipeline refines captions through a novel, multi-stage process. Given a high-resolution image, an initial caption is first generated using a VLM, and key objects in the image are then identified by an LLM. The LLM predicts additional objects likely to co-occur with the identified key objects, and these predictions are verified by object detection systems. Newly detected objects not mentioned in the initial caption undergo focused, region-specific captioning to ensure they are incorporated. This process enriches caption detail while reducing hallucinations by removing references to undetected objects. We evaluate the enhanced captions using pairwise comparison and quantitative scoring from large multimodal models, along with a benchmark for hallucination detection. Experiments on a curated dataset of high-resolution images demonstrate that our pipeline produces more detailed and reliable image captions while effectively minimizing hallucinations.
comment: Work conducted in 2024; released for archival purposes
☆ How Close Are We? Limitations and Progress of AI Models in Banff Lesion Scoring
The Banff Classification provides the global standard for evaluating renal transplant biopsies, yet its semi-quantitative nature, complex criteria, and inter-observer variability present significant challenges for computational replication. In this study, we explore the feasibility of approximating Banff lesion scores using existing deep learning models through a modular, rule-based framework. We decompose each Banff indicator - such as glomerulitis (g), peritubular capillaritis (ptc), and intimal arteritis (v) - into its constituent structural and inflammatory components, and assess whether current segmentation and detection tools can support their computation. Model outputs are mapped to Banff scores using heuristic rules aligned with expert guidelines, and evaluated against expert-annotated ground truths. Our findings highlight both partial successes and critical failure modes, including structural omission, hallucination, and detection ambiguity. Even when final scores match expert annotations, inconsistencies in intermediate representations often undermine interpretability. These results reveal the limitations of current AI pipelines in replicating computational expert-level grading, and emphasize the importance of modular evaluation and computational Banff grading standard in guiding future model development for transplant pathology.
☆ AFM-Net: Advanced Fusing Hierarchical CNN Visual Priors with Global Sequence Modeling for Remote Sensing Image Scene Classification
Remote sensing image scene classification remains a challenging task, primarily due to the complex spatial structures and multi-scale characteristics of ground objects. Existing approaches see CNNs excel at modeling local textures, while Transformers excel at capturing global context. However, efficiently integrating them remains a bottleneck due to the high computational cost of Transformers. To tackle this, we propose AFM-Net, a novel Advanced Hierarchical Fusing framework that achieves effective local and global co-representation through two pathways: a CNN branch for extracting hierarchical visual priors, and a Mamba branch for efficient global sequence modeling. The core innovation of AFM-Net lies in its Hierarchical Fusion Mechanism, which progressively aggregates multi-scale features from both pathways, enabling dynamic cross-level feature interaction and contextual reconstruction to produce highly discriminative representations. These fused features are then adaptively routed through a Mixture-of-Experts classifier module, which dispatches them to the most suitable experts for fine-grained scene recognition. Experiments on AID, NWPU-RESISC45, and UC Merced show that AFM-Net obtains 93.72, 95.54, and 96.92 percent accuracy, surpassing state-of-the-art methods with balanced performance and efficiency. Code is available at https://github.com/tangyuanhao-qhu/AFM-Net.
☆ HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition
Three-dimensional scene generation holds significant potential in gaming, film, and virtual reality. However, most existing methods adopt a single-step generation process, making it difficult to balance scene complexity with minimal user input. Inspired by the human cognitive process in scene modeling, which progresses from global to local, focuses on key elements, and completes the scene through semantic association, we propose HiGS, a hierarchical generative framework for multi-step associative semantic spatial composition. HiGS enables users to iteratively expand scenes by selecting key semantic objects, offering fine-grained control over regions of interest while the model completes peripheral areas automatically. To support structured and coherent generation, we introduce the Progressive Hierarchical Spatial-Semantic Graph (PHiSSG), which dynamically organizes spatial relationships and semantic dependencies across the evolving scene structure. PHiSSG ensures spatial and geometric consistency throughout the generation process by maintaining a one-to-one mapping between graph nodes and generated objects and supporting recursive layout optimization. Experiments demonstrate that HiGS outperforms single-stage methods in layout plausibility, style consistency, and user preference, offering a controllable and extensible paradigm for efficient 3D scene construction.
☆ Improving Cross-view Object Geo-localization: A Dual Attention Approach with Cross-view Interaction and Multi-Scale Spatial Features
Cross-view object geo-localization has recently gained attention due to potential applications. Existing methods aim to capture spatial dependencies of query objects between different views through attention mechanisms to obtain spatial relationship feature maps, which are then used to predict object locations. Although promising, these approaches fail to effectively transfer information between views and do not further refine the spatial relationship feature maps. This results in the model erroneously focusing on irrelevant edge noise, thereby affecting localization performance. To address these limitations, we introduce a Cross-view and Cross-attention Module (CVCAM), which performs multiple iterations of interaction between the two views, enabling continuous exchange and learning of contextual information about the query object from both perspectives. This facilitates a deeper understanding of cross-view relationships while suppressing the edge noise unrelated to the query object. Furthermore, we integrate a Multi-head Spatial Attention Module (MHSAM), which employs convolutional kernels of various sizes to extract multi-scale spatial features from the feature maps containing implicit correspondences, further enhancing the feature representation of the query object. Additionally, given the scarcity of datasets for cross-view object geo-localization, we created a new dataset called G2D for the "Ground-to-Drone" localization task, enriching existing datasets and filling the gap in "Ground-to-Drone" localization task. Extensive experiments on the CVOGL and G2D datasets demonstrate that our proposed method achieves high localization accuracy, surpassing the current state-of-the-art.
☆ E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources
Diffusion models have shown strong capabilities in generating high-quality images from text prompts. However, these models often require large-scale training data and significant computational resources to train, or suffer from heavy structure with high latency. To this end, we propose Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis requiring low training resources. We provide an easily reproducible baseline with competitive results. Our model for 512px generation, trained with only 25M public data in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and easily reaches to 0.72 with some post-training techniques such as GRPO. Our design philosophy centers on token reduction as the computational cost scales significantly with the token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module for further compression of tokens. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost. In addition, we propose AdaLN-affine, an efficient lightweight module for computing modulation parameters in transformer blocks. Our code is available at https://github.com/AMD-AGI/Nitro-E and we hope E-MMDiT serves as a strong and practical baseline for future research and contributes to democratization of generative AI models.
☆ WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond
3D Gaussian splatting (3DGS) and its subsequent variants have led to remarkable progress in simultaneous localization and mapping (SLAM). While most recent 3DGS-based SLAM works focus on small-scale indoor scenes, developing 3DGS-based SLAM methods for large-scale forest scenes holds great potential for many real-world applications, especially for wildfire emergency response and forest management. However, this line of research is impeded by the absence of a comprehensive and high-quality dataset, and collecting such a dataset over real-world scenes is costly and technically infeasible. To this end, we have built a large-scale, comprehensive, and high-quality synthetic dataset for SLAM in wildfire and forest environments. Leveraging the Unreal Engine 5 Electric Dreams Environment Sample Project, we developed a pipeline to easily collect aerial and ground views, including ground-truth camera poses and a range of additional data modalities from unmanned aerial vehicle. Our pipeline also provides flexible controls on environmental factors such as light, weather, and types and conditions of wildfire, supporting the need for various tasks covering forest mapping, wildfire emergency response, and beyond. The resulting pilot dataset, WildfireX-SLAM, contains 5.5k low-altitude RGB-D aerial images from a large-scale forest map with a total size of 16 km2. On top of WildfireX-SLAM, a thorough benchmark is also conducted, which not only reveals the unique challenges of 3DGS-based SLAM in the forest but also highlights potential improvements for future works. The dataset and code will be publicly available. Project page: https://zhicongsun.github.io/wildfirexslam.
comment: This paper has been accepted by MMM 2026
☆ ZEBRA: Towards Zero-Shot Cross-Subject Generalization for Universal Brain Visual Decoding NeurIPS 2025
Recent advances in neural decoding have enabled the reconstruction of visual experiences from brain activity, positioning fMRI-to-image reconstruction as a promising bridge between neuroscience and computer vision. However, current methods predominantly rely on subject-specific models or require subject-specific fine-tuning, limiting their scalability and real-world applicability. In this work, we introduce ZEBRA, the first zero-shot brain visual decoding framework that eliminates the need for subject-specific adaptation. ZEBRA is built on the key insight that fMRI representations can be decomposed into subject-related and semantic-related components. By leveraging adversarial training, our method explicitly disentangles these components to isolate subject-invariant, semantic-specific representations. This disentanglement allows ZEBRA to generalize to unseen subjects without any additional fMRI data or retraining. Extensive experiments show that ZEBRA significantly outperforms zero-shot baselines and achieves performance comparable to fully finetuned models on several metrics. Our work represents a scalable and practical step toward universal neural decoding. Code and model weights are available at: https://github.com/xmed-lab/ZEBRA.
comment: Accepted by NeurIPS 2025
☆ Hierarchical Transformers for Unsupervised 3D Shape Abstraction
We introduce HiT, a novel hierarchical neural field representation for 3D shapes that learns general hierarchies in a coarse-to-fine manner across different shape categories in an unsupervised setting. Our key contribution is a hierarchical transformer (HiT), where each level learns parent-child relationships of the tree hierarchy using a compressed codebook. This codebook enables the network to automatically identify common substructures across potentially diverse shape categories. Unlike previous works that constrain the task to a fixed hierarchical structure (e.g., binary), we impose no such restriction, except for limiting the total number of nodes at each tree level. This flexibility allows our method to infer the hierarchical structure directly from data, over multiple shape categories, and representing more general and complex hierarchies than prior approaches. When trained at scale with a reconstruction loss, our model captures meaningful containment relationships between parent and child nodes. We demonstrate its effectiveness through an unsupervised shape segmentation task over all 55 ShapeNet categories, where our method successfully segments shapes into multiple levels of granularity.
♻ ☆ Intelligent Software System for Low-Cost, Brightfield Segmentation: Algorithmic Implementation for Cytometric Auto-Analysis
Bright-field microscopy, a cost-effective solution for live-cell culture, is often the only resource available, along with standard CPUs, for many low-budget labs. The inherent challenges of bright-field images -- their noisiness, low contrast, and dynamic morphology -- coupled with a lack of GPU resources and complex software interfaces, hinder the desired research output. This article presents a novel microscopy image analysis framework designed for low-budget labs equipped with a standard CPU desktop. The Python-based program enables cytometric analysis of live, unstained cells in culture through an advanced computer vision and machine learning pipeline. Crucially, the framework operates on label-free data, requiring no manually annotated training data or training phase. It is accessible via a user-friendly, cross-platform GUI that requires no programming skills, while also providing a scripting interface for programmatic control and integration by developers. The end-to-end workflow performs semantic and instance segmentation, feature extraction, analysis, evaluation, and automated report generation. Its modular architecture supports easy maintenance and flexible integration while supporting both single-image and batch processing. Validated on several unstained cell types from the public dataset of livecells, the framework demonstrates superior accuracy and reproducibility compared to contemporary tools like Cellpose and StarDist. Its competitive segmentation speed on a CPU-based platform highlights its significant potential for basic research and clinical applications -- particularly in cell transplantation for personalised medicine and muscle regeneration therapies. The access to the application is available for reproducibility
♻ ☆ Conformal Object Detection by Sequential Risk Control
Recent advances in object detectors have led to their adoption for industrial uses. However, their deployment in safety-critical applications is hindered by the inherent lack of reliability of neural networks and the complex structure of object detection models. To address these challenges, we turn to Conformal Prediction, a post-hoc predictive uncertainty quantification procedure with statistical guarantees that are valid for any dataset size, without requiring prior knowledge on the model or data distribution. Our contribution is manifold. First, we formally define the problem of Conformal Object Detection (COD). We introduce a novel method, Sequential Conformal Risk Control (SeqCRC), that extends the statistical guarantees of Conformal Risk Control to two sequential tasks with two parameters, as required in the COD setting. Then, we present old and new loss functions and prediction sets suited to applying SeqCRC to different cases and certification requirements. Finally, we present a conformal toolkit for replication and further exploration of our method. Using this toolkit, we perform extensive experiments that validate our approach and emphasize trade-offs and other practical consequences.
comment: 29 pages, 12 figures
♻ ☆ LV-UNet: A Lightweight and Vanilla Model for Medical Image Segmentation
While large models have achieved significant progress in computer vision, challenges such as optimization complexity, the intricacy of transformer architectures, computational constraints, and practical application demands highlight the importance of simpler model designs in medical image segmentation. This need is particularly pronounced in mobile medical devices, which require lightweight, deployable models with real-time performance. However, existing lightweight models often suffer from poor robustness across datasets, limiting their widespread adoption. To address these challenges, this paper introduces LV-UNet, a lightweight and vanilla model that leverages pre-trained MobileNetv3-Large backbones and incorporates fusible modules. LV-UNet employs an enhanced deep training strategy and switches to a deployment mode during inference by re-parametrization, significantly reducing parameter count and computational overhead. Experimental results on ISIC 2016, BUSI, CVC-ClinicDB, CVC-ColonDB, and Kvair-SEG datasets demonstrate a better trade-off between performance and the computational load. The code will be released at https://github.com/juntaoJianggavin/LV-UNet.
comment: Accepted by IEEE BIBM2024 ML4BMI workshop
♻ ☆ SRAGAN: Saliency Regularized and Attended Generative Adversarial Network for Chinese Ink-wash Painting Style Transfer
Recent style transfer problems are still largely dominated by Generative Adversarial Network (GAN) from the perspective of cross-domain image-to-image (I2I) translation, where the pivotal issue is to learn and transfer target-domain style patterns onto source-domain content images. This paper handles the problem of translating real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Though a wide range of I2I models tackle this problem, a notable challenge is that the content details of the source image could be easily erased or corrupted due to the transfer of ink-wash style elements. To remedy this issue, we propose to incorporate saliency detection into the unpaired I2I framework to regularize image content, where the detected saliency map is utilized from two aspects: (\romannumeral1) we propose saliency IOU (SIOU) loss to explicitly regularize object content structure by enforcing saliency consistency before and after image stylization; (\romannumeral2) we propose saliency adaptive normalization (SANorm) which implicitly enhances object structure integrity of the generated paintings by dynamically injecting image saliency information into the generator to guide stylization process. Besides, we also propose saliency attended discriminator which harnesses image saliency information to focus generative adversarial attention onto the drawn objects, contributing to generating more vivid and delicate brush strokes and ink-wash textures. Extensive qualitative and quantitative experiments demonstrate superiority of our approach over related advanced image stylization methods in both GAN and diffusion model paradigms.
comment: Pattern Recognition, Volume 162, June 2025, 111344
♻ ☆ Augmented Reality-based Guidance with Deformable Registration in Head and Neck Tumor Resection MICCAI 2025
Head and neck squamous cell carcinoma (HNSCC) has one of the highest rates of recurrence cases among solid malignancies. Recurrence rates can be reduced by improving positive margins localization. Frozen section analysis (FSA) of resected specimens is the gold standard for intraoperative margin assessment. However, because of the complex 3D anatomy and the significant shrinkage of resected specimens, accurate margin relocation from specimen back onto the resection site based on FSA results remains challenging. We propose a novel deformable registration framework that uses both the pre-resection upper surface and the post-resection site of the specimen to incorporate thickness information into the registration process. The proposed method significantly improves target registration error (TRE), demonstrating enhanced adaptability to thicker specimens. In tongue specimens, the proposed framework improved TRE by up to 33% as compared to prior deformable registration. Notably, tongue specimens exhibit complex 3D anatomies and hold the highest clinical significance compared to other head and neck specimens from the buccal and skin. We analyzed distinct deformation behaviors in different specimens, highlighting the need for tailored deformation strategies. To further aid intraoperative visualization, we also integrated this framework with an augmented reality-based auto-alignment system. The combined system can accurately and automatically overlay the deformed 3D specimen mesh with positive margin annotation onto the resection site. With a pilot study of the AR guided framework involving two surgeons, the integrated system improved the surgeons' average target relocation error from 9.8 cm to 4.8 cm.
comment: Accepted at MICCAI 2025
♻ ☆ C3Editor: Achieving Controllable Consistency in 2D Model for 3D Editing ICCV 2025
Existing 2D-lifting-based 3D editing methods often encounter challenges related to inconsistency, stemming from the lack of view-consistent 2D editing models and the difficulty of ensuring consistent editing across multiple views. To address these issues, we propose C3Editor, a controllable and consistent 2D-lifting-based 3D editing framework. Given an original 3D representation and a text-based editing prompt, our method selectively establishes a view-consistent 2D editing model to achieve superior 3D editing results. The process begins with the controlled selection of a ground truth (GT) view and its corresponding edited image as the optimization target, allowing for user-defined manual edits. Next, we fine-tune the 2D editing model within the GT view and across multiple views to align with the GT-edited image while ensuring multi-view consistency. To meet the distinct requirements of GT view fitting and multi-view consistency, we introduce separate LoRA modules for targeted fine-tuning. Our approach delivers more consistent and controllable 2D and 3D editing results than existing 2D-lifting-based methods, outperforming them in both qualitative and quantitative evaluations.
comment: ICCV 2025 Workshop Wild3D
♻ ☆ NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation NeurIPS 2025
Recent advances in reinforcement learning (RL) have strengthened the reasoning capabilities of vision-language models (VLMs). However, enhancing policy exploration to better scale test-time compute remains largely underexplored. In addition, VLMs continue to struggle with imperfect visual perception, which in turn affects the subsequent reasoning process. We introduce NoisyRollout, a simple yet effective data augmentation method that addresses these issues by mixing training trajectories from both clean and moderately distorted images. This approach injects perceptual diversity, encouraging better policy exploration and leading to more robust reasoning. A noise annealing schedule gradually reduces distortion strength, aiding exploration early in training while ensuring later stability. Crucially, our method is easy-to-adopt--requiring no additional training cost and no modifications to the RL objective. Extensive experiments on 2 distinct training datasets demonstrate that NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models across 5 out-of-domain reasoning and perception benchmarks. Furthermore, we validate the effectiveness of NoisyRollout across model sizes (7B and 32B), data scales (from 1K to 6K) and image augmentation types (Gaussion noise and rotation), highlighting its generalizability and scalability.
comment: NeurIPS 2025
♻ ☆ Dynamic Gaussian Splatting from Defocused and Motion-blurred Monocular Videos NeurIPS 2025
This paper presents a unified framework that allows high-quality dynamic Gaussian Splatting from both defocused and motion-blurred monocular videos. Due to the significant difference between the formation processes of defocus blur and motion blur, existing methods are tailored for either one of them, lacking the ability to simultaneously deal with both of them. Although the two can be jointly modeled as blur kernel-based convolution, the inherent difficulty in estimating accurate blur kernels greatly limits the progress in this direction. In this work, we go a step further towards this direction. Particularly, we propose to estimate per-pixel reliable blur kernels using a blur prediction network that exploits blur-related scene and camera information and is subject to a blur-aware sparsity constraint. Besides, we introduce a dynamic Gaussian densification strategy to mitigate the lack of Gaussians for incomplete regions, and boost the performance of novel view synthesis by incorporating unseen view information to constrain scene optimization. Extensive experiments show that our method outperforms the state-of-the-art methods in generating photorealistic novel view synthesis from defocused and motion-blurred monocular videos. Our code is available at https://github.com/hhhddddddd/dydeblur.
comment: Accepted to NeurIPS 2025
♻ ☆ GASP: Gaussian Splatting for Physic-Based Simulations
Physics simulation is paramount for modeling and utilizing 3D scenes in various real-world applications. However, integrating with state-of-the-art 3D scene rendering techniques such as Gaussian Splatting (GS) remains challenging. Existing models use additional meshing mechanisms, including triangle or tetrahedron meshing, marching cubes, or cage meshes. Alternatively, we can modify the physics-grounded Newtonian dynamics to align with 3D Gaussian components. Current models take the first-order approximation of a deformation map, which locally approximates the dynamics by linear transformations. In contrast, our GS for Physics-Based Simulations (GASP) pipeline uses parametrized flat Gaussian distributions. Consequently, the problem of modeling Gaussian components using the physics engine is reduced to working with 3D points. In our work, we present additional rules for manipulating Gaussians, demonstrating how to adapt the pipeline to incorporate meshes, control Gaussian sizes during simulations, and enhance simulation efficiency. This is achieved through the Gaussian grouping strategy, which implements hierarchical structuring and enables simulations to be performed exclusively on selected Gaussians. The resulting solution can be integrated into any physics engine that can be treated as a black box. As demonstrated in our studies, the proposed pipeline exhibits superior performance on a diverse range of benchmark datasets designed for 3D object rendering. The project webpage, which includes additional visualizations, can be found at https://waczjoan.github.io/GASP.
♻ ☆ D$^2$USt3R: Enhancing 3D Reconstruction for Dynamic Scenes NeurIPS 2025
In this work, we address the task of 3D reconstruction in dynamic scenes, where object motions frequently degrade the quality of previous 3D pointmap regression methods, such as DUSt3R, that are originally designed for static 3D scene reconstruction. Although these methods provide an elegant and powerful solution in static settings, they struggle in the presence of dynamic motions that disrupt alignment based solely on camera poses. To overcome this, we propose $D^2USt3R$ that directly regresses Static-Dynamic Aligned Pointmaps (SDAP) that simultaneiously capture both static and dynamic 3D scene geometry. By explicitly incorporating both spatial and temporal aspects, our approach successfully encapsulates 3D dense correspondence to the proposed pointmaps, enhancing downstream tasks. Extensive experimental evaluations demonstrate that our proposed approach consistently achieves superior 3D reconstruction performance across various datasets featuring complex motions.
comment: NeurIPS 2025; project page: https://cvlab-kaist.github.io/DDUSt3R/
♻ ☆ DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO NeurIPS 2025
Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement algorithm with group-normalized rewards. However, the effectiveness of GRPO in Video Large Language Models (VideoLLMs) has still been less studyed. In this paper, we explore GRPO and identify two problems that deteriorate the effective learning: (1) reliance on safeguards, and (2) vanishing advantage. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO loss function into a regression task that directly predicts the advantage in GRPO, eliminating the need for safeguards such as the clipping and min functions. It directly aligns the model with advantages, providing guidance to prefer better ones. The difficulty-aware data augmentation strategy augments input prompts/videos to locate the difficulty of samples at solvable difficulty levels, enabling diverse reward signals. Our experimental results show that our approach significantly improves video reasoning performance across multiple benchmarks.
comment: NeurIPS 2025
♻ ☆ Rethinking Metrics and Benchmarks of Video Anomaly Detection
Video Anomaly Detection (VAD), which aims to detect anomalies that deviate from expectation, has attracted increasing attention in recent years. Existing advancements in VAD primarily focus on model architectures and training strategies, while devoting insufficient attention to evaluation metrics and benchmarks. In this paper, we rethink VAD evaluation methods through comprehensive analyses, revealing three critical limitations in current practices: 1) existing metrics are significantly influenced by single annotation bias; 2) current metrics fail to reward early detection of anomalies; 3) available benchmarks lack the capability to evaluate scene overfitting of fully/weakly-supervised algorithms. To address these limitations, we propose three novel evaluation methods: first, we establish probabilistic AUC/AP (Prob-AUC/AP) metrics utlizing multi-round annotations to mitigate single annotation bias; second, we develop a Latency-aware Average Precision (LaAP) metric that rewards early and accurate anomaly detection; and finally, we introduce two hard normal benchmarks (UCF-HN, MSAD-HN) with videos specifically designed to evaluate scene overfitting. We report performance comparisons of ten state-of-the-art VAD approaches using our proposed evaluation methods, providing novel perspectives for future VAD model development. We release our data and code in https://github.com/Kamino666/RethinkingVAD.
♻ ☆ Mano Technical Report
Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decisionmaking capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
♻ ☆ Manifold Learning for Hyperspectral Images
Traditional feature extraction and projection techniques, such as Principal Component Analysis, struggle to adequately represent X-Ray Transmission (XRT) Multi-Energy (ME) images, limiting the performance of neural networks in decision-making processes. To address this issue, we propose a method that approximates the dataset topology by constructing adjacency graphs using the Uniform Manifold Approximation and Projection. This approach captures nonlinear correlations within the data, significantly improving the performance of machine learning algorithms, particularly in processing Hyperspectral Images (HSI) from X-ray transmission spectroscopy. This technique not only preserves the global structure of the data but also enhances feature separability, leading to more accurate and robust classification results.
♻ ☆ EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting NeurIPS 2025
Scene reconstruction from casually captured videos has wide applications in real-world scenarios. With recent advancements in differentiable rendering techniques, several methods have attempted to simultaneously optimize scene representations (NeRF or 3DGS) and camera poses. Despite recent progress, existing methods relying on traditional camera input tend to fail in high-speed (or equivalently low-frame-rate) scenarios. Event cameras, inspired by biological vision, record pixel-wise intensity changes asynchronously with high temporal resolution, providing valuable scene and motion information in blind inter-frame intervals. In this paper, we introduce the event camera to aid scene construction from a casually captured video for the first time, and propose Event-Aided Free-Trajectory 3DGS, called EF-3DGS, which seamlessly integrates the advantages of event cameras into 3DGS through three key components. First, we leverage the Event Generation Model (EGM) to fuse events and frames, supervising the rendered views observed by the event stream. Second, we adopt the Contrast Maximization (CMax) framework in a piece-wise manner to extract motion information by maximizing the contrast of the Image of Warped Events (IWE), thereby calibrating the estimated poses. Besides, based on the Linear Event Generation Model (LEGM), the brightness information encoded in the IWE is also utilized to constrain the 3DGS in the gradient domain. Third, to mitigate the absence of color information of events, we introduce photometric bundle adjustment (PBA) to ensure view consistency across events and frames. We evaluate our method on the public Tanks and Temples benchmark and a newly collected real-world dataset, RealEv-DAVIS. Our project page is https://lbh666.github.io/ef-3dgs/.
comment: Accepted to NeurIPS 2025,Project Page: https://lbh666.github.io/ef-3dgs/
♻ ☆ Poisson Informed Retinex Network for Extreme Low-Light Image Enhancement
Low-light image denoising and enhancement are challenging, especially when traditional noise assumptions, such as Gaussian noise, do not hold in majority. In many real-world scenarios, such as low-light imaging, noise is signal-dependent and is better represented as Poisson noise. In this work, we address the problem of denoising images degraded by Poisson noise under extreme low-light conditions. We introduce a light-weight deep learning-based method that integrates Retinex based decomposition with Poisson denoising into a unified encoder-decoder network. The model simultaneously enhances illumination and suppresses noise by incorporating a Poisson denoising loss to address signal-dependent noise. Without prior requirement for reflectance and illumination, the network learns an effective decomposition process while ensuring consistent reflectance and smooth illumination without causing any form of color distortion. The experimental results demonstrate the effectiveness and practicality of the proposed low-light illumination enhancement method. Our method significantly improves visibility and brightness in low-light conditions, while preserving image structure and color constancy under ambient illumination.
comment: 10 pages, 5 figures and 1 table
♻ ☆ LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation
We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM driven part query refinement strategy. The code will be released here.
comment: 10 pages, 5 figures, 14 tables, Neurips 2025
♻ ☆ FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction
High-quality 3D world models are pivotal for embodied intelligence and Artificial General Intelligence (AGI), underpinning applications such as AR/VR content creation and robotic navigation. Despite the established strong imaginative priors, current video foundation models lack explicit 3D grounding capabilities, thus being limited in both spatial consistency and their utility for downstream 3D reasoning tasks. In this work, we present FantasyWorld, a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch, enabling joint modeling of video latents and an implicit 3D field in a single forward pass. Our approach introduces cross-branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction, thus yielding consistent and generalizable 3D-aware video representations. Notably, the resulting latents from the geometric branch can potentially serve as versatile representations for downstream 3D tasks such as novel view synthesis and navigation, without requiring per-scene optimization or fine-tuning. Extensive experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency. Ablation studies further confirm that these gains stem from the unified backbone and cross-branch information exchange.
♻ ☆ BALR-SAM: Boundary-Aware Low-Rank Adaptation of SAM for Resource-Efficient Medical Image Segmentation
Vision foundation models like the Segment Anything Model (SAM), pretrained on large-scale natural image datasets, often struggle in medical image segmentation due to a lack of domain-specific adaptation. In clinical practice, fine-tuning such models efficiently for medical downstream tasks with minimal resource demands, while maintaining strong performance, is challenging. To address these issues, we propose BALR-SAM, a boundary-aware low-rank adaptation framework that enhances SAM for medical imaging. It combines three tailored components: (1) a Complementary Detail Enhancement Network (CDEN) using depthwise separable convolutions and multi-scale fusion to capture boundary-sensitive features essential for accurate segmentation; (2) low-rank adapters integrated into SAM's Vision Transformer blocks to optimize feature representation and attention for medical contexts, while simultaneously significantly reducing the parameter space; and (3) a low-rank tensor attention mechanism in the mask decoder, cutting memory usage by 75% and boosting inference speed. Experiments on standard medical segmentation datasets show that BALR-SAM, without requiring prompts, outperforms several state-of-the-art (SOTA) methods, including fully fine-tuned MedSAM, while updating just 1.8% (11.7M) of its parameters.
♻ ☆ AMD-Hummingbird: Towards an Efficient Text-to-Video Model
Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency and high visual quality, particularly on resource-limited devices, e.g.,iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31X speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.
comment: Homepage: https://www.amd.com/en/developer/resources/technical-articles/amd-hummingbird-0-9b-text-to-video-diffusion-model-with-4-step-inferencing.html| GitHub: https://github.com/AMD-AIG-AIMA/AMD-Hummingbird-T2V
♻ ☆ Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling NeurIPS 2025
Diffusion-based generative processes, formulated as differential equation solving, frequently balance computational speed with sample quality. Our theoretical investigation of ODE- and SDE-based solvers reveals complementary weaknesses: ODE solvers accumulate irreducible gradient error along deterministic trajectories, while SDE methods suffer from amplified discretization errors when the step budget is limited. Building upon this insight, we introduce AdaSDE, a novel single-step SDE solver that aims to unify the efficiency of ODEs with the error resilience of SDEs. Specifically, we introduce a single per-step learnable coefficient, estimated via lightweight distillation, which dynamically regulates the error correction strength to accelerate diffusion sampling. Notably, our framework can be integrated with existing solvers to enhance their capabilities. Extensive experiments demonstrate state-of-the-art performance: at 5 NFE, AdaSDE achieves FID scores of 4.18 on CIFAR-10, 8.05 on FFHQ and 6.96 on LSUN Bedroom. Codes are available in https://github.com/WLU-wry02/AdaSDE.
comment: To appear in NeurIPS 2025
♻ ☆ MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
comment: Code available at: https://github.com/HKUST-MINSys-Lab/MMEdge. Accepted by SenSys 2026
♻ ☆ MixedGaussianAvatar: Realistically and Geometrically Accurate Head Avatar via Mixed 2D-3D Gaussians
Reconstructing high-fidelity 3D head avatars is crucial in various applications such as virtual reality. The pioneering methods reconstruct realistic head avatars with Neural Radiance Fields (NeRF), which have been limited by training and rendering speed. Recent methods based on 3D Gaussian Splatting (3DGS) significantly improve the efficiency of training and rendering. However, the surface inconsistency of 3DGS results in subpar geometric accuracy; later, 2DGS uses 2D surfels to enhance geometric accuracy at the expense of rendering fidelity. To leverage the benefits of both 2DGS and 3DGS, we propose a novel method named MixedGaussianAvatar for realistically and geometrically accurate head avatar reconstruction. Our main idea is to utilize 2D Gaussians to reconstruct the surface of the 3D head, ensuring geometric accuracy. We attach the 2D Gaussians to the triangular mesh of the FLAME model and connect additional 3D Gaussians to those 2D Gaussians where the rendering quality of 2DGS is inadequate, creating a mixed 2D-3D Gaussian representation. These 2D-3D Gaussians can then be animated using FLAME parameters. We further introduce a progressive training strategy that first trains the 2D Gaussians and then fine-tunes the mixed 2D-3D Gaussians. We use a unified mixed Gaussian representation to integrate the two modalities of 2D image and 3D mesh. Furthermore, the comprehensive experiments demonstrate the superiority of MixedGaussianAvatar. The code will be released.
♻ ☆ Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers ICCV2025
Data-free quantization (DFQ) enables model quantization without accessing real data, addressing concerns regarding data security and privacy. With the growing adoption of Vision Transformers (ViTs), DFQ for ViTs has garnered significant attention. However, existing DFQ methods exhibit two limitations: (1) semantic distortion, where the semantics of synthetic images deviate substantially from those of real images, and (2) semantic inadequacy, where synthetic images contain extensive regions with limited content and oversimplified textures, leading to suboptimal quantization performance. To address these limitations, we propose SARDFQ, a novel Semantics Alignment and Reinforcement Data-Free Quantization method for ViTs. To address semantic distortion, SARDFQ incorporates Attention Priors Alignment (APA), which optimizes synthetic images to follow randomly generated structure attention priors. To mitigate semantic inadequacy, SARDFQ introduces Multi-Semantic Reinforcement (MSR), leveraging localized patch optimization to enhance semantic richness across synthetic images. Furthermore, SARDFQ employs Soft-Label Learning (SL), wherein multiple semantic targets are adapted to facilitate the learning of multi-semantic images augmented by MSR. Extensive experiments demonstrate the effectiveness of SARDFQ, significantly surpassing existing methods. For example, SARDFQ improves top-1 accuracy on ImageNet by 15.52% for W4A4 ViT-B. The code is at https://github.com/zysxmu/SARDFQ.
comment: ICCV2025
♻ ☆ Panoramic Out-of-Distribution Segmentation for Autonomous Driving
Panoramic imaging enables capturing 360{\deg} images with an ultra-wide Field-of-View (FoV) for dense omnidirectional perception, which is critical to applications, such as autonomous driving and augmented reality, etc. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to background clutter and pixel distortions. To address these issues, we introduce a new task, Panoramic Out-of-distribution Segmentation (PanOoS), with the aim of achieving comprehensive and safe scene understanding. Furthermore, we propose the first solution, POS, which adapts to the characteristics of panoramic images through text-guided prompt distribution learning. Specifically, POS integrates a disentanglement strategy designed to materialize the cross-domain generalization capability of CLIP. The proposed Prompt-based Restoration Attention (PRA) optimizes semantic decoding by prompt guidance and self-adaptive correction, while Bilevel Prompt Distribution Learning (BPDL) refines the manifold of per-pixel mask embeddings via semantic prototype supervision. Besides, to compensate for the scarcity of PanOoS datasets, we establish two benchmarks: DenseOoS, which features diverse outliers in complex environments, and QuadOoS, captured by a quadruped robot with a panoramic annular lens system. Extensive experiments demonstrate superior performance of POS, with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS, outperforming state-of-the-art pinhole-OoS methods. Moreover, POS achieves leading closed-set segmentation capabilities and advances the development of panoramic understanding. Code and datasets will be available at https://github.com/MengfeiD/PanOoS.
comment: Code and datasets will be available at https://github.com/MengfeiD/PanOoS
♻ ☆ Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Vision-Language Models (VLMs) often struggle to balance visual and textual information when summarizing complex multimodal inputs, such as entire TV show episodes. In this paper, we propose a zero-shot video-to-text summarization approach that builds its own screenplay representation of an episode, effectively integrating key video moments, dialogue, and character information into a unified document. Unlike previous approaches, we simultaneously generate screenplays and name the characters in zero-shot, using only the audio, video, and transcripts as input. Additionally, we highlight that existing summarization metrics can fail to assess the multimodal content in summaries. To address this, we introduce MFactSum, a multimodal metric that evaluates summaries with respect to both vision and text modalities. Using MFactSum, we evaluate our screenplay summaries on the SummScreen3D dataset, demonstrating superiority against state-of-the-art VLMs such as Gemini 1.5 by generating summaries containing 20% more relevant visual information while requiring 75% less of the video as input.
♻ ☆ IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model's capacity and limiting adaptability to downstream tasks. In this paper, we propose InstanceGrounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering through only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations with a novel data curation pipeline.
comment: https://github.com/lifuguan/IGGT_official
♻ ☆ Scaling Diffusion Transformers Efficiently via $μ$P NeurIPS 2025
Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including U-ViT, DiT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-$\alpha$ and 3% of consumption by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.
comment: Accepted by NeurIPS 2025, 38 pages, 10 figures, 17 tables
♻ ☆ Variational Visual Question Answering for Uncertainty-Aware Selective Prediction
Despite remarkable progress in recent years, vision language models (VLMs) remain prone to overconfidence and hallucinations on tasks such as Visual Question Answering (VQA) and Visual Reasoning. Bayesian methods can potentially improve reliability by helping models selectively predict, that is, models respond only when they are sufficiently confident. Unfortunately, Bayesian methods are often assumed to be costly and ineffective for large models, and so far there exists little evidence to show otherwise, especially for multimodal applications. Here, we show the effectiveness and competitive edge of variational Bayes for selective prediction in VQA for the first time. We build on recent advances in variational methods for deep learning and propose an extension called "Variational VQA". This method improves calibration and yields significant gains for selective prediction on VQA and Visual Reasoning, particularly when the error tolerance is low ($\leq 1\%$). Often, just one posterior sample can yield more reliable answers than those obtained by models trained with AdamW. In addition, we propose a new risk-averse selector that outperforms standard sample averaging by considering the variance of predictions. Overall, we present compelling evidence that variational learning is a viable option to make large VLMs safer and more trustworthy.
comment: under review at TMLR
♻ ☆ On the Theory of Conditional Feature Alignment for Unsupervised Domain-Adaptive Counting
Object counting models suffer when deployed across domains with differing density variety, since density shifts are inherently task-relevant and violate standard domain adaptation assumptions. To address this, we propose a theoretical framework of conditional feature alignment and provide a straightforward implementation. By theoretical analysis, our framework is feasible to achieve superior cross-domain generalization for counting. In the presented network, the features related to density are explicitly preserved across domains. Theoretically, we formalize the notion of conditional divergence by partitioning each domain into subsets and measuring divergences per condition. We then derive a joint error bound showing that, under discrete label spaces treated as condition sets, aligning distributions conditionally leads to tighter bounds on the combined source-target decision error than unconditional alignment. Empirically, we demonstrate the effectiveness of our approach through extensive experiments on multiple counting datasets with varying density distributions. The results show that our method outperforms existing unsupervised domain adaptation methods, empirically validating the theoretical insights on conditional feature alignment.
comment: 18 pages, 6 figures, 5 tables
♻ ☆ SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code is available at https://github.com/thu-ml/SageAttention.
♻ ☆ Continual Vision-and-Language Navigation
Developing Vision-and-Language Navigation (VLN) agents typically assumes a \textit{train-once-deploy-once} strategy, which is unrealistic as deployed agents continually encounter novel environments. To address this, we propose the Continual Vision-and-Language Navigation (CVLN) paradigm, where agents learn and adapt incrementally across multiple \textit{scene domains}. CVLN includes two setups: Initial-instruction based CVLN for instruction-following, and Dialogue-based CVLN for dialogue-guided navigation. We also introduce two simple yet effective baselines for sequential decision-making: Perplexity Replay (PerpR), which replays difficult episodes, and Episodic Self-Replay (ESR), which stores and revisits action logits during training. Experiments show that existing continual learning methods fall short for CVLN, while PerpR and ESR achieve better performance by efficiently utilizing replay memory.
♻ ☆ Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning
We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
comment: This work is to appear in the Proceedings of MICAD 2025, the 6th International Conference on Medical Imaging and Computer-Aided Diagnosis
♻ ☆ DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications
Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.
♻ ☆ Larger Hausdorff Dimension in Scanning Pattern Facilitates Mamba-Based Methods in Low-Light Image Enhancement
We propose an innovative enhancement to the Mamba framework by increasing the Hausdorff dimension of its scanning pattern through a novel Hilbert Selective Scan mechanism. This mechanism explores the feature space more effectively, capturing intricate fine-scale details and improving overall coverage. As a result, it mitigates information inconsistencies while refining spatial locality to better capture subtle local interactions without sacrificing the model's ability to handle long-range dependencies. Extensive experiments on publicly available benchmarks demonstrate that our approach significantly improves both the quantitative metrics and qualitative visual fidelity of existing Mamba-based low-light image enhancement methods, all while reducing computational resource consumption and shortening inference time. We believe that this refined strategy not only advances the state-of-the-art in low-light image enhancement but also holds promise for broader applications in fields that leverage Mamba-based techniques.
♻ ☆ AVA: Towards Agentic Video Analytics with Vision Language Models
AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively-significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%. The source code of AVA is available at https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at https://huggingface.co/datasets/iesc/Ava-100.
comment: Accepted to NDSI 2026, 19pages, 12 figures, complementary evaluations and appendix
♻ ☆ Tensor Completion via Monotone Inclusion: Generalized Low-Rank Priors Meet Deep Denoisers
Missing entries in multi dimensional data pose significant challenges for downstream analysis across diverse real world applications. These data are naturally represented as tensors, and recent completion methods integrating global low rank priors with plug and play denoisers have demonstrated strong empirical performance. However, these approaches often rely on empirical convergence alone or unrealistic assumptions, such as deep denoisers acting as proximal operators of implicit regularizers, which generally does not hold. To address these limitations, we propose a novel tensor completion framework grounded in the monotone inclusion paradigm. Within this framework, deep denoisers are treated as general operators that require far fewer restrictions than in classical optimization based formulations. To better capture holistic structure, we further incorporate generalized low rank priors with weakly convex penalties. Building upon the Davis Yin splitting scheme, we develop the GTCTV DPC algorithm and rigorously establish its global convergence. Extensive experiments demonstrate that GTCTV DPC consistently outperforms existing methods in both quantitative metrics and visual quality, particularly at low sampling rates. For instance, at a sampling rate of 0.05 for multi dimensional image completion, GTCTV DPC achieves an average mean peak signal to noise ratio (MPSNR) that surpasses the second best method by 0.717 dB, and 0.649 dB for multi spectral images, and color videos, respectively.
comment: 14 pages, 8 figures, 6 tables
♻ ☆ Neural Posterior Estimation for Cataloging Astronomical Images from the Legacy Survey of Space and Time
The Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) will commence full-scale operations in 2026, yielding an unprecedented volume of astronomical images. Constructing an astronomical catalog, a table of imaged stars, galaxies, and their properties, is a fundamental step in most scientific workflows based on astronomical image data. Traditional deterministic cataloging methods lack statistical coherence as cataloging is an ill-posed problem, while existing probabilistic approaches suffer from computational inefficiency, inaccuracy, or the inability to perform inference with multiband coadded images, the primary output format for LSST images. In this article, we explore a recently developed Bayesian inference method called neural posterior estimation (NPE) as an approach to cataloging. NPE leverages deep learning to achieve both computational efficiency and high accuracy. When evaluated on the DC2 Simulated Sky Survey -- a highly realistic synthetic dataset designed to mimic LSST data -- NPE systematically outperforms the standard LSST pipeline in light source detection, flux measurement, star/galaxy classification, and galaxy shape measurement. Additionally, NPE provides well-calibrated posterior approximations. These promising results, obtained using simulated data, illustrate the potential of NPE in the absence of model misspecification. Although some degree of model misspecification is inevitable in the application of NPE to real LSST images, there are a variety of strategies to mitigate its effects.
♻ ☆ Face Spoofing Detection using Deep Learning
Digital image spoofing has emerged as a significant security threat in biometric authentication systems, particularly those relying on facial recognition. This study evaluates the performance of three vision based models, MobileNetV2, ResNET50, and Vision Transformer, ViT, for spoof detection in image classification, utilizing a dataset of 150,986 images divided into training , 140,002, testing, 10,984, and validation ,39,574, sets. Spoof detection is critical for enhancing the security of image recognition systems, and this research compares the models effectiveness through accuracy, precision, recall, and F1 score metrics. Results reveal that MobileNetV2 outperforms other architectures on the test dataset, achieving an accuracy of 91.59%, precision of 91.72%, recall of 91.59%, and F1 score of 91.58%, compared to ViT 86.54%, 88.28%, 86.54%, and 86.39%, respectively. On the validation dataset, MobileNetV2, and ViT excel, with MobileNetV2 slightly ahead at 97.17% accuracy versus ViT 96.36%. MobileNetV2 demonstrates faster convergence during training and superior generalization to unseen data, despite both models showing signs of overfitting. These findings highlight MobileNetV2 balanced performance and robustness, making it the preferred choice for spoof detection applications where reliability on new data is essential. The study underscores the importance of model selection in security sensitive contexts and suggests MobileNetV2 as a practical solution for real world deployment.
comment: The author's school has a conflict of interest regarding the submission of this article prior to his graduation thesis submission
Artificial Intelligence 177
☆ Continuous Autoregressive Language Models
The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9\% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models. Code: https://github.com/shaochenze/calm. Project: https://shaochenze.github.io/blog/2025/CALM.
☆ PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
Recent advances in vision-language models (VLMs) have enabled impressive multimodal reasoning, yet most medical applications remain limited to 2D imaging. In this work, we extend VLMs to 3D positron emission tomography and computed tomography (PET/CT), a domain characterized by large volumetric data, small and dispersed lesions, and lengthy radiology reports. We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams, extracted via a hybrid rule-based and large language model (LLM) pipeline. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. PETAR bridges global contextual reasoning with fine-grained lesion awareness, producing clinically coherent and localized findings. Comprehensive automated and human evaluations demonstrate that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.
☆ MolChord: Structure-Sequence Alignment for Protein-Guided Drug Design
Structure-based drug design (SBDD), which maps target proteins to candidate molecular ligands, is a fundamental task in drug discovery. Effectively aligning protein structural representations with molecular representations, and ensuring alignment between generated drugs and their pharmacological properties, remains a critical challenge. To address these challenges, we propose MolChord, which integrates two key techniques: (1) to align protein and molecule structures with their textual descriptions and sequential representations (e.g., FASTA for proteins and SMILES for molecules), we leverage NatureLM, an autoregressive model unifying text, small molecules, and proteins, as the molecule generator, alongside a diffusion-based structure encoder; and (2) to guide molecules toward desired properties, we curate a property-aware dataset by integrating preference data and refine the alignment process using Direct Preference Optimization (DPO). Experimental results on CrossDocked2020 demonstrate that our approach achieves state-of-the-art performance on key evaluation metrics, highlighting its potential as a practical tool for SBDD.
comment: 21 pages
☆ Challenges in Credit Assignment for Multi-Agent Reinforcement Learning in Open Agent Systems
In the rapidly evolving field of multi-agent reinforcement learning (MARL), understanding the dynamics of open systems is crucial. Openness in MARL refers to the dynam-ic nature of agent populations, tasks, and agent types with-in a system. Specifically, there are three types of openness as reported in (Eck et al. 2023) [2]: agent openness, where agents can enter or leave the system at any time; task openness, where new tasks emerge, and existing ones evolve or disappear; and type openness, where the capabil-ities and behaviors of agents change over time. This report provides a conceptual and empirical review, focusing on the interplay between openness and the credit assignment problem (CAP). CAP involves determining the contribution of individual agents to the overall system performance, a task that becomes increasingly complex in open environ-ments. Traditional credit assignment (CA) methods often assume static agent populations, fixed and pre-defined tasks, and stationary types, making them inadequate for open systems. We first conduct a conceptual analysis, in-troducing new sub-categories of openness to detail how events like agent turnover or task cancellation break the assumptions of environmental stationarity and fixed team composition that underpin existing CAP methods. We then present an empirical study using representative temporal and structural algorithms in an open environment. The results demonstrate that openness directly causes credit misattribution, evidenced by unstable loss functions and significant performance degradation.
☆ Community Detection on Model Explanation Graphs for Explainable AI
Feature-attribution methods (e.g., SHAP, LIME) explain individual predictions but often miss higher-order structure: sets of features that act in concert. We propose Modules of Influence (MoI), a framework that (i) constructs a model explanation graph from per-instance attributions, (ii) applies community detection to find feature modules that jointly affect predictions, and (iii) quantifies how these modules relate to bias, redundancy, and causality patterns. Across synthetic and real datasets, MoI uncovers correlated feature groups, improves model debugging via module-level ablations, and localizes bias exposure to specific modules. We release stability and synergy metrics, a reference implementation, and evaluation protocols to benchmark module discovery in XAI.
☆ Information-Theoretic Greedy Layer-wise Training for Traffic Sign Recognition
Modern deep neural networks (DNNs) are typically trained with a global cross-entropy loss in a supervised end-to-end manner: neurons need to store their outgoing weights; training alternates between a forward pass (computation) and a top-down backward pass (learning) which is biologically implausible. Alternatively, greedy layer-wise training eliminates the need for cross-entropy loss and backpropagation. By avoiding the computation of intermediate gradients and the storage of intermediate outputs, it reduces memory usage and helps mitigate issues such as vanishing or exploding gradients. However, most existing layer-wise training approaches have been evaluated only on relatively small datasets with simple deep architectures. In this paper, we first systematically analyze the training dynamics of popular convolutional neural networks (CNNs) trained by stochastic gradient descent (SGD) through an information-theoretic lens. Our findings reveal that networks converge layer-by-layer from bottom to top and that the flow of information adheres to a Markov information bottleneck principle. Building on these observations, we propose a novel layer-wise training approach based on the recently developed deterministic information bottleneck (DIB) and the matrix-based R\'enyi's $\alpha$-order entropy functional. Specifically, each layer is trained jointly with an auxiliary classifier that connects directly to the output layer, enabling the learning of minimal sufficient task-relevant representations. We empirically validate the effectiveness of our training procedure on CIFAR-10 and CIFAR-100 using modern deep CNNs and further demonstrate its applicability to a practical task involving traffic sign recognition. Our approach not only outperforms existing layer-wise training baselines but also achieves performance comparable to SGD.
☆ VessShape: Few-shot 2D blood vessel segmentation by leveraging shape priors from synthetic images
Semantic segmentation of blood vessels is an important task in medical image analysis, but its progress is often hindered by the scarcity of large annotated datasets and the poor generalization of models across different imaging modalities. A key aspect is the tendency of Convolutional Neural Networks (CNNs) to learn texture-based features, which limits their performance when applied to new domains with different visual characteristics. We hypothesize that leveraging geometric priors of vessel shapes, such as their tubular and branching nature, can lead to more robust and data-efficient models. To investigate this, we introduce VessShape, a methodology for generating large-scale 2D synthetic datasets designed to instill a shape bias in segmentation models. VessShape images contain procedurally generated tubular geometries combined with a wide variety of foreground and background textures, encouraging models to learn shape cues rather than textures. We demonstrate that a model pre-trained on VessShape images achieves strong few-shot segmentation performance on two real-world datasets from different domains, requiring only four to ten samples for fine-tuning. Furthermore, the model exhibits notable zero-shot capabilities, effectively segmenting vessels in unseen domains without any target-specific training. Our results indicate that pre-training with a strong shape bias can be an effective strategy to overcome data scarcity and improve model generalization in blood vessel segmentation.
☆ Sketch-to-Layout: Sketch-Guided Multimodal Layout Generation ICCV 2025
Graphic layout generation is a growing research area focusing on generating aesthetically pleasing layouts ranging from poster designs to documents. While recent research has explored ways to incorporate user constraints to guide the layout generation, these constraints often require complex specifications which reduce usability. We introduce an innovative approach exploiting user-provided sketches as intuitive constraints and we demonstrate empirically the effectiveness of this new guidance method, establishing the sketch-to-layout problem as a promising research direction, which is currently under-explored. To tackle the sketch-to-layout problem, we propose a multimodal transformer-based solution using the sketch and the content assets as inputs to produce high quality layouts. Since collecting sketch training data from human annotators to train our model is very costly, we introduce a novel and efficient method to synthetically generate training sketches at scale. We train and evaluate our model on three publicly available datasets: PubLayNet, DocLayNet and SlidesVQA, demonstrating that it outperforms state-of-the-art constraint-based methods, while offering a more intuitive design experience. In order to facilitate future sketch-to-layout research, we release O(200k) synthetically-generated sketches for the public datasets above. The datasets are available at https://github.com/google-deepmind/sketch_to_layout.
comment: 15 pages, 18 figures, GitHub link: https://github.com/google-deepmind/sketch_to_layout, accept at ICCV 2025 Workshop (HiGen)
☆ Interaction as Intelligence Part II: Asynchronous Human-Agent Rollout for Long-Horizon Task Training
Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation. However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods primarily fall into two categories. The first relies on dense human annotations through behavior cloning, which is prohibitively expensive for long-horizon tasks that can take days or months. The second depends on outcome-driven sampling, which often collapses due to the rarity of valid positive trajectories on domain-specialized tasks. We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo allows them to intervene only when the agent drifts from a promising trajectory, by providing prior knowledge, strategic advice, etc. This lightweight design makes it possible to sustain interactions for over 30 hours and produces valuable trajectories at a lower cost. Apollo then applies supervision control to filter out sub-optimal actions and prevent error propagation. Together, these components enable reliable and effective data collection in long-horizon environments. To demonstrate the effectiveness of Apollo, we evaluate it using InnovatorBench. Our experiments show that when applied to train the GLM-4.5 model on InnovatorBench, Apollo achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo's design in handling long-horizon, domain-specialized tasks.
☆ Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models
Open-weight bio-foundation models present a dual-use dilemma. While holding great promise for accelerating scientific research and drug development, they could also enable bad actors to develop more deadly bioweapons. To mitigate the risk posed by these models, current approaches focus on filtering biohazardous data during pre-training. However, the effectiveness of such an approach remains unclear, particularly against determined actors who might fine-tune these models for malicious use. To address this gap, we propose \eval, a framework to evaluate the robustness of procedures that are intended to reduce the dual-use capabilities of bio-foundation models. \eval assesses models' virus understanding through three lenses, including sequence modeling, mutational effects prediction, and virulence prediction. Our results show that current filtering practices may not be particularly effective: Excluded knowledge can be rapidly recovered in some cases via fine-tuning, and exhibits broader generalizability in sequence modeling. Furthermore, dual-use signals may already reside in the pretrained representations, and can be elicited via simple linear probing. These findings highlight the challenges of data filtering as a standalone procedure, underscoring the need for further research into robust safety and security strategies for open-weight bio-foundation models.
comment: 17 Pages, 5 figures
☆ Validity Is What You Need
While AI agents have long been discussed and studied in computer science, today's Agentic AI systems are something new. We consider other definitions of Agentic AI and propose a new realist definition. Agentic AI is a software delivery mechanism, comparable to software as a service (SaaS), which puts an application to work autonomously in a complex enterprise setting. Recent advances in large language models (LLMs) as foundation models have driven excitement in Agentic AI. We note, however, that Agentic AI systems are primarily applications, not foundations, and so their success depends on validation by end users and principal stakeholders. The tools and techniques needed by the principal users to validate their applications are quite different from the tools and techniques used to evaluate foundation models. Ironically, with good validation measures in place, in many cases the foundation models can be replaced with much simpler, faster, and more interpretable models that handle core logic. When it comes to Agentic AI, validity is what you need. LLMs are one option that might achieve it.
☆ Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning
Multimodal large language models (MLLMs) have advanced embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs. However, such vision driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy. We introduce BEAT, the first framework to inject such visual backdoors into MLLM-based embodied agents using objects in the environments as triggers. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation. Across various embodied agent benchmarks and MLLMs, BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance, and generalizes reliably to out-of-distribution trigger placements. Notably, compared to naive SFT, CTL boosts backdoor activation accuracy up to 39% under limited backdoor data. These findings expose a critical yet unexplored security risk in MLLM-based embodied agents, underscoring the need for robust defenses before real-world deployment.
☆ VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation
Automation of Register Transfer Level (RTL) design can help developers meet increasing computational demands. Large Language Models (LLMs) show promise for Hardware Description Language (HDL) generation, but face challenges due to limited parametric knowledge and domain-specific constraints. While prompt engineering and fine-tuning have limitations in knowledge coverage and training costs, multi-agent architectures offer a training-free paradigm to enhance reasoning through collaborative generation. However, current multi-agent approaches suffer from two critical deficiencies: susceptibility to noise propagation and constrained reasoning space exploration. We propose VeriMoA, a training-free mixture-of-agents (MoA) framework with two synergistic innovations. First, a quality-guided caching mechanism to maintain all intermediate HDL outputs and enables quality-based ranking and selection across the entire generation process, encouraging knowledge accumulation over layers of reasoning. Second, a multi-path generation strategy that leverages C++ and Python as intermediate representations, decomposing specification-to-HDL translation into two-stage processes that exploit LLM fluency in high-resource languages while promoting solution diversity. Comprehensive experiments on VerilogEval 2.0 and RTLLM 2.0 benchmarks demonstrate that VeriMoA achieves 15--30% improvements in Pass@1 across diverse LLM backbones, especially enabling smaller models to match larger models and fine-tuned alternatives without requiring costly training.
☆ Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
comment: preprint
☆ InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research
AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end assessment of agents performing Large Language Model (LLM) research. It comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction, which require runnable artifacts and assessment of correctness, performance, output quality, and uncertainty. To support agent operation, we develop ResearchGym, a research environment offering rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving. We also implement a lightweight ReAct agent that couples explicit reasoning with executable planning using frontier models such as Claude-4, GPT-5, GLM-4.5, and Kimi-K2. Our experiments demonstrate that while frontier models show promise in code-driven research tasks, they struggle with fragile algorithm-related tasks and long-horizon decision making, such as impatience, poor resource management, and overreliance on template-based reasoning. Furthermore, agents require over 11 hours to achieve their best performance on InnovatorBench, underscoring the benchmark's difficulty and showing the potential of InnovatorBench to be the next generation of code-based research benchmark.
☆ Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.
☆ SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning
Solving mathematical reasoning problems requires not only accurate access to relevant knowledge but also careful, multi-step thinking. However, current retrieval-augmented models often rely on a single perspective, follow inflexible search strategies, and struggle to effectively combine information from multiple sources. We introduce SIGMA (Search-Augmented On-Demand Knowledge Integration for AGentic Mathematical reAsoning), a unified framework that orchestrates specialized agents to independently reason, perform targeted searches, and synthesize findings through a moderator mechanism. Each agent generates hypothetical passages to optimize retrieval for its analytic perspective, ensuring knowledge integration is both context-sensitive and computation-efficient. When evaluated on challenging benchmarks such as MATH500, AIME, and PhD-level science QA GPQA, SIGMA consistently outperforms both open- and closed-source systems, achieving an absolute performance improvement of 7.4%. Our results demonstrate that multi-agent, on-demand knowledge integration significantly enhances both reasoning accuracy and efficiency, offering a scalable approach for complex, knowledge-intensive problem-solving. We will release the code upon publication.
comment: Short Paper - Under Review
☆ CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments
As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of real-world coding tasks and developer expectations. To this end, we introduce a multi-language benchmark that evaluates LLM instruction-following capabilities and is extensible to operate on any set of standalone coding problems. Our benchmark evaluates instruction following in two key settings: adherence to pre-defined constraints specified with the initial problem, and the ability to perform refinements based on follow-up instructions. For this paper's analysis, we empirically evaluated our benchmarking pipeline with programming tasks from LiveBench, that are also automatically translated from Python into Java and JavaScript. Our automated benchmark reveals that models exhibit differing levels of performance across multiple dimensions of instruction-following. Our benchmarking pipeline provides a more comprehensive evaluation of code generation models, highlighting their strengths and limitations across languages and generation goals.
☆ Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs
This paper presents a framework that leverages pre-trained foundation models for robotic manipulation without domain-specific training. The framework integrates off-the-shelf models, combining multimodal perception from foundation models with a general-purpose reasoning model capable of robust task sequencing. Scene graphs, dynamically maintained within the framework, provide spatial awareness and enable consistent reasoning about the environment. The framework is evaluated through a series of tabletop robotic manipulation experiments, and the results highlight its potential for building robotic manipulation systems directly on top of off-the-shelf foundation models.
☆ Sybil-Resistant Service Discovery for Agent Economies
x402 enables Hypertext Transfer Protocol (HTTP) services like application programming interfaces (APIs), data feeds, and inference providers to accept cryptocurrency payments for access. As agents increasingly consume these services, discovery becomes critical: which swap interface should an agent trust? Which data provider is the most reliable? We introduce TraceRank, a reputation-weighted ranking algorithm where payment transactions serve as endorsements. TraceRank seeds addresses with precomputed reputation metrics and propagates reputation through payment flows weighted by transaction value and temporal recency. Applied to x402's payment graph, this surfaces services preferred by high-reputation users rather than those with high transaction volume. Our system combines TraceRank with semantic search to respond to natural language queries with high quality results. We argue that reputation propagation resists Sybil attacks by making spam services with many low-reputation payers rank below legitimate services with few high-reputation payers. Ultimately, we aim to construct a search method for x402 enabled services that avoids infrastructure bias and has better performance than purely volume based or semantic methods.
comment: 5 pages
☆ EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities
Implicit policies parameterized by generative models, such as Diffusion Policy, have become the standard for policy learning and Vision-Language-Action (VLA) models in robotics. However, these approaches often suffer from high computational cost, exposure bias, and unstable inference dynamics, which lead to divergence under distribution shifts. Energy-Based Models (EBMs) address these issues by learning energy landscapes end-to-end and modeling equilibrium dynamics, offering improved robustness and reduced exposure bias. Yet, policies parameterized by EBMs have historically struggled to scale effectively. Recent work on Energy-Based Transformers (EBTs) demonstrates the scalability of EBMs to high-dimensional spaces, but their potential for solving core challenges in physically embodied models remains underexplored. We introduce a new energy-based architecture, EBT-Policy, that solves core issues in robotic and real-world settings. Across simulated and real-world tasks, EBT-Policy consistently outperforms diffusion-based policies, while requiring less training and inference computation. Remarkably, on some tasks it converges within just two inference steps, a 50x reduction compared to Diffusion Policy's 100. Moreover, EBT-Policy exhibits emergent capabilities not seen in prior models, such as zero-shot recovery from failed action sequences using only behavior cloning and without explicit retry training. By leveraging its scalar energy for uncertainty-aware inference and dynamic compute allocation, EBT-Policy offers a promising path toward robust, generalizable robot behavior under distribution shifts.
comment: 9 pages, 6 figures, 4 tables
☆ Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance
Large Language Models (LLMs) are increasingly excelling and outpacing human performance on many tasks. However, to improve LLM reasoning, researchers either rely on ad-hoc generated datasets or formal mathematical proof systems such as the Lean proof assistant. Whilst ad-hoc generated methods can capture the decision chains of real-world reasoning processes, they may encode some inadvertent bias in the space of reasoning they cover; they also cannot be formally verified. On the other hand, systems like Lean can guarantee verifiability, but are not well-suited to capture the nature of agentic decision chain-based tasks. This creates a gap both in performance for functions such as business agents or code assistants, and in the usefulness of LLM reasoning benchmarks, whereby these fall short in reasoning structure or real-world alignment. We introduce TempoBench, the first formally grounded and verifiable diagnostic benchmark that parametrizes difficulty to systematically analyze how LLMs perform reasoning. TempoBench uses two evaluation benchmarks to break down reasoning ability. First, temporal trace evaluation (TTE) tests the ability of an LLM to understand and simulate the execution of a given multi-step reasoning system. Subsequently, temporal causal evaluation (TCE) tests an LLM's ability to perform multi-step causal reasoning and to distill cause-and-effect relations from complex systems. We find that models score 65.6% on TCE-normal, and 7.5% on TCE-hard. This shows that state-of-the-art LLMs clearly understand the TCE task but perform poorly as system complexity increases. Our code is available at our \href{https://github.com/nik-hz/tempobench}{GitHub repository}.
☆ DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. While recently developed Arabic and multilingual benchmarks have advanced LLM evaluation for Modern Standard Arabic (MSA), dialectal varieties remain underrepresented despite their prevalence in everyday communication. DialectalArabicMMLU extends the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding a total of 15K QA pairs across 32 academic and professional domains (22K QA pairs when also including English and MSA). The benchmark enables systematic assessment of LLM reasoning and comprehension beyond MSA, supporting both task-based and linguistic analysis. We evaluate 19 open-weight Arabic and multilingual LLMs (1B-13B parameters) and report substantial performance variation across dialects, revealing persistent gaps in dialectal generalization. DialectalArabicMMLU provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, thus promoting more inclusive evaluation and future model development.
comment: 9 pages, 9 tables
☆ TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control
Large Language Models (LLMs) training is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers, 2) OsciReset, an algorithm to suppress weight oscillation, and 3) OutControl, an algorithm to retain outlier accuracy. TetraJet-v2 consistently outperforms prior FP4 training methods on pre-training LLMs across varying model sizes up to 370M and data sizes up to 200B tokens, reducing the performance gap to full-precision training by an average of 51.3%.
☆ Leveraging Generic Time Series Foundation Models for EEG Classification
Foundation models for time series are emerging as powerful general-purpose backbones, yet their potential for domain-specific biomedical signals such as electroencephalography (EEG) remains rather unexplored. In this work, we investigate the applicability a recently proposed time series classification foundation model, to a different EEG tasks such as motor imagery classification and sleep stage prediction. We test two pretraining regimes: (a) pretraining on heterogeneous real-world time series from multiple domains, and (b) pretraining on purely synthetic data. We find that both variants yield strong performance, consistently outperforming EEGNet, a widely used convolutional baseline, and CBraMod, the most recent EEG-specific foundation model. These results suggest that generalist time series foundation models, even when pretrained on data of non-neural origin or on synthetic signals, can transfer effectively to EEG. Our findings highlight the promise of leveraging cross-domain pretrained models for brain signal analysis, suggesting that EEG may benefit from advances in the broader time series literature.
☆ Context-Gated Cross-Modal Perception with Visual Mamba for PET-CT Lung Tumor Segmentation
Accurate lung tumor segmentation is vital for improving diagnosis and treatment planning, and effectively combining anatomical and functional information from PET and CT remains a major challenge. In this study, we propose vMambaX, a lightweight multimodal framework integrating PET and CT scan images through a Context-Gated Cross-Modal Perception Module (CGM). Built on the Visual Mamba architecture, vMambaX adaptively enhances inter-modality feature interaction, emphasizing informative regions while suppressing noise. Evaluated on the PCLT20K dataset, the model outperforms baseline models while maintaining lower computational complexity. These results highlight the effectiveness of adaptive cross-modal gating for multimodal tumor segmentation and demonstrate the potential of vMambaX as an efficient and scalable framework for advanced lung cancer analysis. The code is available at https://github.com/arco-group/vMambaX.
☆ DP-FedPGN: Finding Global Flat Minima for Differentially Private Federated Learning via Penalizing Gradient Norm
To prevent inference attacks in Federated Learning (FL) and reduce the leakage of sensitive information, Client-level Differentially Private Federated Learning (CL-DPFL) is widely used. However, current CL-DPFL methods usually result in sharper loss landscapes, which leads to a decrease in model generalization after differential privacy protection. By using Sharpness Aware Minimization (SAM), the current popular federated learning methods are to find a local flat minimum value to alleviate this problem. However, the local flatness may not reflect the global flatness in CL-DPFL. Therefore, to address this issue and seek global flat minima of models, we propose a new CL-DPFL algorithm, DP-FedPGN, in which we introduce a global gradient norm penalty to the local loss to find the global flat minimum. Moreover, by using our global gradient norm penalty, we not only find a flatter global minimum but also reduce the locally updated norm, which means that we further reduce the error of gradient clipping. From a theoretical perspective, we analyze how DP-FedPGN mitigates the performance degradation caused by DP. Meanwhile, the proposed DP-FedPGN algorithm eliminates the impact of data heterogeneity and achieves fast convergence. We also use R\'enyi DP to provide strict privacy guarantees and provide sensitivity analysis for local updates. Finally, we conduct effectiveness tests on both ResNet and Transformer models, and achieve significant improvements in six visual and natural language processing tasks compared to existing state-of-the-art algorithms. The code is available at https://github.com/junkangLiu0/DP-FedPGN
comment: 21 pages, 8 figures
☆ InertialAR: Autoregressive 3D Molecule Generation with Inertial Frames
Transformer-based autoregressive models have emerged as a unifying paradigm across modalities such as text and images, but their extension to 3D molecule generation remains underexplored. The gap stems from two fundamental challenges: (1) tokenizing molecules into a canonical 1D sequence of tokens that is invariant to both SE(3) transformations and atom index permutations, and (2) designing an architecture capable of modeling hybrid atom-based tokens that couple discrete atom types with continuous 3D coordinates. To address these challenges, we introduce InertialAR. InertialAR devises a canonical tokenization that aligns molecules to their inertial frames and reorders atoms to ensure SE(3) and permutation invariance. Moreover, InertialAR equips the attention mechanism with geometric awareness via geometric rotary positional encoding (GeoRoPE). In addition, it utilizes a hierarchical autoregressive paradigm to predict the next atom-based token, predicting the atom type first and then its 3D coordinates via Diffusion loss. Experimentally, InertialAR achieves state-of-the-art performance on 7 of the 10 evaluation metrics for unconditional molecule generation across QM9, GEOM-Drugs, and B3LYP. Moreover, it significantly outperforms strong baselines in controllable generation for targeted chemical functionality, attaining state-of-the-art results across all 5 metrics.
☆ FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models
AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high variance in the second-moment estimate $\boldsymbol{v}$; (2) the local overfitting of AdamW may cause client drift; and (3) Reinitializing moment estimates ($\boldsymbol{v}$, $\boldsymbol{m}$) at each round slows down convergence. To address these challenges, we propose the first \underline{Fed}erated \underline{AdamW} algorithm, called \texttt{FedAdamW}, for training and fine-tuning various large models. \texttt{FedAdamW} aligns local updates with the global update using both a \textbf{local correction mechanism} and decoupled weight decay to mitigate local overfitting. \texttt{FedAdamW} efficiently aggregates the \texttt{mean} of the second-moment estimates to reduce their variance and reinitialize them. Theoretically, we prove that \texttt{FedAdamW} achieves a linear speedup convergence rate of $\mathcal{O}(\sqrt{(L \Delta \sigma_l^2)/(S K R \epsilon^2)}+(L \Delta)/R)$ without \textbf{heterogeneity assumption}, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. We also employ PAC-Bayesian generalization analysis to explain the effectiveness of decoupled weight decay in local training. Empirically, we validate the effectiveness of \texttt{FedAdamW} on language and vision Transformer models. Compared to several baselines, \texttt{FedAdamW} significantly reduces communication rounds and improves test accuracy. The code is available in https://github.com/junkangLiu0/FedAdamW.
☆ Thought Branches: Interpreting LLM Reasoning Requires Resampling
Most work interpreting reasoning models studies only a single chain-of-thought (CoT), yet these models define distributions over many possible CoTs. We argue that studying a single sample is inadequate for understanding causal influence and the underlying computation. Though fully specifying this distribution is intractable, it can be understood by sampling. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In "agentic misalignment" scenarios, we resample specific sentences to measure their downstream effects. Self-preservation sentences have small causal impact, suggesting they do not meaningfully drive blackmail. Second, are artificial edits to CoT sufficient for steering reasoning? These are common in literature, yet take the model off-policy. Resampling and selecting a completion with the desired property is a principled on-policy alternative. We find off-policy interventions yield small and unstable effects compared to resampling in decision-making tasks. Third, how do we understand the effect of removing a reasoning step when the model may repeat it post-edit? We introduce a resilience metric that repeatedly resamples to prevent similar content from reappearing downstream. Critical planning statements resist removal but have large effects when eliminated. Fourth, since CoT is sometimes "unfaithful", can our methods teach us anything in these settings? Adapting causal mediation analysis, we find that hints that have a causal effect on the output without being explicitly mentioned exert a subtle and cumulative influence on the CoT that persists even if the hint is removed. Overall, studying distributions via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled CoT interventions.
comment: Uzay Macar and Paul C. Bogdan contributed equally to this work, and their listed order was determined by coinflip
☆ VCORE: Variance-Controlled Optimization-based Reweighting for Chain-of-Thought Supervision
Supervised fine-tuning (SFT) on long chain-of-thought (CoT) trajectories has emerged as a crucial technique for enhancing the reasoning abilities of large language models (LLMs). However, the standard cross-entropy loss treats all tokens equally, ignoring their heterogeneous contributions across a reasoning trajectory. This uniform treatment leads to misallocated supervision and weak generalization, especially in complex, long-form reasoning tasks. To address this, we introduce \textbf{V}ariance-\textbf{C}ontrolled \textbf{O}ptimization-based \textbf{RE}weighting (VCORE), a principled framework that reformulates CoT supervision as a constrained optimization problem. By adopting an optimization-theoretic perspective, VCORE enables a principled and adaptive allocation of supervision across tokens, thereby aligning the training objective more closely with the goal of robust reasoning generalization. Empirical evaluations demonstrate that VCORE consistently outperforms existing token reweighting methods. Across both in-domain and out-of-domain settings, VCORE achieves substantial performance gains on mathematical and coding benchmarks, using models from the Qwen3 series (4B, 8B, 32B) and LLaMA-3.1-8B-Instruct. Moreover, we show that VCORE serves as a more effective initialization for subsequent reinforcement learning, establishing a stronger foundation for advancing the reasoning capabilities of LLMs. The Code will be released at https://github.com/coder-gx/VCORE.
comment: Under Review
☆ GeoFM: Enhancing Geometric Reasoning of MLLMs via Synthetic Data Generation through Formal Language
Multi-modal Large Language Models (MLLMs) have gained significant attention in both academia and industry for their capabilities in handling multi-modal tasks. However, these models face challenges in mathematical geometric reasoning due to the scarcity of high-quality geometric data. To address this issue, synthetic geometric data has become an essential strategy. Current methods for generating synthetic geometric data involve rephrasing or expanding existing problems and utilizing predefined rules and templates to create geometric images and problems. However, these approaches often produce data that lacks diversity or is prone to noise. Additionally, the geometric images synthesized by existing methods tend to exhibit limited variation and deviate significantly from authentic geometric diagrams. To overcome these limitations, we propose GeoFM, a novel method for synthesizing geometric data. GeoFM uses formal languages to explore combinations of conditions within metric space, generating high-fidelity geometric problems that differ from the originals while ensuring correctness through a symbolic engine. Experimental results show that our synthetic data significantly outperforms existing methods. The model trained with our data surpass the proprietary GPT-4o model by 18.7\% on geometry problem-solving tasks in MathVista and by 16.5\% on GeoQA. Additionally, it exceeds the performance of a leading open-source model by 5.7\% on MathVista and by 2.7\% on GeoQA.
☆ CoMViT: An Efficient Vision Backbone for Supervised Classification in Medical Imaging MICCAI 2025
Vision Transformers (ViTs) have demonstrated strong potential in medical imaging; however, their high computational demands and tendency to overfit on small datasets limit their applicability in real-world clinical scenarios. In this paper, we present CoMViT, a compact and generalizable Vision Transformer architecture optimized for resource-constrained medical image analysis. CoMViT integrates a convolutional tokenizer, diagonal masking, dynamic temperature scaling, and pooling-based sequence aggregation to improve performance and generalization. Through systematic architectural optimization, CoMViT achieves robust performance across twelve MedMNIST datasets while maintaining a lightweight design with only ~4.5M parameters. It matches or outperforms deeper CNN and ViT variants, offering up to 5-20x parameter reduction without sacrificing accuracy. Qualitative Grad-CAM analyses show that CoMViT consistently attends to clinically relevant regions despite its compact size. These results highlight the potential of principled ViT redesign for developing efficient and interpretable models in low-resource medical imaging settings.
comment: Preprint (submitted manuscript). Accepted at the MICCAI 2025 MIRASOL Workshop; to appear in the Springer proceedings volume. This is the pre-review version (not the Version of Record). DOI will be added after publication. [Optional: 8 pages, 4 figures, 4 tables.]
☆ Mitigating Semantic Collapse in Partially Relevant Video Retrieval NeurIPS 2025
Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all others as negatives, ignoring the rich semantic variation both within a single video and across different videos. Consequently, embeddings of both queries and their corresponding video-clip segments for distinct events within the same video collapse together, while embeddings of semantically similar queries and segments from different videos are driven apart. This limits retrieval performance when videos contain multiple, diverse events. This paper addresses the aforementioned problems, termed as semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships encoded by the foundation model across text queries. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. Subsequently, we introduce order-preserving token merging and adaptive CBVA to enhance alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.
comment: Accpeted to NeurIPS 2025. Code is available at https://github.com/admins97/MSC_PRVR
☆ Learning Soft Robotic Dynamics with Active Exploration
Soft robots offer unmatched adaptability and safety in unstructured environments, yet their compliant, high-dimensional, and nonlinear dynamics make modeling for control notoriously difficult. Existing data-driven approaches often fail to generalize, constrained by narrowly focused task demonstrations or inefficient random exploration. We introduce SoftAE, an uncertainty-aware active exploration framework that autonomously learns task-agnostic and generalizable dynamics models of soft robotic systems. SoftAE employs probabilistic ensemble models to estimate epistemic uncertainty and actively guides exploration toward underrepresented regions of the state-action space, achieving efficient coverage of diverse behaviors without task-specific supervision. We evaluate SoftAE on three simulated soft robotic platforms -- a continuum arm, an articulated fish in fluid, and a musculoskeletal leg with hybrid actuation -- and on a pneumatically actuated continuum soft arm in the real world. Compared with random exploration and task-specific model-based reinforcement learning, SoftAE produces more accurate dynamics models, enables superior zero-shot control on unseen tasks, and maintains robustness under sensing noise, actuation delays, and nonlinear material effects. These results demonstrate that uncertainty-driven active exploration can yield scalable, reusable dynamics models across diverse soft robotic morphologies, representing a step toward more autonomous, adaptable, and data-efficient control in compliant robots.
☆ Who Does Your Algorithm Fail? Investigating Age and Ethnic Bias in the MAMA-MIA Dataset NeurIPS
Deep learning models aim to improve diagnostic workflows, but fairness evaluation remains underexplored beyond classification, e.g., in image segmentation. Unaddressed segmentation bias can lead to disparities in the quality of care for certain populations, potentially compounded across clinical decision points and amplified through iterative model development. Here, we audit the fairness of the automated segmentation labels provided in the breast cancer tumor segmentation dataset MAMA-MIA. We evaluate automated segmentation quality across age, ethnicity, and data source. Our analysis reveals an intrinsic age-related bias against younger patients that continues to persist even after controlling for confounding factors, such as data source. We hypothesize that this bias may be linked to physiological factors, a known challenge for both radiologists and automated systems. Finally, we show how aggregating data from multiple data sources influences site-specific ethnic biases, underscoring the necessity of investigating data at a granular level.
comment: Medical Imaging Meets EurIPS (NeurIPS-endorsed workshop) - MedEurIPS
☆ DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains
Large Reasoning Models (LRMs) have demonstrated impressive capabilities but suffer from cognitive inefficiencies like ``overthinking'' simple problems and ``underthinking'' complex ones. While existing methods that use supervised fine-tuning~(SFT) or reinforcement learning~(RL) with token-length rewards can improve efficiency, they often do so at the cost of accuracy. This paper introduces \textbf{DeepCompress}, a novel framework that simultaneously enhances both the accuracy and efficiency of LRMs. We challenge the prevailing approach of consistently favoring shorter reasoning paths, showing that longer responses can contain a broader range of correct solutions for difficult problems. DeepCompress employs an adaptive length reward mechanism that dynamically classifies problems as ``Simple'' or ``Hard'' in real-time based on the model's evolving capability. It encourages shorter, more efficient reasoning for ``Simple'' problems while promoting longer, more exploratory thought chains for ``Hard'' problems. This dual-reward strategy enables the model to autonomously adjust its Chain-of-Thought (CoT) length, compressing reasoning for well-mastered problems and extending it for those it finds challenging. Experimental results on challenging mathematical benchmarks show that DeepCompress consistently outperforms baseline methods, achieving superior accuracy while significantly improving token efficiency.
comment: Work in progress
☆ Atlas-Alignment: Making Interpretability Transferable Across Language Models
Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires costly training of model-specific sparse autoencoders, manual or semi-automated labeling of SAE components, and their subsequent validation. We introduce Atlas-Alignment, a framework for transferring interpretability across language models by aligning unknown latent spaces to a Concept Atlas - a labeled, human-interpretable latent space - using only shared inputs and lightweight representational alignment techniques. Once aligned, this enables two key capabilities in previously opaque models: (1) semantic feature search and retrieval, and (2) steering generation along human-interpretable atlas concepts. Through quantitative and qualitative evaluations, we show that simple representational alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept data. Atlas-Alignment thus amortizes the cost of explainable AI and mechanistic interpretability: by investing in one high-quality Concept Atlas, we can make many new models transparent and controllable at minimal marginal cost.
☆ Dialogue as Discovery: Navigating Human Intent Through Principled Inquiry
A fundamental bottleneck in human-AI collaboration is the "intention expression gap," the difficulty for humans to effectively convey complex, high-dimensional thoughts to AI. This challenge often traps users in inefficient trial-and-error loops and is exacerbated by the diverse expertise levels of users. We reframe this problem from passive instruction following to a Socratic collaboration paradigm, proposing an agent that actively probes for information to resolve its uncertainty about user intent. we name the proposed agent Nous, trained to acquire proficiency in this inquiry policy. The core mechanism of Nous is a training framework grounded in the first principles of information theory. Within this framework, we define the information gain from dialogue as an intrinsic reward signal, which is fundamentally equivalent to the reduction of Shannon entropy over a structured task space. This reward design enables us to avoid reliance on costly human preference annotations or external reward models. To validate our framework, we develop an automated simulation pipeline to generate a large-scale, preference-based dataset for the challenging task of scientific diagram generation. Comprehensive experiments, including ablations, subjective and objective evaluations, and tests across user expertise levels, demonstrate the effectiveness of our proposed framework. Nous achieves leading efficiency and output quality, while remaining robust to varying user expertise. Moreover, its design is domain-agnostic, and we show evidence of generalization beyond diagram generation. Experimental results prove that our work offers a principled, scalable, and adaptive paradigm for resolving uncertainty about user intent in complex human-AI collaboration.
☆ FedMuon: Accelerating Federated Learning with Matrix Orthogonalization
The core bottleneck of Federated Learning (FL) lies in the communication rounds. That is, how to achieve more effective local updates is crucial for reducing communication rounds. Existing FL methods still primarily use element-wise local optimizers (Adam/SGD), neglecting the geometric structure of the weight matrices. This often leads to the amplification of pathological directions in the weights during local updates, leading deterioration in the condition number and slow convergence. Therefore, we introduce the Muon optimizer in local, which has matrix orthogonalization to optimize matrix-structured parameters. Experimental results show that, in IID setting, Local Muon significantly accelerates the convergence of FL and reduces communication rounds compared to Local SGD and Local AdamW. However, in non-IID setting, independent matrix orthogonalization based on the local distributions of each client induces strong client drift. Applying Muon in non-IID FL poses significant challenges: (1) client preconditioner leading to client drift; (2) moment reinitialization. To address these challenges, we propose a novel Federated Muon optimizer (FedMuon), which incorporates two key techniques: (1) momentum aggregation, where clients use the aggregated momentum for local initialization; (2) local-global alignment, where the local gradients are aligned with the global update direction to significantly reduce client drift. Theoretically, we prove that \texttt{FedMuon} achieves a linear speedup convergence rate without the heterogeneity assumption, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. Empirically, we validate the effectiveness of FedMuon on language and vision models. Compared to several baselines, FedMuon significantly reduces communication rounds and improves test accuracy.
☆ Balancing Knowledge Updates: Toward Unified Modular Editing in LLMs
Knowledge editing has emerged as an efficient approach for updating factual knowledge in large language models (LLMs). It typically locates knowledge storage modules and then modifies their parameters. However, most existing methods focus on the weights of multilayer perceptron (MLP) modules, which are often identified as the main repositories of factual information. Other components, such as attention (Attn) modules, are often ignored during editing. This imbalance can leave residual outdated knowledge and limit editing effectiveness. We perform comprehensive knowledge localization experiments on advanced LLMs and find that Attn modules play a substantial role in factual knowledge storage and retrieval, especially in earlier layers. Based on these insights, we propose IntAttn-Edit, a method that extends the associative memory paradigm to jointly update both MLP and Attn modules. Our approach uses a knowledge balancing strategy that allocates update magnitudes in proportion to each module's measured contribution to knowledge storage. Experiments on standard benchmarks show that IntAttn-Edit achieves higher edit success, better generalization, and stronger knowledge preservation than prior methods. Further analysis shows that the balancing strategy keeps editing performance within an optimal range across diverse settings.
☆ Realistic pedestrian-driver interaction modelling using multi-agent RL with human perceptual-motor constraints
Modelling pedestrian-driver interactions is critical for understanding human road user behaviour and developing safe autonomous vehicle systems. Existing approaches often rely on rule-based logic, game-theoretic models, or 'black-box' machine learning methods. However, these models typically lack flexibility or overlook the underlying mechanisms, such as sensory and motor constraints, which shape how pedestrians and drivers perceive and act in interactive scenarios. In this study, we propose a multi-agent reinforcement learning (RL) framework that integrates both visual and motor constraints of pedestrian and driver agents. Using a real-world dataset from an unsignalised pedestrian crossing, we evaluate four model variants, one without constraints, two with either motor or visual constraints, and one with both, across behavioural metrics of interaction realism. Results show that the combined model with both visual and motor constraints performs best. Motor constraints lead to smoother movements that resemble human speed adjustments during crossing interactions. The addition of visual constraints introduces perceptual uncertainty and field-of-view limitations, leading the agents to exhibit more cautious and variable behaviour, such as less abrupt deceleration. In this data-limited setting, our model outperforms a supervised behavioural cloning model, demonstrating that our approach can be effective without large training datasets. Finally, our framework accounts for individual differences by modelling parameters controlling the human constraints as population-level distributions, a perspective that has not been explored in previous work on pedestrian-vehicle interaction modelling. Overall, our work demonstrates that multi-agent RL with human constraints is a promising modelling approach for simulating realistic road user interactions.
☆ Spiking Neural Networks: The Future of Brain-Inspired Computing
Spiking Neural Networks (SNNs) represent the latest generation of neural computation, offering a brain-inspired alternative to conventional Artificial Neural Networks (ANNs). Unlike ANNs, which depend on continuous-valued signals, SNNs operate using distinct spike events, making them inherently more energy-efficient and temporally dynamic. This study presents a comprehensive analysis of SNN design models, training algorithms, and multi-dimensional performance metrics, including accuracy, energy consumption, latency, spike count, and convergence behavior. Key neuron models such as the Leaky Integrate-and-Fire (LIF) and training strategies, including surrogate gradient descent, ANN-to-SNN conversion, and Spike-Timing Dependent Plasticity (STDP), are examined in depth. Results show that surrogate gradient-trained SNNs closely approximate ANN accuracy (within 1-2%), with faster convergence by the 20th epoch and latency as low as 10 milliseconds. Converted SNNs also achieve competitive performance but require higher spike counts and longer simulation windows. STDP-based SNNs, though slower to converge, exhibit the lowest spike counts and energy consumption (as low as 5 millijoules per inference), making them optimal for unsupervised and low-power tasks. These findings reinforce the suitability of SNNs for energy-constrained, latency-sensitive, and adaptive applications such as robotics, neuromorphic vision, and edge AI systems. While promising, challenges persist in hardware standardization and scalable training. This study concludes that SNNs, with further refinement, are poised to propel the next phase of neuromorphic computing.
comment: 17 pages, 7 figures, 4 tables, Published with International Journal of Engineering Trends and Technology (IJETT)
☆ Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity
Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model's external `working memory', a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.
comment: Project page at https://ajmeek.github.io/cot_monitorability_website/
☆ Fine-Tuning Open Video Generators for Cinematic Scene Synthesis: A Small-Data Pipeline with LoRA and Wan2.1 I2V
We present a practical pipeline for fine-tuning open-source video diffusion transformers to synthesize cinematic scenes for television and film production from small datasets. The proposed two-stage process decouples visual style learning from motion generation. In the first stage, Low-Rank Adaptation (LoRA) modules are integrated into the cross-attention layers of the Wan2.1 I2V-14B model to adapt its visual representations using a compact dataset of short clips from Ay Yapim's historical television film El Turco. This enables efficient domain transfer within hours on a single GPU. In the second stage, the fine-tuned model produces stylistically consistent keyframes that preserve costume, lighting, and color grading, which are then temporally expanded into coherent 720p sequences through the model's video decoder. We further apply lightweight parallelization and sequence partitioning strategies to accelerate inference without quality degradation. Quantitative and qualitative evaluations using FVD, CLIP-SIM, and LPIPS metrics, supported by a small expert user study, demonstrate measurable improvements in cinematic fidelity and temporal stability over the base model. The complete training and inference pipeline is released to support reproducibility and adaptation across cinematic domains.
comment: video generation, image-to-video, dif- fusion transformer, LoRA, fine-tuning, cinematic scene synthesis, multi-GPU inference, fully sharded data parallelism, computational efficiency
☆ ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use
Recently, large language models (LLMs) have demonstrated remarkable problem-solving capabilities by autonomously integrating with external tools for collaborative reasoning. However, due to the inherently complex and diverse nature of multimodal information, enabling multimodal large language models (MLLMs) to flexibly and efficiently utilize external tools during reasoning remains an underexplored challenge. In this work, we introduce ToolScope, an agentic framework designed to unify global planning with local multimodal perception, adopting a specialized Perceive tool to mitigates visual context degradation in long-horizon VQA task. ToolScope comprises three primary components: the Global Navigator, the Agentic Executor, and the Response Synthesizer. The Global Navigator functions as a "telescope", offering high-level strategic guidance. The Agentic Executor operates iteratively to augment MLLM with local perception through the integration of external tools-Search, Code, and Perceive. Finally, the Response Synthesizer consolidates and organizes the reasoning process into a coherent, user-friendly output. We evaluate ToolScope on four VQA benchmarks across diverse domains, including VQA 2.0, ScienceQA, MAT-Search and MathVista. It demonstrates strong generalization capabilities, achieving an average performance improvement of up to +6.69% across all datasets.
☆ An In-depth Study of LLM Contributions to the Bin Packing Problem
Recent studies have suggested that Large Language Models (LLMs) could provide interesting ideas contributing to mathematical discovery. This claim was motivated by reports that LLM-based genetic algorithms produced heuristics offering new insights into the online bin packing problem under uniform and Weibull distributions. In this work, we reassess this claim through a detailed analysis of the heuristics produced by LLMs, examining both their behavior and interpretability. Despite being human-readable, these heuristics remain largely opaque even to domain experts. Building on this analysis, we propose a new class of algorithms tailored to these specific bin packing instances. The derived algorithms are significantly simpler, more efficient, more interpretable, and more generalizable, suggesting that the considered instances are themselves relatively simple. We then discuss the limitations of the claim regarding LLMs' contribution to this problem, which appears to rest on the mistaken assumption that the instances had previously been studied. Our findings instead emphasize the need for rigorous validation and contextualization when assessing the scientific value of LLM-generated outputs.
comment: 15 pages, 13 figures
☆ Discriminative Rule Learning for Outcome-Guided Process Model Discovery
Event logs extracted from information systems offer a rich foundation for understanding and improving business processes. In many real-world applications, it is possible to distinguish between desirable and undesirable process executions, where desirable traces reflect efficient or compliant behavior, and undesirable ones may involve inefficiencies, rule violations, delays, or resource waste. This distinction presents an opportunity to guide process discovery in a more outcome-aware manner. Discovering a single process model without considering outcomes can yield representations poorly suited for conformance checking and performance analysis, as they fail to capture critical behavioral differences. Moreover, prioritizing one behavior over the other may obscure structural distinctions vital for understanding process outcomes. By learning interpretable discriminative rules over control-flow features, we group traces with similar desirability profiles and apply process discovery separately within each group. This results in focused and interpretable models that reveal the drivers of both desirable and undesirable executions. The approach is implemented as a publicly available tool and it is evaluated on multiple real-life event logs, demonstrating its effectiveness in isolating and visualizing critical process patterns.
comment: The paper will be published as part of the CoopIS 2025 conference proceedings
☆ Reinforcement Learning for Long-Horizon Unordered Tasks: From Boolean to Coupled Reward Machines
Reward machines (RMs) inform reinforcement learning agents about the reward structure of the environment. This is particularly advantageous for complex non-Markovian tasks because agents with access to RMs can learn more efficiently from fewer samples. However, learning with RMs is ill-suited for long-horizon problems in which a set of subtasks can be executed in any order. In such cases, the amount of information to learn increases exponentially with the number of unordered subtasks. In this work, we address this limitation by introducing three generalisations of RMs: (1) Numeric RMs allow users to express complex tasks in a compact form. (2) In Agenda RMs, states are associated with an agenda that tracks the remaining subtasks to complete. (3) Coupled RMs have coupled states associated with each subtask in the agenda. Furthermore, we introduce a new compositional learning algorithm that leverages coupled RMs: Q-learning with coupled RMs (CoRM). Our experiments show that CoRM scales better than state-of-the-art RM algorithms for long-horizon problems with unordered subtasks.
☆ Generative Semantic Coding for Ultra-Low Bitrate Visual Communication and Analysis
We consider the problem of ultra-low bit rate visual communication for remote vision analysis, human interactions and control in challenging scenarios with very low communication bandwidth, such as deep space exploration, battlefield intelligence, and robot navigation in complex environments. In this paper, we ask the following important question: can we accurately reconstruct the visual scene using only a very small portion of the bit rate in existing coding methods while not sacrificing the accuracy of vision analysis and performance of human interactions? Existing text-to-image generation models offer a new approach for ultra-low bitrate image description. However, they can only achieve a semantic-level approximation of the visual scene, which is far insufficient for the purpose of visual communication and remote vision analysis and human interactions. To address this important issue, we propose to seamlessly integrate image generation with deep image compression, using joint text and coding latent to guide the rectified flow models for precise generation of the visual scene. The semantic text description and coding latent are both encoded and transmitted to the decoder at a very small bit rate. Experimental results demonstrate that our method can achieve the same image reconstruction quality and vision analysis accuracy as existing methods while using much less bandwidth. The code will be released upon paper acceptance.
☆ CASR-Net: An Image Processing-focused Deep Learning-based Coronary Artery Segmentation and Refinement Network for X-ray Coronary Angiogram
Early detection of coronary artery disease (CAD) is critical for reducing mortality and improving patient treatment planning. While angiographic image analysis from X-rays is a common and cost-effective method for identifying cardiac abnormalities, including stenotic coronary arteries, poor image quality can significantly impede clinical diagnosis. We present the Coronary Artery Segmentation and Refinement Network (CASR-Net), a three-stage pipeline comprising image preprocessing, segmentation, and refinement. A novel multichannel preprocessing strategy combining CLAHE and an improved Ben Graham method provides incremental gains, increasing Dice Score Coefficient (DSC) by 0.31-0.89% and Intersection over Union (IoU) by 0.40-1.16% compared with using the techniques individually. The core innovation is a segmentation network built on a UNet with a DenseNet121 encoder and a Self-organized Operational Neural Network (Self-ONN) based decoder, which preserves the continuity of narrow and stenotic vessel branches. A final contour refinement module further suppresses false positives. Evaluated with 5-fold cross-validation on a combination of two public datasets that contain both healthy and stenotic arteries, CASR-Net outperformed several state-of-the-art models, achieving an IoU of 61.43%, a DSC of 76.10%, and clDice of 79.36%. These results highlight a robust approach to automated coronary artery segmentation, offering a valuable tool to support clinicians in diagnosis and treatment planning.
☆ Un-Attributability: Computing Novelty From Retrieval & Semantic Similarity
Understanding how language-model outputs relate to the pretraining corpus is central to studying model behavior. Most training data attribution (TDA) methods ask which training examples causally influence a given output, often using leave-one-out tests. We invert the question: which outputs cannot be attributed to any pretraining example? We introduce un-attributability as an operational measure of semantic novelty: an output is novel if the pretraining corpus contains no semantically similar context. We approximate this with a simple two-stage retrieval pipeline: index the corpus with lightweight GIST embeddings, retrieve the top-n candidates, then rerank with ColBERTv2. If the nearest corpus item is less attributable than a human-generated text reference, we consider the output of the model as novel. We evaluate on SmolLM and SmolLM2 and report three findings: (1) models draw on pretraining data across much longer spans than previously reported; (2) some domains systematically promote or suppress novelty; and (3) instruction tuning not only alters style but also increases novelty. Reframing novelty assessment around un-attributability enables efficient analysis at pretraining scale. We release ~20 TB of corpus chunks and index artifacts to support replication and large-scale extension of our analysis at https://huggingface.co/datasets/stai-tuebingen/faiss-smollm
☆ Can LLMs Help You at Work? A Sandbox for Evaluating LLM Agents in Enterprise Environments EMNLP 2025
Enterprise systems are crucial for enhancing productivity and decision-making among employees and customers. Integrating LLM based systems into enterprise systems enables intelligent automation, personalized experiences, and efficient information retrieval, driving operational efficiency and strategic growth. However, developing and evaluating such systems is challenging due to the inherent complexity of enterprise environments, where data is fragmented across multiple sources and governed by sophisticated access controls. We present EnterpriseBench, a comprehensive benchmark that simulates enterprise settings, featuring 500 diverse tasks across software engineering, HR, finance, and administrative domains. Our benchmark uniquely captures key enterprise characteristics including data source fragmentation, access control hierarchies, and cross-functional workflows. Additionally, we provide a novel data generation pipeline that creates internally consistent enterprise tasks from organizational metadata. Experiments with state-of-the-art LLM agents demonstrate that even the most capable models achieve only 41.8% task completion, highlighting significant opportunities for improvement in enterprise-focused AI systems.
comment: Accepted at EMNLP 2025 Main Track
☆ HiF-DTA: Hierarchical Feature Learning Network for Drug-Target Affinity Prediction
Accurate prediction of Drug-Target Affinity (DTA) is crucial for reducing experimental costs and accelerating early screening in computational drug discovery. While sequence-based deep learning methods avoid reliance on costly 3D structures, they still overlook simultaneous modeling of global sequence semantic features and local topological structural features within drugs and proteins, and represent drugs as flat sequences without atomic-level, substructural-level, and molecular-level multi-scale features. We propose HiF-DTA, a hierarchical network that adopts a dual-pathway strategy to extract both global sequence semantic and local topological features from drug and protein sequences, and models drugs multi-scale to learn atomic, substructural, and molecular representations fused via a multi-scale bilinear attention module. Experiments on Davis, KIBA, and Metz datasets show HiF-DTA outperforms state-of-the-art baselines, with ablations confirming the importance of global-local extraction and multi-scale fusion.
comment: Accepted by International Conference on Bioinformatics and Biomedicine (BIBM 25)
☆ FOCUS: Efficient Keyframe Selection for Long Video Understanding
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.
☆ Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?
Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still suffer from a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have reduced this gap, its underlying causes remain largely unexplored. In this paper, we address this by showing that the multilingual reasoning gap largely stems from failures in language understanding-the model's inability to represent the multilingual input meaning into the dominant language (i.e., English) within its reasoning trace. This motivates us to examine whether understanding failures can be detected, as this ability could help mitigate the multilingual reasoning gap. To this end, we evaluate a range of detection methods and find that understanding failures can indeed be identified, with supervised approaches performing best. Building on this, we propose Selective Translation, a simple yet effective strategy that translates the multilingual input into English only when an understanding failure is detected. Experimental results show that Selective Translation bridges the multilingual reasoning gap, achieving near full-translation performance while using translation for only about 20% of inputs. Together, our work demonstrates that understanding failures are the primary cause of the multilingual reasoning gap and can be detected and selectively mitigated, providing key insight into its origin and a promising path toward more equitable multilingual reasoning. Our code and data are publicly available at https://github.com/deokhk/RLM_analysis.
☆ MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models
As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs' medical calculation abilities, comprising 700+ tasks across two types: equation-based (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this environment achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding. Code and datasets are available at https://github.com/maokangkun/MedCalc-Eval.
☆ Higher-order Linear Attention
The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project Page: https://github.com/yifanzhang-pro/HLA.
comment: Project Page: https://github.com/yifanzhang-pro/HLA
☆ Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
Instruction-tuned Large Language Models (LLMs) underperform on low resource, non-Latin scripts due to tokenizer fragmentation and weak cross-lingual coupling. We present LLINK (Latent Language Injection for Non-English Knowledge), a compute efficient language-as-modality method that conditions an instruction-tuned decoder without changing the tokenizer or retraining the decoder. First, we align sentence embeddings from a frozen multilingual encoder to the decoder's latent embedding space at a reserved position via a lightweight contrastive projector. Second, the vector is expanded into K soft slots and trained with minimal adapters so the frozen decoder consumes the signal. LLINK substantially improves bilingual retrieval and achieves 81.3% preference over the base model and 63.6% over direct fine-tuning in LLM-judged Q&A evaluations. We further find that improvements can be attributed to reduced tokenization inflation and a stronger cross lingual alignment, despite the model having residual weaknesses in numeric fidelity. Treating low resource languages as a modality offers a practical path to stronger cross-lingual alignment in lightweight LLMs.
comment: 14 pages, 3 Figures
☆ Not All Instances Are Equally Valuable: Towards Influence-Weighted Dataset Distillation
Dataset distillation condenses large datasets into synthetic subsets, achieving performance comparable to training on the full dataset while substantially reducing storage and computation costs. Most existing dataset distillation methods assume that all real instances contribute equally to the process. In practice, real-world datasets contain both informative and redundant or even harmful instances, and directly distilling the full dataset without considering data quality can degrade model performance. In this work, we present Influence-Weighted Distillation IWD, a principled framework that leverages influence functions to explicitly account for data quality in the distillation process. IWD assigns adaptive weights to each instance based on its estimated impact on the distillation objective, prioritizing beneficial data while downweighting less useful or harmful ones. Owing to its modular design, IWD can be seamlessly integrated into diverse dataset distillation frameworks. Our empirical results suggest that integrating IWD tends to improve the quality of distilled datasets and enhance model performance, with accuracy gains of up to 7.8%.
☆ Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication
Brain-to-speech (BTS) systems represent a groundbreaking approach to human communication by enabling the direct transformation of neural activity into linguistic expressions. While recent non-invasive BTS studies have largely focused on decoding predefined words or sentences, achieving open-vocabulary neural communication comparable to natural human interaction requires decoding unconstrained speech. Additionally, effectively integrating diverse signals derived from speech is crucial for developing personalized and adaptive neural communication and rehabilitation solutions for patients. This study investigates the potential of speech synthesis for previously unseen sentences across various speech modes by leveraging phoneme-level information extracted from high-density electroencephalography (EEG) signals, both independently and in conjunction with electromyography (EMG) signals. Furthermore, we examine the properties affecting phoneme decoding accuracy during sentence reconstruction and offer neurophysiological insights to further enhance EEG decoding for more effective neural communication solutions. Our findings underscore the feasibility of biosignal-based sentence-level speech synthesis for reconstructing unseen sentences, highlighting a significant step toward developing open-vocabulary neural communication systems adapted to diverse patient needs and conditions. Additionally, this study provides meaningful insights into the development of communication and rehabilitation solutions utilizing EEG-based decoding technologies.
comment: Accepted for publication in IEEE Transactions on Neural Systems and Rehabilitation Engineering
☆ Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT-a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M token context windows (with and without retrieval-augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5%-12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.
☆ Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes
Application modernization in legacy languages such as COBOL, PL/I, and REXX faces an acute shortage of resources, both in expert availability and in high-quality human evaluation data. While Large Language Models as a Judge (LaaJ) offer a scalable alternative to expert review, their reliability must be validated before being trusted in high-stakes workflows. Without principled validation, organizations risk a circular evaluation loop, where unverified LaaJs are used to assess model outputs, potentially reinforcing unreliable judgments and compromising downstream deployment decisions. Although various automated approaches to validating LaaJs have been proposed, alignment with human judgment remains a widely used and conceptually grounded validation strategy. In many real-world domains, the availability of human-labeled evaluation data is severely limited, making it difficult to assess how well a LaaJ aligns with human judgment. We introduce SparseAlign, a formal framework for assessing LaaJ alignment with sparse human-labeled data. SparseAlign combines a novel pairwise-confidence concept with a score-sensitive alignment metric that jointly capture ranking consistency and score proximity, enabling reliable evaluator selection even when traditional statistical methods are ineffective due to limited annotated examples. SparseAlign was applied internally to select LaaJs for COBOL code explanation. The top-aligned evaluators were integrated into assessment workflows, guiding model release decisions. We present a case study of four LaaJs to demonstrate SparseAlign's utility in real-world evaluation scenarios.
☆ DRAMA: Unifying Data Retrieval and Analysis for Open-Domain Analytic Queries SIGMOD 2026
Manually conducting real-world data analyses is labor-intensive and inefficient. Despite numerous attempts to automate data science workflows, none of the existing paradigms or systems fully demonstrate all three key capabilities required to support them effectively: (1) open-domain data collection, (2) structured data transformation, and (3) analytic reasoning. To overcome these limitations, we propose DRAMA, an end-to-end paradigm that answers users' analytic queries in natural language on large-scale open-domain data. DRAMA unifies data collection, transformation, and analysis as a single pipeline. To quantitatively evaluate system performance on tasks representative of DRAMA, we construct a benchmark, DRAMA-Bench, consisting of two categories of tasks: claim verification and question answering, each comprising 100 instances. These tasks are derived from real-world applications that have gained significant public attention and require the retrieval and analysis of open-domain data. We develop DRAMA-Bot, a multi-agent system designed following DRAMA. It comprises a data retriever that collects and transforms data by coordinating the execution of sub-agents, and a data analyzer that performs structured reasoning over the retrieved data. We evaluate DRAMA-Bot on DRAMA-Bench together with five state-of-the-art baseline agents. DRAMA-Bot achieves 86.5% task accuracy at a cost of $0.05, outperforming all baselines with up to 6.9 times the accuracy and less than 1/6 of the cost. DRAMA is publicly available at https://github.com/uiuc-kang-lab/drama.
comment: Accepted to SIGMOD 2026
☆ Soft Task-Aware Routing of Experts for Equivariant Representation Learning NeurIPS 2025
Equivariant representation learning aims to capture variations induced by input transformations in the representation space, whereas invariant representation learning encodes semantic information by disregarding such transformations. Recent studies have shown that jointly learning both types of representations is often beneficial for downstream tasks, typically by employing separate projection heads. However, this design overlooks information shared between invariant and equivariant learning, which leads to redundant feature learning and inefficient use of model capacity. To address this, we introduce Soft Task-Aware Routing (STAR), a routing strategy for projection heads that models them as experts. STAR induces the experts to specialize in capturing either shared or task-specific information, thereby reducing redundant feature learning. We validate this effect by observing lower canonical correlations between invariant and equivariant embeddings. Experimental results show consistent improvements across diverse transfer learning tasks. The code is available at https://github.com/YonseiML/star.
comment: NeurIPS 2025
☆ Privacy-Aware Continual Self-Supervised Learning on Multi-Window Chest Computed Tomography for Domain-Shift Robustness
We propose a novel continual self-supervised learning (CSSL) framework for simultaneously learning diverse features from multi-window-obtained chest computed tomography (CT) images and ensuring data privacy. Achieving a robust and highly generalizable model in medical image diagnosis is challenging, mainly because of issues, such as the scarcity of large-scale, accurately annotated datasets and domain shifts inherent to dynamic healthcare environments. Specifically, in chest CT, these domain shifts often arise from differences in window settings, which are optimized for distinct clinical purposes. Previous CSSL frameworks often mitigated domain shift by reusing past data, a typically impractical approach owing to privacy constraints. Our approach addresses these challenges by effectively capturing the relationship between previously learned knowledge and new information across different training stages through continual pretraining on unlabeled images. Specifically, by incorporating a latent replay-based mechanism into CSSL, our method mitigates catastrophic forgetting due to domain shifts during continual pretraining while ensuring data privacy. Additionally, we introduce a feature distillation technique that integrates Wasserstein distance-based knowledge distillation (WKD) and batch-knowledge ensemble (BKE), enhancing the ability of the model to learn meaningful, domain-shift-robust representations. Finally, we validate our approach using chest CT images obtained across two different window settings, demonstrating superior performance compared with other approaches.
☆ GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation NeurIPS 2025
While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization. The structured reasoning component generates coherent Chain-of-Thought analyses combining progress estimation and decision reasoning, which inform both immediate action predictions and compact history summaries for future steps. Based on this framework, we train a GUI agent, \textbf{GUI-Rise}, through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO). This framework employs specialized rewards, including a history-aware objective, directly linking summary quality to subsequent action performance. Comprehensive evaluations on standard benchmarks demonstrate state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios. These findings validate our framework's ability to maintain robust reasoning and generalization across diverse GUI navigation tasks. Code is available at https://leon022.github.io/GUI-Rise.
comment: Published in NeurIPS 2025
☆ Multi-Modal Feature Fusion for Spatial Morphology Analysis of Traditional Villages via Hierarchical Graph Neural Networks
Villages areas hold significant importance in the study of human-land relationships. However, with the advancement of urbanization, the gradual disappearance of spatial characteristics and the homogenization of landscapes have emerged as prominent issues. Existing studies primarily adopt a single-disciplinary perspective to analyze villages spatial morphology and its influencing factors, relying heavily on qualitative analysis methods. These efforts are often constrained by the lack of digital infrastructure and insufficient data. To address the current research limitations, this paper proposes a Hierarchical Graph Neural Network (HGNN) model that integrates multi-source data to conduct an in-depth analysis of villages spatial morphology. The framework includes two types of nodes-input nodes and communication nodes-and two types of edges-static input edges and dynamic communication edges. By combining Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), the proposed model efficiently integrates multimodal features under a two-stage feature update mechanism. Additionally, based on existing principles for classifying villages spatial morphology, the paper introduces a relational pooling mechanism and implements a joint training strategy across 17 subtypes. Experimental results demonstrate that this method achieves significant performance improvements over existing approaches in multimodal fusion and classification tasks. Additionally, the proposed joint optimization of all sub-types lifts mean accuracy/F1 from 0.71/0.83 (independent models) to 0.82/0.90, driven by a 6% gain for parcel tasks. Our method provides scientific evidence for exploring villages spatial patterns and generative logic.
☆ Feature-Function Curvature Analysis: A Geometric Framework for Explaining Differentiable Models
Explainable AI (XAI) is critical for building trust in complex machine learning models, yet mainstream attribution methods often provide an incomplete, static picture of a model's final state. By collapsing a feature's role into a single score, they are confounded by non-linearity and interactions. To address this, we introduce Feature-Function Curvature Analysis (FFCA), a novel framework that analyzes the geometry of a model's learned function. FFCA produces a 4-dimensional signature for each feature, quantifying its: (1) Impact, (2) Volatility, (3) Non-linearity, and (4) Interaction. Crucially, we extend this framework into Dynamic Archetype Analysis, which tracks the evolution of these signatures throughout the training process. This temporal view moves beyond explaining what a model learned to revealing how it learns. We provide the first direct, empirical evidence of hierarchical learning, showing that models consistently learn simple linear effects before complex interactions. Furthermore, this dynamic analysis provides novel, practical diagnostics for identifying insufficient model capacity and predicting the onset of overfitting. Our comprehensive experiments demonstrate that FFCA, through its static and dynamic components, provides the essential geometric context that transforms model explanation from simple quantification to a nuanced, trustworthy analysis of the entire learning process.
☆ Fints: Efficient Inference-Time Personalization for LLMs with Fine-Grained Instance-Tailored Steering
The rapid evolution of large language models (LLMs) has intensified the demand for effective personalization techniques that can adapt model behavior to individual user preferences. Despite the non-parametric methods utilizing the in-context learning ability of LLMs, recent parametric adaptation methods, including personalized parameter-efficient fine-tuning and reward modeling emerge. However, these methods face limitations in handling dynamic user patterns and high data sparsity scenarios, due to low adaptability and data efficiency. To address these challenges, we propose a fine-grained and instance-tailored steering framework that dynamically generates sample-level interference vectors from user data and injects them into the model's forward pass for personalized adaptation. Our approach introduces two key technical innovations: a fine-grained steering component that captures nuanced signals by hooking activations from attention and MLP layers, and an input-aware aggregation module that synthesizes these signals into contextually relevant enhancements. The method demonstrates high flexibility and data efficiency, excelling in fast-changing distribution and high data sparsity scenarios. In addition, the proposed method is orthogonal to existing methods and operates as a plug-in component compatible with different personalization techniques. Extensive experiments across diverse scenarios--including short-to-long text generation, and web function calling--validate the effectiveness and compatibility of our approach. Results show that our method significantly enhances personalization performance in fast-shifting environments while maintaining robustness across varying interaction modes and context lengths. Implementation is available at https://github.com/KounianhuaDu/Fints.
☆ MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models EMNLP 2025
The proliferation of memes on social media necessitates the capabilities of multimodal Large Language Models (mLLMs) to effectively understand multimodal harmfulness. Existing evaluation approaches predominantly focus on mLLMs' detection accuracy for binary classification tasks, which often fail to reflect the in-depth interpretive nuance of harmfulness across diverse contexts. In this paper, we propose MemeArena, an agent-based arena-style evaluation framework that provides a context-aware and unbiased assessment for mLLMs' understanding of multimodal harmfulness. Specifically, MemeArena simulates diverse interpretive contexts to formulate evaluation tasks that elicit perspective-specific analyses from mLLMs. By integrating varied viewpoints and reaching consensus among evaluators, it enables fair and unbiased comparisons of mLLMs' abilities to interpret multimodal harmfulness. Extensive experiments demonstrate that our framework effectively reduces the evaluation biases of judge agents, with judgment results closely aligning with human preferences, offering valuable insights into reliable and comprehensive mLLM evaluations in multimodal harmfulness understanding. Our code and data are publicly available at https://github.com/Lbotirx/MemeArena.
comment: EMNLP 2025
☆ From product to system network challenges in system of systems lifecycle management
Today, products are no longer isolated artifacts, but nodes in networked systems. This means that traditional, linearly conceived life cycle models are reaching their limits: Interoperability across disciplines, variant and configuration management, traceability, and governance across organizational boundaries are becoming key factors. This collective contribution classifies the state of the art and proposes a practical frame of reference for SoS lifecycle management, model-based systems engineering (MBSE) as the semantic backbone, product lifecycle management (PLM) as the governance and configuration level, CAD-CAE as model-derived domains, and digital thread and digital twin as continuous feedback. Based on current literature and industry experience, mobility, healthcare, and the public sector, we identify four principles: (1) referenced architecture and data models, (2) end-to-end configuration sovereignty instead of tool silos, (3) curated models with clear review gates, and (4) measurable value contributions along time, quality, cost, and sustainability. A three-step roadmap shows the transition from product- to network- centric development: piloting with reference architecture, scaling across variant and supply chain spaces, organizational anchoring (roles, training, compliance). The results are increased change robustness, shorter throughput times, improved reuse, and informed sustainability decisions. This article is aimed at decision-makers and practitioners who want to make complexity manageable and design SoS value streams to be scalable.
☆ Vectorized Online POMDP Planning ICRA 2026
Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for planning under partial observability problems, capturing the stochastic effects of actions and the limited information available through noisy observations. POMDP solving could benefit tremendously from massive parallelization of today's hardware, but parallelizing POMDP solvers has been challenging. They rely on interleaving numerical optimization over actions with the estimation of their values, which creates dependencies and synchronization bottlenecks between parallel processes that can quickly offset the benefits of parallelization. In this paper, we propose Vectorized Online POMDP Planner (VOPP), a novel parallel online solver that leverages a recent POMDP formulation that analytically solves part of the optimization component, leaving only the estimation of expectations for numerical computation. VOPP represents all data structures related to planning as a collection of tensors and implements all planning steps as fully vectorized computations over this representation. The result is a massively parallel solver with no dependencies and synchronization bottlenecks between parallel computations. Experimental results indicate that VOPP is at least 20X more efficient in computing near-optimal solutions compared to an existing state-of-the-art parallel online solver.
comment: 8 pages, 3 figures. Submitted to ICRA 2026
☆ Sparse Model Inversion: Efficient Inversion of Vision Transformers for Data-Free Applications
Model inversion, which aims to reconstruct the original training data from pre-trained discriminative models, is especially useful when the original training data is unavailable due to privacy, usage rights, or size constraints. However, existing dense inversion methods attempt to reconstruct the entire image area, making them extremely inefficient when inverting high-resolution images from large-scale Vision Transformers (ViTs). We further identify two underlying causes of this inefficiency: the redundant inversion of noisy backgrounds and the unintended inversion of spurious correlations--a phenomenon we term "hallucination" in model inversion. To address these limitations, we propose a novel sparse model inversion strategy, as a plug-and-play extension to speed up existing dense inversion methods with no need for modifying their original loss functions. Specifically, we selectively invert semantic foregrounds while stopping the inversion of noisy backgrounds and potential spurious correlations. Through both theoretical and empirical studies, we validate the efficacy of our approach in achieving significant inversion acceleration (up to 3.79 faster) while maintaining comparable or even enhanced downstream performance in data-free model quantization and data-free knowledge transfer. Code is available at https://github.com/Egg-Hu/SMI.
☆ Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization
Cross-view geo-localization (CVGL) between drone and satellite imagery remains challenging due to severe viewpoint gaps and the presence of hard negatives, which are visually similar but geographically mismatched samples. Existing mining or reweighting strategies often use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence. In this paper, we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy. At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. At the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a training-progress signal to attenuate noisy gradients during early optimization and progressively enhance hard-negative mining as training matures. Experiments on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness and robustness of the proposed DPHR, achieving consistent improvements over state-of-the-art methods.
comment: 5 pages, 3 figures
☆ Glia: A Human-Inspired AI for Automated Systems Design and Optimization
Can an AI autonomously design mechanisms for computer systems on par with the creativity and reasoning of human experts? We present Glia, an AI architecture for networked systems design that uses large language models (LLMs) in a human-inspired, multi-agent workflow. Each agent specializes in reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback. Unlike prior ML-for-systems methods that optimize black-box policies, Glia generates interpretable designs and exposes its reasoning process. When applied to a distributed GPU cluster for LLM inference, it produces new algorithms for request routing, scheduling, and auto-scaling that perform at human-expert levels in significantly less time, while yielding novel insights into workload behavior. Our results suggest that by combining reasoning LLMs with structured experimentation, an AI can produce creative and understandable designs for complex systems problems.
☆ FMint-SDE: A Multimodal Foundation Model for Accelerating Numerical Simulation of SDEs via Error Correction
Fast and accurate simulation of dynamical systems is a fundamental challenge across scientific and engineering domains. Traditional numerical integrators often face a trade-off between accuracy and computational efficiency, while existing neural network-based approaches typically require training a separate model for each case. To overcome these limitations, we introduce a novel multi-modal foundation model for large-scale simulations of differential equations: FMint-SDE (Foundation Model based on Initialization for stochastic differential equations). Based on a decoder-only transformer with in-context learning, FMint-SDE leverages numerical and textual modalities to learn a universal error-correction scheme. It is trained using prompted sequences of coarse solutions generated by conventional solvers, enabling broad generalization across diverse systems. We evaluate our models on a suite of challenging SDE benchmarks spanning applications in molecular dynamics, mechanical systems, finance, and biology. Experimental results show that our approach achieves a superior accuracy-efficiency tradeoff compared to classical solvers, underscoring the potential of FMint-SDE as a general-purpose simulation tool for dynamical systems.
☆ Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler
Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Existing defense strategies preemptively build robustness via attack simulation but suffer from fundamental limitations: (i) the infeasibility of extending attack simulations beyond bounded threat models due to the inherent difficulty of anticipating unknown attacks, and (ii) limited adaptability to varying attack settings, as simulation fails to capture their variability and complexity. To address these challenges, we propose Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense strategy with no need for attack simulation. BDS formulates harmful fine-tuning defense as a Bayesian inference problem, learning the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets. The fine-tuning process is then constrained by weighting data with their safety attributes sampled from the posterior, thus mitigating the influence of harmful data. By leveraging the post hoc nature of Bayesian inference, the posterior is conditioned on the fine-tuning dataset, enabling BDS to tailor its defense to the specific dataset, thereby achieving adaptive defense. Furthermore, we introduce a neural scheduler based on amortized Bayesian learning, enabling efficient transfer to new data without retraining. Comprehensive results across diverse attack and defense settings demonstrate the state-of-the-art performance of our approach. Code is available at https://github.com/Egg-Hu/Bayesian-Data-Scheduler.
☆ H2-Cache: A Novel Hierarchical Dual-Stage Cache for High-Performance Acceleration of Generative Diffusion Models
Diffusion models have emerged as state-of-the-art in image generation, but their practical deployment is hindered by the significant computational cost of their iterative denoising process. While existing caching techniques can accelerate inference, they often create a challenging trade-off between speed and fidelity, suffering from quality degradation and high computational overhead. To address these limitations, we introduce H2-Cache, a novel hierarchical caching mechanism designed for modern generative diffusion model architectures. Our method is founded on the key insight that the denoising process can be functionally separated into a structure-defining stage and a detail-refining stage. H2-cache leverages this by employing a dual-threshold system, using independent thresholds to selectively cache each stage. To ensure the efficiency of our dual-check approach, we introduce pooled feature summarization (PFS), a lightweight technique for robust and fast similarity estimation. Extensive experiments on the Flux architecture demonstrate that H2-cache achieves significant acceleration (up to 5.08x) while maintaining image quality nearly identical to the baseline, quantitatively and qualitatively outperforming existing caching methods. Our work presents a robust and practical solution that effectively resolves the speed-quality dilemma, significantly lowering the barrier for the real-world application of high-fidelity diffusion models. Source code is available at https://github.com/Bluear7878/H2-cache-A-Hierarchical-Dual-Stage-Cache.
☆ Generating Accurate and Detailed Captions for High-Resolution Images
Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution images to these dimensions may result in the loss of visual details and the omission of important objects. To address this limitation, we propose a novel pipeline that integrates vision-language models, large language models (LLMs), and object detection systems to enhance caption quality. Our proposed pipeline refines captions through a novel, multi-stage process. Given a high-resolution image, an initial caption is first generated using a VLM, and key objects in the image are then identified by an LLM. The LLM predicts additional objects likely to co-occur with the identified key objects, and these predictions are verified by object detection systems. Newly detected objects not mentioned in the initial caption undergo focused, region-specific captioning to ensure they are incorporated. This process enriches caption detail while reducing hallucinations by removing references to undetected objects. We evaluate the enhanced captions using pairwise comparison and quantitative scoring from large multimodal models, along with a benchmark for hallucination detection. Experiments on a curated dataset of high-resolution images demonstrate that our pipeline produces more detailed and reliable image captions while effectively minimizing hallucinations.
comment: Work conducted in 2024; released for archival purposes
☆ MARIA: A Framework for Marginal Risk Assessment without Ground Truth in AI Systems
Before deploying an AI system to replace an existing process, it must be compared with the incumbent to ensure improvement without added risk. Traditional evaluation relies on ground truth for both systems, but this is often unavailable due to delayed or unknowable outcomes, high costs, or incomplete data, especially for long-standing systems deemed safe by convention. The more practical solution is not to compute absolute risk but the difference between systems. We therefore propose a marginal risk assessment framework, that avoids dependence on ground truth or absolute risk. It emphasizes three kinds of relative evaluation methodology, including predictability, capability and interaction dominance. By shifting focus from absolute to relative evaluation, our approach equips software teams with actionable guidance: identifying where AI enhances outcomes, where it introduces new risks, and how to adopt such systems responsibly.
comment: 9 pages, 1 figure
☆ Exploring Landscapes for Better Minima along Valleys
Finding lower and better-generalizing minima is crucial for deep learning. However, most existing optimizers stop searching the parameter space once they reach a local minimum. Given the complex geometric properties of the loss landscape, it is difficult to guarantee that such a point is the lowest or provides the best generalization. To address this, we propose an adaptor "E" for gradient-based optimizers. The adapted optimizer tends to continue exploring along landscape valleys (areas with low and nearly identical losses) in order to search for potentially better local minima even after reaching a local minimum. This approach increases the likelihood of finding a lower and flatter local minimum, which is often associated with better generalization. We also provide a proof of convergence for the adapted optimizers in both convex and non-convex scenarios for completeness. Finally, we demonstrate their effectiveness in an important but notoriously difficult training scenario, large-batch training, where Lamb is the benchmark optimizer. Our testing results show that the adapted Lamb, ALTO, increases the test accuracy (generalization) of the current state-of-the-art optimizer by an average of 2.5% across a variety of large-batch training tasks. This work potentially opens a new research direction in the design of optimization algorithms.
comment: Neurips 2025 poster
☆ ZEBRA: Towards Zero-Shot Cross-Subject Generalization for Universal Brain Visual Decoding NeurIPS 2025
Recent advances in neural decoding have enabled the reconstruction of visual experiences from brain activity, positioning fMRI-to-image reconstruction as a promising bridge between neuroscience and computer vision. However, current methods predominantly rely on subject-specific models or require subject-specific fine-tuning, limiting their scalability and real-world applicability. In this work, we introduce ZEBRA, the first zero-shot brain visual decoding framework that eliminates the need for subject-specific adaptation. ZEBRA is built on the key insight that fMRI representations can be decomposed into subject-related and semantic-related components. By leveraging adversarial training, our method explicitly disentangles these components to isolate subject-invariant, semantic-specific representations. This disentanglement allows ZEBRA to generalize to unseen subjects without any additional fMRI data or retraining. Extensive experiments show that ZEBRA significantly outperforms zero-shot baselines and achieves performance comparable to fully finetuned models on several metrics. Our work represents a scalable and practical step toward universal neural decoding. Code and model weights are available at: https://github.com/xmed-lab/ZEBRA.
comment: Accepted by NeurIPS 2025
☆ AURA: A Reinforcement Learning Framework for AI-Driven Adaptive Conversational Surveys
Conventional online surveys provide limited personalization, often resulting in low engagement and superficial responses. Although AI survey chatbots improve convenience, most are still reactive: they rely on fixed dialogue trees or static prompt templates and therefore cannot adapt within a session to fit individual users, which leads to generic follow-ups and weak response quality. We address these limitations with AURA (Adaptive Understanding through Reinforcement Learning for Assessment), a reinforcement learning framework for AI-driven adaptive conversational surveys. AURA quantifies response quality using a four-dimensional LSDE metric (Length, Self-disclosure, Emotion, and Specificity) and selects follow-up question types via an epsilon-greedy policy that updates the expected quality gain within each session. Initialized with priors extracted from 96 prior campus-climate conversations (467 total chatbot-user exchanges), the system balances exploration and exploitation across 10-15 dialogue exchanges, dynamically adapting to individual participants in real time. In controlled evaluations, AURA achieved a +0.12 mean gain in response quality and a statistically significant improvement over non-adaptive baselines (p=0.044, d=0.66), driven by a 63% reduction in specification prompts and a 10x increase in validation behavior. These results demonstrate that reinforcement learning can give survey chatbots improved adaptivity, transforming static questionnaires into interactive, self-improving assessment systems.
☆ Expressive Range Characterization of Open Text-to-Audio Models AAAI
Text-to-audio models are a type of generative model that produces audio output in response to a given textual prompt. Although level generators and the properties of the functional content that they create (e.g., playability) dominate most discourse in procedurally generated content (PCG), games that emotionally resonate with players tend to weave together a range of creative and multimodal content (e.g., music, sounds, visuals, narrative tone), and multimodal models have begun seeing at least experimental use for this purpose. However, it remains unclear what exactly such models generate, and with what degree of variability and fidelity: audio is an extremely broad class of output for a generative system to target. Within the PCG community, expressive range analysis (ERA) has been used as a quantitative way to characterize generators' output space, especially for level generators. This paper adapts ERA to text-to-audio models, making the analysis tractable by looking at the expressive range of outputs for specific, fixed prompts. Experiments are conducted by prompting the models with several standardized prompts derived from the Environmental Sound Classification (ESC-50) dataset. The resulting audio is analyzed along key acoustic dimensions (e.g., pitch, loudness, and timbre). More broadly, this paper offers a framework for ERA-based exploratory evaluation of generative audio models.
comment: Accepted at the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE 2025)
☆ CombiGraph-Vis: A Curated Multimodal Olympiad Benchmark for Discrete Mathematical Reasoning
State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.
comment: Code/data: https://github.com/ref-grader/ref-grader, https://huggingface.co/datasets/combviz/inoi
☆ QiNN-QJ: A Quantum-inspired Neural Network with Quantum Jump for Multimodal Sentiment Analysis
Quantum theory provides non-classical principles, such as superposition and entanglement, that inspires promising paradigms in machine learning. However, most existing quantum-inspired fusion models rely solely on unitary or unitary-like transformations to generate quantum entanglement. While theoretically expressive, such approaches often suffer from training instability and limited generalizability. In this work, we propose a Quantum-inspired Neural Network with Quantum Jump (QiNN-QJ) for multimodal entanglement modelling. Each modality is firstly encoded as a quantum pure state, after which a differentiable module simulating the QJ operator transforms the separable product state into the entangled representation. By jointly learning Hamiltonian and Lindblad operators, QiNN-QJ generates controllable cross-modal entanglement among modalities with dissipative dynamics, where structured stochasticity and steady-state attractor properties serve to stabilize training and constrain entanglement shaping. The resulting entangled states are projected onto trainable measurement vectors to produce predictions. In addition to achieving superior performance over the state-of-the-art models on benchmark datasets, including CMU-MOSI, CMU-MOSEI, and CH-SIMS, QiNN-QJ facilitates enhanced post-hoc interpretability through von-Neumann entanglement entropy. This work establishes a principled framework for entangled multimodal fusion and paves the way for quantum-inspired approaches in modelling complex cross-modal correlations.
☆ Adapting Large Language Models to Emerging Cybersecurity using Retrieval Augmented Generation
Security applications are increasingly relying on large language models (LLMs) for cyber threat detection; however, their opaque reasoning often limits trust, particularly in decisions that require domain-specific cybersecurity knowledge. Because security threats evolve rapidly, LLMs must not only recall historical incidents but also adapt to emerging vulnerabilities and attack patterns. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in general LLM applications, but its potential for cybersecurity remains underexplored. In this work, we introduce a RAG-based framework designed to contextualize cybersecurity data and enhance LLM accuracy in knowledge retention and temporal reasoning. Using external datasets and the Llama-3-8B-Instruct model, we evaluate baseline RAG, an optimized hybrid retrieval approach, and conduct a comparative analysis across multiple performance metrics. Our findings highlight the promise of hybrid retrieval in strengthening the adaptability and reliability of LLMs for cybersecurity tasks.
☆ Towards a Measure of Algorithm Similarity
Given two algorithms for the same problem, can we determine whether they are meaningfully different? In full generality, the question is uncomputable, and empirically it is muddied by competing notions of similarity. Yet, in many applications (such as clone detection or program synthesis) a pragmatic and consistent similarity metric is necessary. We review existing equivalence and similarity notions and introduce EMOC: An Evaluation-Memory-Operations-Complexity framework that embeds algorithm implementations into a feature space suitable for downstream tasks. We compile PACD, a curated dataset of verified Python implementations across three problems, and show that EMOC features support clustering and classification of algorithm types, detection of near-duplicates, and quantification of diversity in LLM-generated programs. Code, data, and utilities for computing EMOC embeddings are released to facilitate reproducibility and future work on algorithm similarity.
comment: 11 pages, many figures and images
☆ Consistency Training Helps Stop Sycophancy and Jailbreaks
An LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within special text (jailbreaking). We explore \emph{consistency training}, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Instead of teaching the model what exact response to give on a particular prompt, we aim to teach the model to behave identically across prompt data augmentations (like adding leading questions or jailbreak text). We try enforcing this invariance in two ways: over the model's external outputs (\emph{Bias-augmented Consistency Training} (BCT) from Chua et al. [2025]) and over its internal activations (\emph{Activation Consistency Training} (ACT), a method we introduce). Both methods reduce Gemini 2.5 Flash's susceptibility to irrelevant cues. Because consistency training uses responses from the model itself as training data, it avoids issues that arise from stale training data, such as degrading model capabilities or enforcing outdated response guidelines. While BCT and ACT reduce sycophancy equally well, BCT does better at jailbreak reduction. We think that BCT can simplify training pipelines by removing reliance on static datasets. We argue that some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.
comment: 19 pages
♻ ☆ On the limitation of evaluating machine unlearning using only a single training seed
Machine unlearning (MU) aims to remove the influence of certain data points from a trained model without costly retraining. Most practical MU algorithms are only approximate and their performance can only be assessed empirically. Care must therefore be taken to make empirical comparisons as representative as possible. A common practice is to run the MU algorithm multiple times independently starting from the same trained model. In this work, we demonstrate that this practice can give highly non-representative results because -- even for the same architecture and same dataset -- some MU methods can be highly sensitive to the choice of random number seed used for model training. We therefore recommend that empirical comparisons of MU algorithms should also reflect the variability across different model training seeds.
comment: mini paper, 2 figures
♻ ☆ The End of Manual Decoding: Towards Truly End-to-End Language Models
The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious, hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set"-a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
♻ ☆ A Process Mining-Based System For The Analysis and Prediction of Software Development Workflows
CodeSight is an end-to-end system designed to anticipate deadline compliance in software development workflows. It captures development and deployment data directly from GitHub, transforming it into process mining logs for detailed analysis. From these logs, the system generates metrics and dashboards that provide actionable insights into PR activity patterns and workflow efficiency. Building on this structured representation, CodeSight employs an LSTM model that predicts remaining PR resolution times based on sequential activity traces and static features, enabling early identification of potential deadline breaches. In tests, the system demonstrates high precision and F1 scores in predicting deadline compliance, illustrating the value of integrating process mining with machine learning for proactive software project management.
comment: 16 pages, 7 figures, 4 tables
♻ ☆ Data Therapist: Eliciting Domain Knowledge from Subject Matter Experts Using Large Language Models
Effective data visualization requires not only technical proficiency but also a deep understanding of the domain-specific context in which data exists. This context often includes tacit knowledge about data provenance, quality, and intended use, which is rarely explicit in the dataset itself. Motivated by growing demands to surface tacit knowledge, we present the Data Therapist, a web-based system that helps domain experts externalize such implicit knowledge through a mixed-initiative process combining iterative Q&A with interactive annotation. Powered by a large language model, the system automatically analyzes user-supplied datasets, prompts users with targeted questions, and supports annotation at varying levels of granularity. The resulting structured knowledge base can inform both human and automated visualization design. A qualitative study with expert pairs from Accounting, Political Science, and Computer Security revealed recurring patterns in how expert reason about their data and highlighted opportunities for AI support to enhance visualization design.
♻ ☆ TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting
Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on challenging benchmarks. This paper presents TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, eliminating the need for windowing or summarization techniques while maintaining robust temporal state-tracking. Our comprehensive synthetic data pipeline unifies diverse generators, including stochastic differential equations, Gaussian processes, and audio synthesis, with novel augmentations. In zero-shot evaluations on the Gift-Eval benchmark, TempoPFN achieves top-tier competitive performance, outperforming all existing synthetic-only approaches and surpassing the vast majority of models trained on real-world data, while being more efficient than existing baselines by leveraging fully parallelizable training and inference. We open-source our complete data generation pipeline and training code, providing a reproducible foundation for future research.
comment: 30 pages, 18 figures, 13 tables
♻ ☆ Non-Convex Over-the-Air Heterogeneous Federated Learning: A Bias-Variance Trade-off
Over-the-air (OTA) federated learning (FL) has been well recognized as a scalable paradigm that exploits the waveform superposition of the wireless multiple-access channel to aggregate model updates in a single use. Existing OTA-FL designs largely enforce zero-bias model updates by either assuming \emph{homogeneous} wireless conditions (equal path loss across devices) or forcing zero-bias updates to guarantee convergence. Under \emph{heterogeneous} wireless scenarios, however, such designs are constrained by the weakest device and inflate the update variance. Moreover, prior analyses of biased OTA-FL largely address convex objectives, while most modern AI models are highly non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient descent (SGD) for general smooth non-convex objectives under wireless heterogeneity. We develop novel OTA-FL SGD updates that allow a structured, time-invariant model bias while facilitating reduced variance updates. We derive a finite-time stationarity bound (expected time average squared gradient norm) that explicitly reveals a bias-variance trade-off. To optimize this trade-off, we pose a non-convex joint OTA power-control design and develop an efficient successive convex approximation (SCA) algorithm that requires only statistical CSI at the base station. Experiments on a non-convex image classification task validate the approach: the SCA-based design accelerates convergence via an optimized bias and improves generalization over prior OTA-FL baselines.
♻ ☆ UdonCare: Hierarchy Pruning for Unseen Domain Discovery in Predictive Healthcare
Healthcare providers often divide patient populations into cohorts based on shared clinical factors, such as medical history, to deliver personalized healthcare services. This idea has also been adopted in clinical prediction models, where it presents a vital challenge: capturing both global and cohort-specific patterns while enabling model generalization to unseen domains. Addressing this challenge falls under the scope of domain generalization (DG). However, conventional DG approaches often struggle in clinical settings due to the absence of explicit domain labels and the inherent gap in medical knowledge. To address this, we propose UdonCare, a hierarchy-guided method that iteratively divides patients into latent domains and decomposes domain-invariant (label) information from patient data. Our method identifies patient domains by pruning medical ontologies (e.g. ICD-9-CM hierarchy). On two public datasets, MIMIC-III and MIMIC-IV, UdonCare shows superiority over eight baselines across four clinical prediction tasks with substantial domain gaps, highlighting the untapped potential of medical knowledge in guiding clinical domain generalization problems.
♻ ☆ I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models NeurIPS 2025
We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity, attribute range, and introducing perceptual uncertainty. Compared to LLMs, empirical results show that LRMs achieve improved productivity and systematicity on longer reasoning relations and wider attribute ranges, respectively. However, LRMs are still significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.
comment: Accepted at the 5th Workshop on Mathematical Reasoning and AI (MATH-AI), NeurIPS 2025
♻ ☆ SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks EMNLP 2025
Direct alignment algorithms have proven an effective step for aligning language models to human-desired behaviors. Current variants of the Direct Preference Optimization objective have focused on a strict setting where all tokens are contributing signals of KL divergence and rewards to the loss function. However, human preference is not affected equally by each word in a sequence but is often dependent on specific words or phrases, e.g. existence of toxic terms leads to non-preferred responses. Based on this observation, we argue that not all tokens should be weighted equally during PO and propose a flexible objective termed SparsePO, that aims to automatically learn to weight the KL divergence and reward corresponding to each token during PO training. We propose two different variants of weight-masks that can either be derived from the reference model itself or learned on the fly. Notably, our method induces sparsity in the learned masks, allowing the model to learn how to best balance reward and KL divergence contributions at the token level, learning an optimal level of mask sparsity. Extensive experiments illustrate the effectiveness of our approach at aligning to preference proxies, including sentiment control, helpfulness and harmlessness, and summary quality. Our method obtains +10% and +3% win rate points in summarization and dialogue scenarios, respectively, without compromising model reasoning or the relevancy and faithfulness of the summary response.
comment: 27 pages, 9 figures, 5 tables. Accepted to EMNLP 2025
♻ ☆ VRoPE: Rotary Position Embedding for Video Large Language Models EMNLP 2025
Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs), but extending it to video remains a challenge due to the intricate spatiotemporal structure of video frames. Existing adaptations, such as RoPE-3D, attempt to encode spatial and temporal dimensions separately but suffer from two major limitations: positional bias in attention distribution and disruptions in video-text transitions. To overcome these issues, we propose Video Rotary Position Embedding (VRoPE), a novel positional encoding method tailored for Video-LLMs. Specifically, we introduce a more balanced encoding strategy that mitigates attention biases, ensuring a more uniform distribution of spatial focus. Additionally, our approach restructures positional indices to ensure a smooth transition between video and text tokens. Extensive experiments on different models demonstrate that VRoPE consistently outperforms previous RoPE variants, achieving significant improvements in video understanding, temporal reasoning, and retrieval tasks. Code is available at https://github.com/johncaged/VRoPE.
comment: EMNLP 2025 Main Camera Ready
♻ ☆ A Survey of AI Scientists
Artificial intelligence is undergoing a profound transition from a computational instrument to an autonomous originator of scientific knowledge. This emerging paradigm, the AI scientist, is architected to emulate the complete scientific workflow-from initial hypothesis generation to the final synthesis of publishable findings-thereby promising to fundamentally reshape the pace and scale of discovery. However, the rapid and unstructured proliferation of these systems has created a fragmented research landscape, obscuring overarching methodological principles and developmental trends. This survey provides a systematic and comprehensive synthesis of this domain by introducing a unified, six-stage methodological framework that deconstructs the end-to-end scientific process into: Literature Review, Idea Generation, Experimental Preparation, Experimental Execution, Scientific Writing, and Paper Generation. Through this analytical lens, we chart the field's evolution from early Foundational Modules (2022-2023) to integrated Closed-Loop Systems (2024), and finally to the current frontier of Scalability, Impact, and Human-AI Collaboration (2025-present). By rigorously synthesizing these developments, this survey not only clarifies the current state of autonomous science but also provides a critical roadmap for overcoming remaining challenges in robustness and governance, ultimately guiding the next generation of systems toward becoming trustworthy and indispensable partners in human scientific inquiry.
comment: 28 pages, 9 figures, 1 table
♻ ☆ A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models
Automatically annotating job data with standardized occupations from taxonomies, known as occupation classification, is crucial for labor market analysis. However, this task is often hindered by data scarcity and the challenges of manual annotations. While large language models (LLMs) hold promise due to their extensive world knowledge and in-context learning capabilities, their effectiveness depends on their knowledge of occupational taxonomies, which remains unclear. In this study, we assess the ability of LLMs to generate precise taxonomic entities from taxonomy, highlighting their limitations, especially for smaller models. To address these challenges, we propose a multi-stage framework consisting of inference, retrieval, and reranking stages, which integrates taxonomy-guided reasoning examples to enhance performance by aligning outputs with taxonomic knowledge. Evaluations on a large-scale dataset show that our framework not only enhances occupation and skill classification tasks, but also provides a cost-effective alternative to frontier models like GPT-4o, significantly reducing computational costs while maintaining strong performance. This makes it a practical and scalable solution for occupation classification and related tasks across LLMs.
comment: Accepted to ICWSM'26
♻ ☆ Generative AI and Firm Productivity: Field Experiments in Online Retail
We quantify the impact of Generative Artificial Intelligence (GenAI) on firm productivity through a series of large-scale randomized field experiments involving millions of users and products at a leading cross-border online retail platform. Over six months in 2023-2024, GenAI-based enhancements were integrated into seven consumer-facing business workflows. We find that GenAI adoption significantly increases sales, with treatment effects ranging from $0\%$ to $16.3\%$, depending on GenAI's marginal contribution relative to existing firm practices. Because inputs and prices were held constant across experimental arms, these gains map directly into total factor productivity improvements. Across the four GenAI applications with positive effects, the implied annual incremental value is approximately $\$ 5$ per consumer-an economically meaningful impact given the retailer's scale and the early stage of GenAI adoption. The primary mechanism operates through higher conversion rates, consistent with GenAI reducing frictions in the marketplace and improving consumer experience. We also document substantial heterogeneity: smaller and newer sellers, as well as less experienced consumers, exhibit disproportionately larger gains. Our findings provide novel, large-scale causal evidence on the productivity effects of GenAI in online retail, highlighting both its immediate value and broader potential.
comment: Keywords: Field Experiments, Generative AI, Productivity, Retail Platforms, Consumer Experience. JEL codes: C93, D24, L81, M31, O3
♻ ☆ PoLAR: Polar-Decomposed Low-Rank Adapter Representation
We show that low-rank adaptation of large-scale models suffers from a low stable rank that is well below the linear algebraic rank of the subspace, degrading fine-tuning performance. To mitigate the underutilization of the allocated subspace, we propose PoLAR, a parameterization inspired by the polar decomposition that factorizes the low-rank update into two direction matrices constrained to Stiefel manifolds and an unconstrained scale matrix. Our theory shows that PoLAR yields an exponentially faster convergence rate on a canonical low-rank adaptation problem. Pairing the parameterization with Riemannian optimization leads to consistent gains on three different benchmarks testing general language understanding, commonsense reasoning, and mathematical problem solving with base model sizes ranging from 350M to 27B.
♻ ☆ NaviAgent: Bilevel Planning on Tool Navigation Graph for Large-Scale Orchestration
Large language models (LLMs) have recently demonstrated the ability to act as function call agents by invoking external tools, enabling them to solve tasks beyond their static knowledge. However, existing agents typically call tools step by step at a time without a global view of task structure. As tools depend on each other, this leads to error accumulation and limited scalability, particularly when scaling to thousands of tools. To address these limitations, we propose NaviAgent, a novel bilevel architecture that decouples task planning from tool execution through graph-based modeling of the tool ecosystem. At the task-planning level, the LLM-based agent decides whether to respond directly, clarify user intent, invoke a toolchain, or execute tool outputs, ensuring broad coverage of interaction scenarios independent of inter-tool complexity. At the execution level, a continuously evolving Tool World Navigation Model (TWNM) encodes structural and behavioral relations among tools, guiding the agent to generate scalable and robust invocation sequences. By incorporating feedback from real tool interactions, NaviAgent supports closed-loop optimization of planning and execution, moving beyond tool calling toward adaptive navigation of large-scale tool ecosystems. Experiments show that NaviAgent achieves the best task success rates across models and tasks, and integrating TWMN further boosts performance by up to 17 points on complex tasks, underscoring its key role in toolchain orchestration.
♻ ☆ Emergent Cognitive Convergence via Implementation: A Structured Loop Reflecting Four Theories of Mind
We report a structural convergence among four influential theories of mind: Kahneman's dual-system theory, Friston's predictive processing, Minsky's society of mind, and Clark's extended mind, emerging unintentionally within a practical AI architecture known as Agentic Flow. Designed to address the limitations of large language models (LLMs), Agentic Flow comprises five interlocking modules: Retrieval, Cognition, Control, Memory, and Action, organized into a repeatable cognitive loop. Although originally inspired only by Minsky and Clark, subsequent analysis revealed that its structure echoes computational motifs from all four theories, suggesting that theoretical convergence can emerge naturally from implementation demands rather than deliberate synthesis. Controlled evaluations confirmed this: the structured agent achieved 95.8% task success versus 62.3% for baseline LLMs, demonstrating robust constraint adherence and reproducible reasoning. We describe this convergence under a broader descriptive meta-architecture called PEACE, highlighting recurring design patterns such as predictive modeling, associative recall, and error-sensitive control. Later formalized as the Structured Cognitive Loop (SCL), this framework generalizes the same principles as a foundation for behavioral intelligence in LLM-based agents. Rather than claiming theoretical unification, this paper proposes that intelligent architectures may evolve toward shared structural patterns shaped by practical constraints. As a position paper, it aims to frame this convergence as an interpretive reflection rather than a finalized theory, inviting further theoretical and experimental dialogue. Agentic Flow, or equivalently the Structured Cognitive Loop, thus offers a glimpse of how a unified cognitive form can arise not from abstraction, but from the necessities of real-world reasoning.
comment: This version relocates the "Position Paper" designation from the title to the abstract and adds a citation to the related follow-up study Structured Cognition for Behavioral Intelligence in Large Language Model Agents (Kim, 2025), also available on arXiv
♻ ☆ Decoding Latent Attack Surfaces in LLMs: Prompt Injection via HTML in Web Summarization
Large Language Models (LLMs) are increasingly integrated into web-based systems for content summarization, yet their susceptibility to prompt injection attacks remains a pressing concern. In this study, we explore how non-visible HTML elements such as , aria-label, and alt attributes can be exploited to embed adversarial instructions without altering the visible content of a webpage. We introduce a novel dataset comprising 280 static web pages, evenly divided between clean and adversarial injected versions, crafted using diverse HTML-based strategies. These pages are processed through a browser automation pipeline to extract both raw HTML and rendered text, closely mimicking real-world LLM deployment scenarios. We evaluate two state-of-the-art open-source models, Llama 4 Scout (Meta) and Gemma 9B IT (Google), on their ability to summarize this content. Using both lexical (ROUGE-L) and semantic (SBERT cosine similarity) metrics, along with manual annotations, we assess the impact of these covert injections. Our findings reveal that over 29% of injected samples led to noticeable changes in the Llama 4 Scout summaries, while Gemma 9B IT showed a lower, yet non-trivial, success rate of 15%. These results highlight a critical and largely overlooked vulnerability in LLM driven web pipelines, where hidden adversarial content can subtly manipulate model outputs. Our work offers a reproducible framework and benchmark for evaluating HTML-based prompt injection and underscores the urgent need for robust mitigation strategies in LLM applications involving web content.
♻ ☆ Fast Adversarial Training against Sparse Attacks Requires Loss Smoothing
This paper studies fast adversarial training against sparse adversarial perturbations bounded by $l_0$ norm. We demonstrate the challenges of employing $1$-step attacks on $l_0$ bounded perturbations for fast adversarial training, including degraded performance and the occurrence of catastrophic overfitting (CO). We highlight that CO in $l_0$ adversarial training is caused by sub-optimal perturbation locations of $1$-step attack. Theoretical and empirical analyses reveal that the loss landscape of $l_0$ adversarial training is more craggy compared to its $l_\infty$, $l_2$ and $l_1$ counterparts. Moreover, we corroborate that the craggy loss landscape can aggravate CO. To address these issues, we propose Fast-LS-$l_0$ that incorporates soft labels and the trade-off loss function to smooth the adversarial loss landscape. Extensive experiments demonstrate our method can overcome the challenge of catastrophic overfitting, achieve state-of-the-art performance, and narrow down the performance gap between $1$-step and multi-step adversarial training against sparse attacks.
♻ ☆ Beyond Pixels: Exploring DOM Downsampling for LLM-Based Web Agents
Frontier LLMs only recently enabled serviceable, autonomous web agents. At that, a model poses as an instantaneous domain model backend. Ought to suggest interaction, it is consulted with a web-based task and respective application state. The key problem lies in application state serialisation - referred to as snapshot. State-of-the-art web agents are premised on grounded GUI snapshots, i.e., screenshots enhanced with visual cues. Not least to resemble human perception, but for images representing relatively cheap means of model input. LLM vision still lag behind code interpretation capabilities. DOM snapshots, which structurally resemble HTML, impose a desired alternative. Vast model input token size, however, disables reliable implementation with web agents to date. We propose D2Snap, a first-of-its-kind DOM downsampling algorithm. Based on a GPT-4o backend, we evaluate D2Snap on tasks sampled from the Online-Mind2Web dataset. The success rate of D2Snap-downsampled DOM snapshots (67%) matches a grounded GUI snapshot baseline (65%) - within the same input token order of magnitude (1e3). Our best evaluated configurations - one token order above, but within the model's context window - outperform this baseline by 8%. Our evaluation, moreover, yields that DOM-inherent hierarchy embodies a strong UI feature for LLMs.
comment: 20 pages, LaTeX; repository URL updated, typos corrected
♻ ☆ Fair Play for Individuals, Foul Play for Groups? Auditing Anonymization's Impact on ML Fairness ECAI 2025
Machine learning (ML) algorithms are heavily based on the availability of training data, which, depending on the domain, often includes sensitive information about data providers. This raises critical privacy concerns. Anonymization techniques have emerged as a practical solution to address these issues by generalizing features or suppressing data to make it more difficult to accurately identify individuals. Although recent studies have shown that privacy-enhancing technologies can influence ML predictions across different subgroups, thus affecting fair decision-making, the specific effects of anonymization techniques, such as $k$-anonymity, $\ell$-diversity, and $t$-closeness, on ML fairness remain largely unexplored. In this work, we systematically audit the impact of anonymization techniques on ML fairness, evaluating both individual and group fairness. Our quantitative study reveals that anonymization can degrade group fairness metrics by up to fourfold. Conversely, similarity-based individual fairness metrics tend to improve under stronger anonymization, largely as a result of increased input homogeneity. By analyzing varying levels of anonymization across diverse privacy settings and data distributions, this study provides critical insights into the trade-offs between privacy, fairness, and utility, offering actionable guidelines for responsible AI development. Our code is publicly available at: https://github.com/hharcolezi/anonymity-impact-fairness.
comment: ECAI 2025
♻ ☆ GenSwarm: Scalable Multi-Robot Code-Policy Generation and Deployment via Language Models
The development of control policies for multi-robot systems traditionally follows a complex and labor-intensive process, often lacking the flexibility to adapt to dynamic tasks. This has motivated research on methods to automatically create control policies. However, these methods require iterative processes of manually crafting and refining objective functions, thereby prolonging the development cycle. This work introduces \textit{GenSwarm}, an end-to-end system that leverages large language models to automatically generate and deploy control policies for multi-robot tasks based on simple user instructions in natural language. As a multi-language-agent system, GenSwarm achieves zero-shot learning, enabling rapid adaptation to altered or unseen tasks. The white-box nature of the code policies ensures strong reproducibility and interpretability. With its scalable software and hardware architectures, GenSwarm supports efficient policy deployment on both simulated and real-world multi-robot systems, realizing an instruction-to-execution end-to-end functionality that could prove valuable for robotics specialists and non-specialists alike.The code of the proposed GenSwarm system is available online: https://github.com/WindyLab/GenSwarm.
comment: This article has been accepted for publication in npj Robotics
♻ ☆ DualOptim: Enhancing Efficacy and Stability in Machine Unlearning with Dual Optimizers
Existing machine unlearning (MU) approaches exhibit significant sensitivity to hyperparameters, requiring meticulous tuning that limits practical deployment. In this work, we first empirically demonstrate the instability and suboptimal performance of existing popular MU methods when deployed in different scenarios. To address this issue, we propose Dual Optimizer (DualOptim), which incorporates adaptive learning rate and decoupled momentum factors. Empirical and theoretical evidence demonstrates that DualOptim contributes to effective and stable unlearning. Through extensive experiments, we show that DualOptim can significantly boost MU efficacy and stability across diverse tasks, including image classification, image generation, and large language models, making it a versatile approach to empower existing MU algorithms.
♻ ☆ Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning
Recent studies have shown that reinforcement learning with verifiable rewards (RLVR) enhances overall accuracy (pass@1) but often fails to improve capability (pass@k) of LLMs in reasoning tasks, while distillation can improve both. In this paper, we investigate the mechanisms behind these phenomena. First, we demonstrate that RLVR struggles to improve capability as it focuses on improving the accuracy of the easier questions to the detriment of the accuracy of the most difficult questions. Second, we show that RLVR does not merely increase the success probability for the easier questions, but in our small model settings, produces quality responses that were absent in its original output distribution. In addition, we show these responses are neither noticeably longer nor feature more reflection-related keywords, underscoring the need for more reliable indicators of response quality. Third, from the experiment distilling teacher responses to in-distribution problems, we find that capability does not always improve with distillation. We conjecture that capability improves only when new knowledge is introduced, whereas distilling reasoning patterns only improves accuracy but not capability, sacrificing performance on the most difficult questions, similar to RLVR. Together, these findings offer a clearer understanding of how RLVR and distillation shape reasoning behavior in LLMs
comment: 25 pages
♻ ☆ DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO NeurIPS 2025
Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement algorithm with group-normalized rewards. However, the effectiveness of GRPO in Video Large Language Models (VideoLLMs) has still been less studyed. In this paper, we explore GRPO and identify two problems that deteriorate the effective learning: (1) reliance on safeguards, and (2) vanishing advantage. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO loss function into a regression task that directly predicts the advantage in GRPO, eliminating the need for safeguards such as the clipping and min functions. It directly aligns the model with advantages, providing guidance to prefer better ones. The difficulty-aware data augmentation strategy augments input prompts/videos to locate the difficulty of samples at solvable difficulty levels, enabling diverse reward signals. Our experimental results show that our approach significantly improves video reasoning performance across multiple benchmarks.
comment: NeurIPS 2025
♻ ☆ More of the Same: Persistent Representational Harms Under Increased Representation NeurIPS
To recognize and mitigate the harms of generative AI systems, it is crucial to consider whether and how different societal groups are represented by these systems. A critical gap emerges when naively measuring or improving who is represented, as this does not consider how people are represented. In this work, we develop GAS(P), an evaluation methodology for surfacing distribution-level group representational biases in generated text, tackling the setting where groups are unprompted (i.e., groups are not specified in the input to generative systems). We apply this novel methodology to investigate gendered representations in occupations across state-of-the-art large language models. We show that, even though the gender distribution when models are prompted to generate biographies leads to a large representation of women, even representational biases persist in how different genders are represented. Our evaluation methodology reveals that there are statistically significant distribution-level differences in the word choice used to describe biographies and personas of different genders across occupations, and we show that many of these differences are associated with representational harms and stereotypes. Our empirical findings caution that naively increasing (unprompted) representation may inadvertently proliferate representational biases, and our proposed evaluation methodology enables systematic and rigorous measurement of the problem.
comment: Proceedings of the Neural Information Processing Systems (NeurIPS) 2025; 39 pages, 7 figures, 15 tables
♻ ☆ LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring AACL 2025
Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat is sandbagging - the strategic underperformance on evaluations by AI models or their developers. A promising defense is to monitor a model's chain-of-thought (CoT) reasoning, as this could reveal its intentions and plans. In this work, we measure the ability of models to sandbag on dangerous capability evaluations against a CoT monitor by prompting them to sandbag while being either monitor-oblivious or monitor-aware. We show that both frontier models and small open-sourced models can covertly sandbag against CoT monitoring 0-shot without hints. However, they cannot yet do so reliably: they bypass the monitor 16-36% of the time when monitor-aware, conditioned on sandbagging successfully. We qualitatively analyzed the uncaught CoTs to understand why the monitor failed. We reveal a rich attack surface for CoT monitoring and contribute five covert sandbagging policies generated by models. These results inform potential failure modes of CoT monitoring and may help build more diverse sandbagging model organisms.
comment: Accepted to IJCNLP-AACL 2025 (Main Conference)
♻ ☆ Uncertainty-Based Smooth Policy Regularisation for Reinforcement Learning with Few Demonstrations
In reinforcement learning with sparse rewards, demonstrations can accelerate learning, but determining when to imitate them remains challenging. We propose Smooth Policy Regularisation from Demonstrations (SPReD), a framework that addresses the fundamental question: when should an agent imitate a demonstration versus follow its own policy? SPReD uses ensemble methods to explicitly model Q-value distributions for both demonstration and policy actions, quantifying uncertainty for comparisons. We develop two complementary uncertainty-aware methods: a probabilistic approach estimating the likelihood of demonstration superiority, and an advantage-based approach scaling imitation by statistical significance. Unlike prevailing methods (e.g. Q-filter) that make binary imitation decisions, SPReD applies continuous, uncertainty-proportional regularisation weights, reducing gradient variance during training. Despite its computational simplicity, SPReD achieves remarkable gains in experiments across eight robotics tasks, outperforming existing approaches by up to a factor of 14 in complex tasks while maintaining robustness to demonstration quality and quantity. Our code is available at https://github.com/YujieZhu7/SPReD.
♻ ☆ Towards User-Focused Research in Training Data Attribution for Human-Centered Explainable AI
Explainable AI (XAI) aims to make AI systems more transparent, yet many practices emphasise mathematical rigour over practical user needs. We propose an alternative to this model-centric approach by following a design thinking process for the emerging XAI field of training data attribution (TDA), which risks repeating solutionist patterns seen in other subfields. However, because TDA is in its early stages, there is a valuable opportunity to shape its direction through user-centred practices. We engage directly with machine learning developers via a needfinding interview study (N=6) and a scenario-based interactive user study (N=31) to ground explanations in real workflows. Our exploration of the TDA design space reveals novel tasks for data-centric explanations useful to developers, such as grouping training samples behind specific model behaviours or identifying undersampled data. We invite the TDA, XAI, and HCI communities to engage with these tasks to strengthen their research's practical relevance and human impact.
♻ ☆ SC-LoRA: Balancing Efficient Fine-tuning and Knowledge Preservation via Subspace-Constrained LoRA
Parameter-Efficient Fine-Tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), are indispensable for efficiently customizing Large Language Models (LLMs). However, vanilla LoRA suffers from slow convergence speed and knowledge forgetting problems. Recent studies have leveraged the power of designed LoRA initialization, to enhance the fine-tuning efficiency, or to preserve knowledge in the pre-trained LLM. However, none of these works can address the two cases at the same time. To this end, we introduce Subspace-Constrained LoRA (SC-LoRA), a novel LoRA initialization framework engineered to navigate the trade-off between efficient fine-tuning and knowledge preservation. We achieve this by constraining the output of trainable LoRA adapters in a low-rank subspace, where the context information of fine-tuning data is most preserved while the context information of preserved knowledge is least retained, in a balanced way. Such constraint enables the trainable weights to primarily focus on the main features of fine-tuning data while avoiding damaging the preserved knowledge features. We provide theoretical analysis on our method, and conduct extensive experiments including safety preservation and world knowledge preservation, on various downstream tasks. In our experiments, SC-LoRA succeeds in delivering superior fine-tuning performance while markedly diminishing knowledge forgetting, surpassing contemporary LoRA initialization methods.
♻ ☆ Rethinking Metrics and Benchmarks of Video Anomaly Detection
Video Anomaly Detection (VAD), which aims to detect anomalies that deviate from expectation, has attracted increasing attention in recent years. Existing advancements in VAD primarily focus on model architectures and training strategies, while devoting insufficient attention to evaluation metrics and benchmarks. In this paper, we rethink VAD evaluation methods through comprehensive analyses, revealing three critical limitations in current practices: 1) existing metrics are significantly influenced by single annotation bias; 2) current metrics fail to reward early detection of anomalies; 3) available benchmarks lack the capability to evaluate scene overfitting of fully/weakly-supervised algorithms. To address these limitations, we propose three novel evaluation methods: first, we establish probabilistic AUC/AP (Prob-AUC/AP) metrics utlizing multi-round annotations to mitigate single annotation bias; second, we develop a Latency-aware Average Precision (LaAP) metric that rewards early and accurate anomaly detection; and finally, we introduce two hard normal benchmarks (UCF-HN, MSAD-HN) with videos specifically designed to evaluate scene overfitting. We report performance comparisons of ten state-of-the-art VAD approaches using our proposed evaluation methods, providing novel perspectives for future VAD model development. We release our data and code in https://github.com/Kamino666/RethinkingVAD.
♻ ☆ Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates
Clinical trials are a systematic endeavor to assess the safety and efficacy of new drugs or treatments. Conducting such trials typically demands significant financial investment and meticulous planning, highlighting the need for accurate predictions of trial outcomes. Accurately predicting patient enrollment, a key factor in trial success, is one of the primary challenges during the planning phase. In this work, we propose a novel deep learning-based method to address this critical challenge. Our method, implemented as a neural network model, leverages pre-trained language models (PLMs) to capture the complexities and nuances of clinical documents, transforming them into expressive representations. These representations are then combined with encoded tabular features via an attention mechanism. To account for uncertainties in enrollment prediction, we enhance the model with a probabilistic layer based on the Gamma distribution, which enables range estimation. We apply the proposed model to predict clinical trial duration, assuming site-level enrollment follows a Poisson-Gamma process. We carry out extensive experiments on real-world clinical trial data, and show that the proposed method can effectively predict the number of patients enrolled at a number of sites for a given clinical trial, outperforming established baseline models.
♻ ☆ On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning Implementations
Deep Reinforcement Learning (DRL) is a paradigm of artificial intelligence where an agent uses a neural network to learn which actions to take in a given environment. DRL has recently gained traction from being able to solve complex environments like driving simulators, 3D robotic control, and multiplayer-online-battle-arena video games. Numerous implementations of the state-of-the-art algorithms responsible for training these agents, like the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms, currently exist. However, studies make the mistake of assuming implementations of the same algorithm to be consistent and thus, interchangeable. In this paper, through a differential testing lens, we present the results of studying the extent of implementation inconsistencies, their effect on the implementations' performance, as well as their impact on the conclusions of prior studies under the assumption of interchangeable implementations. The outcomes of our differential tests showed significant discrepancies between the tested algorithm implementations, indicating that they are not interchangeable. In particular, out of the five PPO implementations tested on 56 games, three implementations achieved superhuman performance for 50% of their total trials while the other two implementations only achieved superhuman performance for less than 15% of their total trials. As part of a meticulous manual analysis of the implementations' source code, we analyzed implementation discrepancies and determined that code-level inconsistencies primarily caused these discrepancies. Lastly, we replicated a study and showed that this assumption of implementation interchangeability was sufficient to flip experiment outcomes. Therefore, this calls for a shift in how implementations are being used.
comment: Added the ACM Functional v1.1 badge, the full publication citation, and a link to the publication on IEEE Xplore
♻ ☆ Aeolus: A Multi-structural Flight Delay Dataset
We introduce Aeolus, a large-scale Multi-modal Flight Delay Dataset designed to advance research on flight delay prediction and support the development of foundation models for tabular data. Existing datasets in this domain are typically limited to flat tabular structures and fail to capture the spatiotemporal dynamics inherent in delay propagation. Aeolus addresses this limitation by providing three aligned modalities: (i) a tabular dataset with rich operational, meteorological, and airportlevel features for over 50 million flights; (ii) a flight chain module that models delay propagation along sequential flight legs, capturing upstream and downstream dependencies; and (iii) a flight network graph that encodes shared aircraft, crew, and airport resource connections, enabling cross-flight relational reasoning. The dataset is carefully constructed with temporal splits, comprehensive features, and strict leakage prevention to support realistic and reproducible machine learning evaluation. Aeolus supports a broad range of tasks, including regression, classification, temporal structure modeling, and graph learning, serving as a unified benchmark across tabular, sequential, and graph modalities. We release baseline experiments and preprocessing tools to facilitate adoption. Aeolus fills a key gap for both domain-specific modeling and general-purpose structured data research.Our source code and data can be accessed at https://github.com/Flnny/Delay-data
♻ ☆ Artificially intelligent agents in the social and behavioral sciences: A history and outlook
We review the historical development and current trends of artificially intelligent agents (agentic AI) in the social and behavioral sciences: from the first programmable computers, and social simulations soon thereafter, to today's experiments with large language models. This overview emphasizes the role of AI in the scientific process and the changes brought about, both through technological advancements and the broader evolution of science from around 1950 to the present. Some of the specific points we cover include: the challenges of presenting the first social simulation studies to a world unaware of computers, the rise of social systems science, intelligent game theoretic agents, the age of big data and the epistemic upheaval in its wake, and the current enthusiasm around applications of generative AI, and many other topics. A pervasive theme is how deeply entwined we are with the technologies we use to understand ourselves.
♻ ☆ LLM Based Long Code Translation using Identifier Replacement
In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often struggle with long source codes that don't fit into the context window, which produces inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting user-given long identifiers with generalized placeholders during translation, our method allows the LLM to focus on the logical structure of the code, by reducing token count and memory usage, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translation results with reduced tokens.
♻ ☆ SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
With the integration of large language models (LLMs), embodied agents have strong capabilities to understand and plan complicated natural language instructions. However, a foreseeable issue is that those embodied agents can also flawlessly execute some hazardous tasks, potentially causing damages in the real world. Existing benchmarks predominantly overlook critical safety risks, focusing solely on planning performance, while a few evaluate LLMs' safety awareness only on non-interactive image-text data. To address this gap, we present SafeAgentBench -- the first comprehensive benchmark for safety-aware task planning of embodied LLM agents in interactive simulation environments, covering both explicit and implicit hazards. SafeAgentBench includes: (1) an executable, diverse, and high-quality dataset of 750 tasks, rigorously curated to cover 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 9 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives. Experimental results show that, although agents based on different design frameworks exhibit substantial differences in task success rates, their overall safety awareness remains weak. The most safety-conscious baseline achieves only a 10% rejection rate for detailed hazardous tasks. Moreover, simply replacing the LLM driving the agent does not lead to notable improvements in safety awareness. Dataset and codes are available in https://github.com/shengyin1224/SafeAgentBench and https://huggingface.co/datasets/safeagentbench/SafeAgentBench.
comment: 28 pages, 19 tables, 15 figures
♻ ☆ LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models
Mixture of experts (MoE) architectures have become a cornerstone for scaling up and are a key component in most large language models such as GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE remains severely constrained by the prohibitive computational costs of training and evaluation, restricting large-scale studies accessible to most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research that supports both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analytical tools for probing routing and expert dynamics. Leveraging this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, demonstrating how subtle changes in router initialization shape early expert utilization; and (iii) training regime differences, revealing how sparse upcycling and full pretraining exhibit distinct routing patterns and stability profiles. By lowering the barrier to entry and standardizing evaluation, along with our comprehensive analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovations. Project page: https://fsoft-aic.github.io/fsoft-LibMoE.github.io.
comment: 15 pages, 9 figures
♻ ☆ RepoMasterEval: Evaluating Code Completion via Real-World Repositories
With the growing reliance on automated code completion tools in software development, the need for comprehensive evaluation benchmarks has become critical. Existing benchmarks focus more on code completion in function and class level by providing text descriptions to prompt the model. By contrast, such descriptive prompt is commonly unavailable in real development and code completion can occur in wider range of situations such as in the middle of a function or a code block. These limitations makes existing evaluation benchmarks poorly align with the practical scenarios of code completion tools. In this paper, we propose RepoMasterEval, a novel benchmark for evaluating code completion models constructed from real-world repositories. Each benchmark datum is generated by masking a code snippet (ground truth) from one source code file with existing test suites. To improve test accuracy of model generated code, we employ mutation testing to measure the effectiveness of the test cases and we manually crafted new test cases for those test suites with low mutation score. Our empirical evaluation on 10 state-of-the-art models shows that test argumentation is critical in improving the accuracy of the benchmark and RepoMasterEval is able to report variance in model performance in real-world scenarios. The deployment of RepoMasterEval also revealed that the benchmark is useful to give accurate feedback during model training and the score is in high correlation with the model's performance in practice.
♻ ☆ BALR-SAM: Boundary-Aware Low-Rank Adaptation of SAM for Resource-Efficient Medical Image Segmentation
Vision foundation models like the Segment Anything Model (SAM), pretrained on large-scale natural image datasets, often struggle in medical image segmentation due to a lack of domain-specific adaptation. In clinical practice, fine-tuning such models efficiently for medical downstream tasks with minimal resource demands, while maintaining strong performance, is challenging. To address these issues, we propose BALR-SAM, a boundary-aware low-rank adaptation framework that enhances SAM for medical imaging. It combines three tailored components: (1) a Complementary Detail Enhancement Network (CDEN) using depthwise separable convolutions and multi-scale fusion to capture boundary-sensitive features essential for accurate segmentation; (2) low-rank adapters integrated into SAM's Vision Transformer blocks to optimize feature representation and attention for medical contexts, while simultaneously significantly reducing the parameter space; and (3) a low-rank tensor attention mechanism in the mask decoder, cutting memory usage by 75% and boosting inference speed. Experiments on standard medical segmentation datasets show that BALR-SAM, without requiring prompts, outperforms several state-of-the-art (SOTA) methods, including fully fine-tuned MedSAM, while updating just 1.8% (11.7M) of its parameters.
♻ ☆ A Framework for Objective-Driven Dynamical Stochastic Fields
Fields offer a versatile approach for describing complex systems composed of interacting and dynamic components. In particular, some of these dynamical and stochastic systems may exhibit goal-directed behaviors aimed at achieving specific objectives, which we refer to as $\textit{intelligent fields}$. However, due to their inherent complexity, it remains challenging to develop a formal theoretical description of such systems and to effectively translate these descriptions into practical applications. In this paper, we propose three fundamental principles to establish a theoretical framework for understanding intelligent fields: complete configuration, locality, and purposefulness. Moreover, we explore methodologies for designing such fields from the perspective of artificial intelligence applications. This initial investigation aims to lay the groundwork for future theoretical developments and practical advances in understanding and harnessing the potential of such objective-driven dynamical stochastic fields.
♻ ☆ MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
comment: Code available at: https://github.com/HKUST-MINSys-Lab/MMEdge. Accepted by SenSys 2026
♻ ☆ A Comprehensive Survey on Process-Oriented Automatic Text Summarization with Exploration of LLM-Based Methods
Automatic Text Summarization (ATS), utilizing Natural Language Processing (NLP) algorithms, aims to create concise and accurate summaries, thereby significantly reducing the human effort required in processing large volumes of text. ATS has drawn considerable interest in both academic and industrial circles. Many studies have been conducted in the past to survey ATS methods; however, they generally lack practicality for real-world implementations, as they often categorize previous methods from a theoretical standpoint. Moreover, the advent of Large Language Models (LLMs) has altered conventional ATS methods. In this survey, we aim to 1) provide a comprehensive overview of ATS from a ``Process-Oriented Schema'' perspective, which is best aligned with real-world implementations; 2) comprehensively review the latest LLM-based ATS works; and 3) deliver an up-to-date survey of ATS, bridging the two-year gap in the literature. To the best of our knowledge, this is the first survey to specifically investigate LLM-based ATS methods.
♻ ☆ Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
Reinforcement learning with verifiable rewards (RLVR) has recently enhanced the reasoning capabilities of large language models (LLMs), particularly for mathematical problem solving. However, a fundamental limitation remains: as the sampling budget increases, the advantage of RLVR-trained models over their pretrained bases often diminishes or even vanishes, revealing a strong dependence on the base model's restricted search space. We attribute this phenomenon to the widespread use of the reverse Kullback-Leibler (KL) divergence regularizer, whose mode-seeking behavior keeps the policy trapped inside the base model's support region and hampers wider exploration. To address this issue, we propose RAPO (Rewards-Aware Policy Optimization), an algorithm to promote broader yet focused exploration. Our method (i) utilizes the forward KL penalty to replace the reverse KL penalty for out-of-distribution exploration, and (ii) reweights the reference policy to facilitate adaptive in-distribution exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset, without supervised fine-tuning, and evaluate them on AIME2024 and AIME2025. Results show that RAPO consistently improves problem-solving performance. Notably, RAPO enables models to surpass the base model's performance ceiling and solves previously intractable problems, advancing the frontier of RLVR for challenging reasoning tasks.
♻ ☆ BuildArena: A Physics-Aligned Interactive Benchmark of LLMs for Engineering Construction
Engineering construction automation aims to transform natural language specifications into physically viable structures, requiring complex integrated reasoning under strict physical constraints. While modern LLMs possess broad knowledge and strong reasoning capabilities that make them promising candidates for this domain, their construction competencies remain largely unevaluated. To address this gap, we introduce BuildArena, the first physics-aligned interactive benchmark designed for language-driven engineering construction. It contributes to the community in four aspects: (1) a highly customizable benchmarking framework for in-depth comparison and analysis of LLMs; (2) an extendable task design strategy spanning static and dynamic mechanics across multiple difficulty tiers; (3) a 3D Spatial Geometric Computation Library for supporting construction based on language instructions; (4) a baseline LLM agentic workflow that effectively evaluates diverse model capabilities. On eight frontier LLMs, BuildArena comprehensively evaluates their capabilities for language-driven and physics-grounded construction automation. The project page is at https://build-arena.github.io/.
comment: 33 pages, 10 figures
♻ ☆ Deciphering Invariant Feature Decoupling in Source-free Time Series Forecasting with Proxy Denoising
The proliferation of mobile devices generates a massive volume of time series across various domains, where effective time series forecasting enables a variety of real-world applications. This study focuses on a new problem of source-free domain adaptation for time series forecasting. It aims to adapt a pretrained model from sufficient source time series to the sparse target time series domain without access to the source data, embracing data protection regulations. To achieve this, we propose TimePD, the first source-free time series forecasting framework with proxy denoising, where large language models (LLMs) are employed to benefit from their generalization capabilities. Specifically, TimePD consists of three key components: (1) dual-branch invariant disentangled feature learning that enforces representation- and gradient-wise invariance by means of season-trend decomposition; (2) lightweight, parameter-free proxy denoising that dynamically calibrates systematic biases of LLMs; and (3) knowledge distillation that bidirectionally aligns the denoised prediction and the original target prediction. Extensive experiments on real-world datasets offer insight into the effectiveness of the proposed TimePD, outperforming SOTA baselines by 9.3% on average.
♻ ☆ Artificial Empathy: AI based Mental Health
Many people suffer from mental health problems but not everyone seeks professional help or has access to mental health care. AI chatbots have increasingly become a go-to for individuals who either have mental disorders or simply want someone to talk to. This paper presents a study on participants who have previously used chatbots and a scenario-based testing of large language model (LLM) chatbots. Our findings indicate that AI chatbots were primarily utilized as a "Five minute therapist" or as a non-judgmental companion. Participants appreciated the anonymity and lack of judgment from chatbots. However, there were concerns about privacy and the security of sensitive information. The scenario-based testing of LLM chatbots highlighted additional issues. Some chatbots were consistently reassuring, used emojis and names to add a personal touch, and were quick to suggest seeking professional help. However, there were limitations such as inconsistent tone, occasional inappropriate responses (e.g., casual or romantic), and a lack of crisis sensitivity, particularly in recognizing red flag language and escalating responses appropriately. These findings can inform both the technology and mental health care industries on how to better utilize AI chatbots to support individuals during challenging emotional periods.
♻ ☆ Normative Reasoning in Large Language Models: A Comparative Benchmark from Logical and Modal Perspectives EMNLP 2025
Normative reasoning is a type of reasoning that involves normative or deontic modality, such as obligation and permission. While large language models (LLMs) have demonstrated remarkable performance across various reasoning tasks, their ability to handle normative reasoning remains underexplored. In this paper, we systematically evaluate LLMs' reasoning capabilities in the normative domain from both logical and modal perspectives. Specifically, to assess how well LLMs reason with normative modals, we make a comparison between their reasoning with normative modals and their reasoning with epistemic modals, which share a common formal structure. To this end, we introduce a new dataset covering a wide range of formal patterns of reasoning in both normative and epistemic domains, while also incorporating non-formal cognitive factors that influence human reasoning. Our results indicate that, although LLMs generally adhere to valid reasoning patterns, they exhibit notable inconsistencies in specific types of normative reasoning and display cognitive biases similar to those observed in psychological studies of human reasoning. These findings highlight challenges in achieving logical consistency in LLMs' normative reasoning and provide insights for enhancing their reliability. All data and code are released publicly at https://github.com/kmineshima/NeuBAROCO.
comment: Accepted to the 8th BlackboxNLP Workshop at EMNLP 2025
♻ ☆ 'Studies for': A Human-AI Co-Creative Sound Artwork Using a Real-time Multi-channel Sound Generation Model NeurIPS
This paper explores the integration of AI technologies into the artistic workflow through the creation of Studies for, a generative sound installation developed in collaboration with sound artist Evala (https://www.ntticc.or.jp/en/archive/works/studies-for/). The installation employs SpecMaskGIT, a lightweight yet high-quality sound generation AI model, to generate and playback eight-channel sound in real-time, creating an immersive auditory experience over the course of a three-month exhibition. The work is grounded in the concept of a "new form of archive," which aims to preserve the artistic style of an artist while expanding beyond artists' past artworks by continued generation of new sound elements. This speculative approach to archival preservation is facilitated by training the AI model on a dataset consisting of over 200 hours of Evala's past sound artworks. By addressing key requirements in the co-creation of art using AI, this study highlights the value of the following aspects: (1) the necessity of integrating artist feedback, (2) datasets derived from an artist's past works, and (3) ensuring the inclusion of unexpected, novel outputs. In Studies for, the model was designed to reflect the artist's artistic identity while generating new, previously unheard sounds, making it a fitting realization of the concept of "a new form of archive." We propose a Human-AI co-creation framework for effectively incorporating sound generation AI models into the sound art creation process and suggest new possibilities for creating and archiving sound art that extend an artist's work beyond their physical existence. Demo page: https://sony.github.io/studies-for/
comment: Accepted at NeurIPS Creative AI Track 2025, 9 pages, 6 figures, 1 table, Demo page: https://sony.github.io/studies-for/
♻ ☆ DrivAerStar: An Industrial-Grade CFD Dataset for Vehicle Aerodynamic Optimization
Vehicle aerodynamics optimization has become critical for automotive electrification, where drag reduction directly determines electric vehicle range and energy efficiency. Traditional approaches face an intractable trade-off: computationally expensive Computational Fluid Dynamics (CFD) simulations requiring weeks per design iteration, or simplified models that sacrifice production-grade accuracy. While machine learning offers transformative potential, existing datasets exhibit fundamental limitations -- inadequate mesh resolution, missing vehicle components, and validation errors exceeding 5% -- preventing deployment in industrial workflows. We present DrivAerStar, comprising 12,000 industrial-grade automotive CFD simulations generated using STAR-CCM+${}^\unicode{xAE}$ software. The dataset systematically explores three vehicle configurations through 20 Computer Aided Design (CAD) parameters via Free Form Deformation (FFD) algorithms, including complete engine compartments and cooling systems with realistic internal airflow. DrivAerStar achieves wind tunnel validation accuracy below 1.04% -- a five-fold improvement over existing datasets -- through refined mesh strategies with strict wall $y^+$ control. Benchmarks demonstrate that models trained on this data achieve production-ready accuracy while reducing computational costs from weeks to minutes. This represents the first dataset bridging academic machine learning research and industrial CFD practice, establishing a new standard for data-driven aerodynamic optimization in automotive development. Beyond automotive applications, DrivAerStar demonstrates a paradigm for integrating high-fidelity physics simulations with Artificial Intelligence (AI) across engineering disciplines where computational constraints currently limit innovation.
♻ ☆ R$^2$ec: Towards Large Recommender Models with Reasoning
Large recommender models have extended LLMs as powerful recommenders via encoding or item generation, and recent breakthroughs in LLM reasoning synchronously motivate the exploration of reasoning in recommendation. In this work, we propose R$^2$ec, a unified large recommender model with intrinsic reasoning capability. R$^2$ec introduces a dual-head architecture that supports both reasoning chain generation and efficient item prediction in a single model, significantly reducing inference latency. To overcome the lack of annotated reasoning data, we design RecPO, a reinforcement learning framework that optimizes reasoning and recommendation jointly with a novel fused reward mechanism. Extensive experiments on three datasets demonstrate that R$^2$ec outperforms traditional, LLM-based, and reasoning-augmented recommender baselines, while further analyses validate its competitive efficiency among conventional LLM-based recommender baselines and strong adaptability to diverse recommendation scenarios. Code and checkpoints available at https://github.com/YRYangang/RRec.
comment: Accepted by Neurips 2025
♻ ☆ Don't throw the baby out with the bathwater: How and why deep learning for ARC
The Abstraction and Reasoning Corpus (ARC-AGI) presents a formidable challenge for AI systems. Despite the typically low performance on ARC, the deep learning paradigm remains the most effective known strategy for generating skillful (state-of-the-art) neural networks (NN) across varied modalities and tasks in vision, language etc. The deep learning paradigm has proven to be able to train these skillful neural networks and learn the abstractions needed in these diverse domains. Our work doubles down on that and continues to leverage this paradigm by incorporating on-the-fly NN training at test time. We demonstrate that fully committing to deep learning's capacity to acquire novel abstractions yields state-of-the-art performance on ARC. Specifically, we treat both the neural network and the optimizer (rather than just a pre-trained network) as integral components of the inference process, fostering generalization to unseen tasks. Concretely, we propose a methodology for training on ARC, starting from pretrained LLMs, and enhancing their ARC reasoning. We also propose Test-Time Fine-Tuning (TTFT) and the Augment Inference Reverse-Augmentation and Vote (AIRV) as effective test-time techniques. We are the first to propose and show deep learning can be used effectively for ARC, showing boosts of up to 260% in accuracy with AIRV and a further 300% boost with TTFT. An early version of this approach secured first place in the 2023 ARCathon competition, while the final version achieved the current best score on the ARC private test-set (58%). Our findings highlight the key ingredients of a robust reasoning system in unfamiliar domains, underscoring the central mechanisms that improve broad perceptual reasoning.
comment: 13 pages, 6 figures
♻ ☆ Training a Generally Curious Agent ICML 2025
Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present Paprika, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, Paprika teaches models to explore and adapt their behavior on a new task based on environment feedback in-context without more gradient updates. Experimental results show that models fine-tuned with Paprika can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. Unlike traditional training, our approach's primary bottleneck lies in sampling useful interaction data instead of model updates. To improve sample efficiency, we propose a curriculum learning strategy that prioritizes sampling trajectories from tasks with high learning potential. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interactions with the external world.
comment: ICML 2025. Project Website: https://paprika-llm.github.io
♻ ☆ HiRA: A Hierarchical Reasoning Framework for Decoupled Planning and Execution in Deep Search
Complex information needs in real-world search scenarios demand deep reasoning and knowledge synthesis across diverse sources, which traditional retrieval-augmented generation (RAG) pipelines struggle to address effectively. Current reasoning-based approaches suffer from a fundamental limitation: they use a single model to handle both high-level planning and detailed execution, leading to inefficient reasoning and limited scalability. In this paper, we introduce HiRA, a hierarchical framework that separates strategic planning from specialized execution. Our approach decomposes complex search tasks into focused subtasks, assigns each subtask to domain-specific agents equipped with external tools and reasoning capabilities, and coordinates the results through a structured integration mechanism. This separation prevents execution details from disrupting high-level reasoning while enabling the system to leverage specialized expertise for different types of information processing. Experiments on four complex, cross-modal deep search benchmarks demonstrate that HiRA significantly outperforms state-of-the-art RAG and agent-based systems. Our results show improvements in both answer quality and system efficiency, highlighting the effectiveness of decoupled planning and execution for multi-step information seeking tasks. Our code is available at https://github.com/ignorejjj/HiRA.
comment: 9 pages
♻ ☆ E2Rank: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker
Text embedding models serve as a fundamental component in real-world search applications. By mapping queries and documents into a shared embedding space, they deliver competitive retrieval performance with high efficiency. However, their ranking fidelity remains limited compared to dedicated rerankers, especially recent LLM-based listwise rerankers, which capture fine-grained query-document and document-document interactions. In this paper, we propose a simple yet effective unified framework E2Rank, means Efficient Embedding-based Ranking (also means Embedding-to-Rank), which extends a single text embedding model to perform both high-quality retrieval and listwise reranking through continued training under a listwise ranking objective, thereby achieving strong effectiveness with remarkable efficiency. By applying cosine similarity between the query and document embeddings as a unified ranking function, the listwise ranking prompt, which is constructed from the original query and its candidate documents, serves as an enhanced query enriched with signals from the top-K documents, akin to pseudo-relevance feedback (PRF) in traditional retrieval models. This design preserves the efficiency and representational quality of the base embedding model while significantly improving its reranking performance. Empirically, E2Rank achieves state-of-the-art results on the BEIR reranking benchmark and demonstrates competitive performance on the reasoning-intensive BRIGHT benchmark, with very low reranking latency. We also show that the ranking training process improves embedding performance on the MTEB benchmark. Our findings indicate that a single embedding model can effectively unify retrieval and reranking, offering both computational efficiency and competitive ranking accuracy.
comment: Code and models are avaliable at https://alibaba-nlp.github.io/E2Rank
♻ ☆ Multilingual State Space Models for Structured Question Answering in Indic Languages NAACL 2025
The diversity and complexity of Indic languages present unique challenges for natural language processing (NLP) tasks, particularly in the domain of question answering (QA).To address these challenges, this paper explores the application of State Space Models (SSMs),to build efficient and contextually aware QA systems tailored for Indic languages. SSMs are particularly suited for this task due to their ability to model long-term and short-term dependencies in sequential data, making them well-equipped to handle the rich morphology, complex syntax, and contextual intricacies characteristic of Indian languages. We evaluated multiple SSM architectures across diverse datasets representing various Indic languages and conducted a comparative analysis of their performance. Our results demonstrate that these models effectively capture linguistic subtleties, leading to significant improvements in question interpretation, context alignment, and answer generation. This work represents the first application of SSMs to question answering tasks in Indic languages, establishing a foundational benchmark for future research in this domain. We propose enhancements to existing SSM frameworks, optimizing their applicability to low-resource settings and multilingual scenarios prevalent in Indic languages.
comment: NAACL 2025 Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT)
♻ ☆ Scaling Diffusion Transformers Efficiently via $μ$P NeurIPS 2025
Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including U-ViT, DiT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-$\alpha$ and 3% of consumption by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.
comment: Accepted by NeurIPS 2025, 38 pages, 10 figures, 17 tables
♻ ☆ Variational Visual Question Answering for Uncertainty-Aware Selective Prediction
Despite remarkable progress in recent years, vision language models (VLMs) remain prone to overconfidence and hallucinations on tasks such as Visual Question Answering (VQA) and Visual Reasoning. Bayesian methods can potentially improve reliability by helping models selectively predict, that is, models respond only when they are sufficiently confident. Unfortunately, Bayesian methods are often assumed to be costly and ineffective for large models, and so far there exists little evidence to show otherwise, especially for multimodal applications. Here, we show the effectiveness and competitive edge of variational Bayes for selective prediction in VQA for the first time. We build on recent advances in variational methods for deep learning and propose an extension called "Variational VQA". This method improves calibration and yields significant gains for selective prediction on VQA and Visual Reasoning, particularly when the error tolerance is low ($\leq 1\%$). Often, just one posterior sample can yield more reliable answers than those obtained by models trained with AdamW. In addition, we propose a new risk-averse selector that outperforms standard sample averaging by considering the variance of predictions. Overall, we present compelling evidence that variational learning is a viable option to make large VLMs safer and more trustworthy.
comment: under review at TMLR
♻ ☆ PartnerMAS: An LLM Hierarchical Multi-Agent Framework for Business Partner Selection on High-Dimensional Features
High-dimensional decision-making tasks, such as business partner selection, involve evaluating large candidate pools with heterogeneous numerical, categorical, and textual features. While large language models (LLMs) offer strong in-context reasoning capabilities, single-agent or debate-style systems often struggle with scalability and consistency in such settings. We propose PartnerMAS, a hierarchical multi-agent framework that decomposes evaluation into three layers: a Planner Agent that designs strategies, Specialized Agents that perform role-specific assessments, and a Supervisor Agent that integrates their outputs. To support systematic evaluation, we also introduce a curated benchmark dataset of venture capital co-investments, featuring diverse firm attributes and ground-truth syndicates. Across 140 cases, PartnerMAS consistently outperforms single-agent and debate-based multi-agent baselines, achieving up to 10--15\% higher match rates. Analysis of agent reasoning shows that planners are most responsive to domain-informed prompts, specialists produce complementary feature coverage, and supervisors play an important role in aggregation. Our findings demonstrate that structured collaboration among LLM agents can generate more robust outcomes than scaling individual models, highlighting PartnerMAS as a promising framework for high-dimensional decision-making in data-rich domains.
♻ ☆ MindSearch: Mimicking Human Minds Elicits Deep AI Searcher ICLR2025
Information seeking and integration is a complex cognitive task that consumes enormous time and effort. Inspired by the remarkable progress of Large Language Models, recent works attempt to solve this task by combining LLMs and search engines. However, these methods still obtain unsatisfying performance due to three challenges: (1) complex requests often cannot be accurately and completely retrieved by the search engine once (2) corresponding information to be integrated is spread over multiple web pages along with massive noise, and (3) a large number of web pages with long contents may quickly exceed the maximum context length of LLMs. Inspired by the cognitive process when humans solve these problems, we introduce MindSearch to mimic the human minds in web information seeking and integration, which can be instantiated by a simple yet effective LLM-based multi-agent framework. The WebPlanner models the human mind of multi-step information seeking as a dynamic graph construction process: it decomposes the user query into atomic sub-questions as nodes in the graph and progressively extends the graph based on the search result from WebSearcher. Tasked with each sub-question, WebSearcher performs hierarchical information retrieval with search engines and collects valuable information for WebPlanner. The multi-agent design of MindSearch enables the whole framework to seek and integrate information parallelly from larger-scale (e.g., more than 300) web pages in 3 minutes, which is worth 3 hours of human effort. MindSearch demonstrates significant improvement in the response quality in terms of depth and breadth, on both close-set and open-set QA problems. Besides, responses from MindSearch based on InternLM2.5-7B are preferable by humans to ChatGPT-Web and Perplexity.ai applications, which implies that MindSearch can already deliver a competitive solution to the proprietary AI search engine.
comment: ICLR2025. Project Page: https://mindsearch.netlify.app Code: https://github.com/InternLM/MindSearch
♻ ☆ SVTime: Small Time Series Forecasting Models Informed by "Physics" of Large Vision Model Forecasters
Time series AI is crucial for analyzing dynamic web content, driving a surge of pre-trained large models known for their strong knowledge encoding and transfer capabilities across diverse tasks. However, given their energy-intensive training, inference, and hardware demands, using large models as a one-fits-all solution raises serious concerns about carbon footprint and sustainability. For a specific task, a compact yet specialized, high-performing model may be more practical and affordable, especially for resource-constrained users such as small businesses. This motivates the question: Can we build cost-effective lightweight models with large-model-like performance on core tasks such as forecasting? This paper addresses this question by introducing SVTime, a novel Small model inspired by large Vision model (LVM) forecasters for long-term Time series forecasting (LTSF). Recently, LVMs have been shown as powerful tools for LTSF. We identify a set of key inductive biases of LVM forecasters -- analogous to the "physics" governing their behaviors in LTSF -- and design small models that encode these biases through meticulously crafted linear layers and constraint functions. Across 21 baselines spanning lightweight, complex, and pre-trained large models on 8 benchmark datasets, SVTime outperforms state-of-the-art (SOTA) lightweight models and rivals large models with 10^3 fewer parameters than LVMs, while enabling efficient training and inference in low-resource settings.
♻ ☆ SmartMixed: A Two-Phase Training Strategy for Adaptive Activation Function Learning in Neural Networks
The choice of activation function plays a critical role in neural networks, yet most architectures still rely on fixed, uniform activation functions across all neurons. We introduce SmartMixed, a two-phase training strategy that allows networks to learn optimal per-neuron activation functions while preserving computational efficiency at inference. In the first phase, neurons adaptively select from a pool of candidate activation functions (ReLU, Sigmoid, Tanh, Leaky ReLU, ELU, SELU) using a differentiable hard-mixture mechanism. In the second phase, each neuron's activation function is fixed according to the learned selection, resulting in a computationally efficient network that supports continued training with optimized vectorized operations. We evaluate SmartMixed on the MNIST dataset using feedforward neural networks of varying depths. The analysis shows that neurons in different layers exhibit distinct preferences for activation functions, providing insights into the functional diversity within neural architectures.
♻ ☆ SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code is available at https://github.com/thu-ml/SageAttention.
♻ ☆ Large Language Models for Combinatorial Optimization of Design Structure Matrix
In complex engineering systems, the dependencies among components or development activities are often modeled and analyzed using Design Structure Matrix (DSM). Reorganizing elements within a DSM to minimize feedback loops and enhance modularity or process efficiency constitutes a challenging combinatorial optimization (CO) problem in engineering design and operations. As problem sizes increase and dependency networks become more intricate, traditional optimization methods that rely solely on mathematical heuristics often fail to capture the contextual nuances and struggle to deliver effective solutions. In this study, we explore the potential of Large Language Models (LLMs) to address such CO problems by leveraging their capabilities for advanced reasoning and contextual understanding. We propose a novel LLM-based framework that integrates network topology with contextual domain knowledge for iterative optimization of DSM sequencing-a common CO problem. Experiments on various DSM cases demonstrate that our method consistently achieves faster convergence and superior solution quality compared to both stochastic and deterministic baselines. Notably, incorporating contextual domain knowledge significantly enhances optimization performance regardless of the chosen LLM backbone. These findings highlight the potential of LLMs to solve complex engineering CO problems by combining semantic and mathematical reasoning. This approach paves the way towards a new paradigm in LLM-based engineering design optimization.
comment: 20 pages, 6 figures
♻ ☆ On-device Computation of Single-lead ECG Parameters for Real-time Remote Cardiac Health Assessment: A Real-world Validation Study
Accurate, continuous out-of-hospital electrocardiogram (ECG) parameter measurement is vital for real-time cardiac health monitoring and telemedicine. On-device computation of single-lead ECG parameters enables timely assessment without reliance on centralized data processing, advancing personalized, ubiquitous cardiac care-yet comprehensive validation across heterogeneous real-world populations remains limited. This study validated the on-device algorithm FeatureDB (https://github.com/PKUDigitalHealth/FeatureDB) using two datasets: HeartVoice-ECG-lite (369 participants with single-lead ECGs annotated by two physicians) and PTB-XL/PTB-XL+ (21,354 patients with 12-lead ECGs and physicians' diagnostic annotations). FeatureDB computed PR, QT, and QTc intervals, with accuracy evaluated against physician annotations via mean absolute error (MAE), correlation analysis, and Bland-Altman analysis. Diagnostic performance for first-degree atrioventricular block (AVBI, PR-based) and long QT syndrome (LQT, QTc-based) was benchmarked against commercial 12-lead systems (12SL, Uni-G) and open-source algorithm Deli, using AUC, accuracy, sensitivity, and specificity. Results showed high concordance with expert annotations (Pearson correlations: 0.836-0.960), MAEs matching inter-observer variability, and minimal bias. AVBI AUC reached 0.787 (12SL: 0.859; Uni-G: 0.812; Deli: 0.501); LQT AUC was 0.684 (12SL: 0.716; Uni-G: 0.605; Deli: 0.569)-comparable to commercial tools and superior to open-source alternatives. FeatureDB delivers physician-level parameter accuracy and commercial-grade abnormality detection via single-lead devices, supporting scalable telemedicine, decentralized cardiac screening, and continuous monitoring in community and outpatient settings.
♻ ☆ Continual Vision-and-Language Navigation
Developing Vision-and-Language Navigation (VLN) agents typically assumes a \textit{train-once-deploy-once} strategy, which is unrealistic as deployed agents continually encounter novel environments. To address this, we propose the Continual Vision-and-Language Navigation (CVLN) paradigm, where agents learn and adapt incrementally across multiple \textit{scene domains}. CVLN includes two setups: Initial-instruction based CVLN for instruction-following, and Dialogue-based CVLN for dialogue-guided navigation. We also introduce two simple yet effective baselines for sequential decision-making: Perplexity Replay (PerpR), which replays difficult episodes, and Episodic Self-Replay (ESR), which stores and revisits action logits during training. Experiments show that existing continual learning methods fall short for CVLN, while PerpR and ESR achieve better performance by efficiently utilizing replay memory.
♻ ☆ Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning
We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
comment: This work is to appear in the Proceedings of MICAD 2025, the 6th International Conference on Medical Imaging and Computer-Aided Diagnosis
♻ ☆ DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications
Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.
♻ ☆ Multi-Modal View Enhanced Large Vision Models for Long-Term Time Series Forecasting
Time series, typically represented as numerical sequences, can also be transformed into images and texts, offering multi-modal views (MMVs) of the same underlying signal. These MMVs can reveal complementary patterns and enable the use of powerful pre-trained large models, such as large vision models (LVMs), for long-term time series forecasting (LTSF). However, as we identified in this work, the state-of-the-art (SOTA) LVM-based forecaster poses an inductive bias towards "forecasting periods". To harness this bias, we propose DMMV, a novel decomposition-based multi-modal view framework that leverages trend-seasonal decomposition and a novel backcast-residual based adaptive decomposition to integrate MMVs for LTSF. Comparative evaluations against 14 SOTA models across diverse datasets show that DMMV outperforms single-view and existing multi-modal baselines, achieving the best mean squared error (MSE) on 6 out of 8 benchmark datasets. The code for this paper is available at: https://github.com/D2I-Group/dmmv.
♻ ☆ AVA: Towards Agentic Video Analytics with Vision Language Models
AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively-significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%. The source code of AVA is available at https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at https://huggingface.co/datasets/iesc/Ava-100.
comment: Accepted to NDSI 2026, 19pages, 12 figures, complementary evaluations and appendix
♻ ☆ Faithful and Fast Influence Function via Advanced Sampling
How can we explain the influence of training data on black-box models? Influence functions (IFs) offer a post-hoc solution by utilizing gradients and Hessians. However, computing the Hessian for an entire dataset is resource-intensive, necessitating a feasible alternative. A common approach involves randomly sampling a small subset of the training data, but this method often results in highly inconsistent IF estimates due to the high variance in sample configurations. To address this, we propose two advanced sampling techniques based on features and logits. These samplers select a small yet representative subset of the entire dataset by considering the stochastic distribution of features or logits, thereby enhancing the accuracy of IF estimations. We validate our approach through class removal experiments, a typical application of IFs, using the F1-score to measure how effectively the model forgets the removed class while maintaining inference consistency on the remaining classes. Our method reduces computation time by 30.1% and memory usage by 42.2%, or improves the F1-score by 2.5% compared to the baseline.
♻ ☆ Algorithmic Collusion of Pricing and Advertising on E-commerce Platforms
When online sellers use AI learning algorithms to automatically compete on e-commerce platforms, there is concern that they will learn to coordinate on higher than competitive prices. However, this concern was primarily raised in single-dimension price competition. We investigate whether this prediction holds when sellers make pricing and advertising decisions together, i.e., two-dimensional decisions. We analyze competition in multi-agent reinforcement learning, and use a large-scale dataset from Amazon.com to provide empirical evidence. We show that when consumers have high search costs, learning algorithms can coordinate on prices lower than competitive prices, facilitating a win-win-win for consumers, sellers, and platforms. This occurs because algorithms learn to coordinate on lower advertising bids, which lower advertising costs, leading to lower prices and enlarging demand on the platform. We also show that our results generalize to any learning algorithm that uses exploration of price and advertising bids. Consistent with our predictions, an empirical analysis shows that price levels exhibit a negative interaction between estimated consumer search costs and algorithm usage index. We analyze the platform's strategic response and find that reserve price adjustments will not increase platform profits, but commission adjustments will, while maintaining the beneficial outcomes for both sellers and consumers.
♻ ☆ Towards Automated Semantic Interpretability in Reinforcement Learning via Vision-Language Models
Semantic interpretability in Reinforcement Learning (RL) enables transparency and verifiability of decision-making. Achieving semantic interpretability in reinforcement learning requires (1) a feature space composed of human-understandable concepts and (2) a policy that is interpretable and verifiable. However, constructing such a feature space has traditionally relied on manual human specification, which often fails to generalize to unseen environments. Moreover, even when interpretable features are available, most reinforcement learning algorithms employ black-box models as policies, thereby hindering transparency. We introduce interpretable Tree-based Reinforcement learning via Automated Concept Extraction (iTRACE), an automated framework that leverages pre-trained vision-language models (VLM) for semantic feature extraction and train a interpretable tree-based model via RL. To address the impracticality of running VLMs in RL loops, we distill their outputs into a lightweight model. By leveraging Vision-Language Models (VLMs) to automate tree-based reinforcement learning, iTRACE loosens the reliance the need for human annotation that is traditionally required by interpretable models. In addition, it addresses key limitations of VLMs alone, such as their lack of grounding in action spaces and their inability to directly optimize policies. We evaluate iTRACE across three domains: Atari games, grid-world navigation, and driving. The results show that iTRACE outperforms other interpretable policy baselines and matches the performance of black-box policies on the same interpretable feature space.
♻ ☆ Face Spoofing Detection using Deep Learning
Digital image spoofing has emerged as a significant security threat in biometric authentication systems, particularly those relying on facial recognition. This study evaluates the performance of three vision based models, MobileNetV2, ResNET50, and Vision Transformer, ViT, for spoof detection in image classification, utilizing a dataset of 150,986 images divided into training , 140,002, testing, 10,984, and validation ,39,574, sets. Spoof detection is critical for enhancing the security of image recognition systems, and this research compares the models effectiveness through accuracy, precision, recall, and F1 score metrics. Results reveal that MobileNetV2 outperforms other architectures on the test dataset, achieving an accuracy of 91.59%, precision of 91.72%, recall of 91.59%, and F1 score of 91.58%, compared to ViT 86.54%, 88.28%, 86.54%, and 86.39%, respectively. On the validation dataset, MobileNetV2, and ViT excel, with MobileNetV2 slightly ahead at 97.17% accuracy versus ViT 96.36%. MobileNetV2 demonstrates faster convergence during training and superior generalization to unseen data, despite both models showing signs of overfitting. These findings highlight MobileNetV2 balanced performance and robustness, making it the preferred choice for spoof detection applications where reliability on new data is essential. The study underscores the importance of model selection in security sensitive contexts and suggests MobileNetV2 as a practical solution for real world deployment.
comment: The author's school has a conflict of interest regarding the submission of this article prior to his graduation thesis submission
♻ ☆ RaanA: A Fast, Flexible, and Data-Efficient Post-Training Quantization Algorithm
Post-training Quantization (PTQ) has become a widely used technique for improving inference efficiency of large language models (LLMs). However, existing PTQ methods generally suffer from crucial limitations such as heavy calibration data requirements and inflexible choice of target number of bits. In this paper, we propose RaanA, a unified PTQ framework that overcomes these challenges by introducing two novel components: 1) RaBitQ-H, a variant of a randomized vector quantization method RaBitQ, designed for fast, accurate, and highly efficient quantization; and 2) AllocateBits, an algorithm that optimally allocates bit-widths across layers based on their quantization sensitivity. RaanA achieves competitive performance with state-of-the-art quantization methods while being extremely fast, requiring minimal calibration data, and enabling flexible bit allocation. Extensive experiments demonstrate RaanA's efficacy in balancing efficiency and accuracy. The code is publicly available at https://github.com/FFTYYY/RaanA .
♻ ☆ Context Tuning for In-Context Optimization ICML 2025
We introduce Context Tuning, a simple and effective method to significantly enhance few-shot adaptation of language models (LLMs) without fine-tuning model parameters. While prompt-based adaptation techniques have demonstrated the effectiveness of lightweight adaptation methods for LLMs, they typically initialize a trainable prompt or prefix with irrelevant tokens for the task at hand. In contrast, Context Tuning initializes the trainable prompt or prefix with task-specific demonstration examples, leveraging the model's inherent In-Context Learning (ICL) ability to extract relevant information for improved few-shot learning performance. Extensive evaluations on benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC demonstrate that Context Tuning outperforms traditional prompt-based adaptation methods and achieves competitive accuracy to Test-Time Training with significantly higher training efficiency.
comment: A short version of this paper was accepted at ICML 2025 Workshop on Test-Time Adaptation
♻ ☆ High Resolution Seismic Waveform Generation using Denoising Diffusion
Accurate prediction and synthesis of seismic waveforms are crucial for seismic-hazard assessment and earthquake-resistant infrastructure design. Existing prediction methods, such as ground-motion models and physics-based wave-field simulations, often fail to capture the full complexity of seismic wavefields, particularly at higher frequencies. This study introduces HighFEM, a novel, computationally efficient, and scalable (i.e., capable of generating many seismograms simultaneously) generative model for high-frequency seismic-waveform generation. Our approach leverages a spectrogram representation of the seismic-waveform data, which is reduced to a lower-dimensional manifold via an autoencoder. A state-of-the-art diffusion model is trained to generate this latent representation conditioned on key input parameters: earthquake magnitude, recording distance, site conditions, hypocenter depth, and azimuthal gap. The model generates waveforms with frequency content up to 50 Hz. Any scalar ground-motion statistic, such as peak ground-motion amplitudes and spectral accelerations, can be readily derived from the synthesized waveforms. We validate our model using commonly employed seismological metrics and performance metrics from image-generation studies. Our results demonstrate that the openly available model can generate realistic high-frequency seismic waveforms across a wide range of input parameters, even in data-sparse regions. For the scalar ground-motion statistics commonly used in seismic-hazard and earthquake-engineering studies, we show that our model accurately reproduces both the median trends of the real data and their variability. To evaluate and compare the growing number of these and similar Generative Waveform Models (GWMs), we argue that they should be openly available and included in community ground-motion-model evaluation efforts.
♻ ☆ LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS
We present AIOS 1.0, a novel platform designed to advance computer-use agent (CUA) capabilities through environmental contextualization. While existing approaches primarily focus on building more powerful agent frameworks or enhancing agent models, we identify a fundamental limitation: the semantic disconnect between how language models understand the world and how computer interfaces are structured. AIOS 1.0 addresses this challenge by transforming computers into contextual environments that language models can natively comprehend, implementing a Model Context Protocol (MCP) server architecture to abstract computer states and actions. This approach effectively decouples interface complexity from decision complexity, enabling agents to reason more effectively about computing environments. To demonstrate our platform's effectiveness, we introduce LiteCUA, a lightweight computer-use agent built on AIOS 1.0 that achieves a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simple architecture. Our results suggest that contextualizing computer environments for language models represents a promising direction for developing more capable computer-use agents and advancing toward AI that can interact with digital systems.
♻ ☆ LocDiff: Identifying Locations on Earth by Diffusing in the Hilbert Space
Image geolocalization is a fundamental yet challenging task, aiming at inferring the geolocation on Earth where an image is taken. State-of-the-art methods employ either grid-based classification or gallery-based image-location retrieval, whose spatial generalizability significantly suffers if the spatial distribution of test im- ages does not align with the choices of grids and galleries. Recently emerging generative approaches, while getting rid of grids and galleries, use raw geographical coordinates and suffer quality losses due to their lack of multi-scale information. To address these limitations, we propose a multi-scale latent diffusion model called LocDiff for image geolocalization. We developed a novel positional encoding-decoding framework called Spherical Harmonics Dirac Delta (SHDD) Representations, which encodes points on a spherical surface (e.g., geolocations on Earth) into a Hilbert space of Spherical Harmonics coefficients and decodes points (geolocations) by mode-seeking on spherical probability distributions. We also propose a novel SirenNet-based architecture (CS-UNet) to learn an image-based conditional backward process in the latent SHDD space by minimizing a latent KL-divergence loss. To the best of our knowledge, LocDiff is the first image geolocalization model that performs latent diffusion in a multi-scale location encoding space and generates geolocations under the guidance of images. Experimental results show that LocDiff can outperform all state-of-the-art grid-based, retrieval-based, and diffusion-based baselines across 5 challenging global-scale image geolocalization datasets, and demonstrates significantly stronger generalizability to unseen geolocations.
♻ ☆ LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments
This paper introduces an efficient Vision-Language Model (VLM) pipeline specifically optimized for deployment on embedded devices, such as those used in robotics and autonomous driving. The pipeline significantly reduces the computational overhead by jointly leveraging patch selection to filter irrelevant camera views, a token selection module to reduce input sequence length for the LLM, and speculative decoding to accelerate token generation. Evaluation on the NVIDIA DRIVE Thor platform for automonous driving application, our pipeline achieves $2.5\times$ end-to-end latency reduction without compromising task accuracy. The speed-up further increases to $3.2\times$ when applying FP8 post-training quantization. These results demonstrate our pipeline as a viable solution for enabling real-time VLM deployment in resource-constrained environments.
♻ ☆ Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy and Research NeurIPS 2025
"Machine unlearning" is a popular proposed solution for mitigating the existence of content in an AI model that is problematic for legal or moral reasons, including privacy, copyright, safety, and more. For example, unlearning is often invoked as a solution for removing the effects of specific information from a generative-AI model's parameters, e.g., a particular individual's personal data or the inclusion of copyrighted content in the model's training data. Unlearning is also proposed as a way to prevent a model from generating targeted types of information in its outputs, e.g., generations that closely resemble a particular individual's data or reflect the concept of "Spiderman." Both of these goals--the targeted removal of information from a model and the targeted suppression of information from a model's outputs--present various technical and substantive challenges. We provide a framework for ML researchers and policymakers to think rigorously about these challenges, identifying several mismatches between the goals of unlearning and feasible implementations. These mismatches explain why unlearning is not a general-purpose solution for circumscribing generative-AI model behavior in service of broader positive impact.
comment: NeurIPS 2025 (Oral)
♻ ☆ A Tale of Two Symmetries: Exploring the Loss Landscape of Equivariant Models
Equivariant neural networks have proven to be effective for tasks with known underlying symmetries. However, optimizing equivariant networks can be tricky and best training practices are less established than for standard networks. In particular, recent works have found small training benefits from relaxing equivariance constraints. This raises the question: do equivariance constraints introduce fundamental obstacles to optimization? Or do they simply require different hyperparameter tuning? In this work, we investigate this question through a theoretical analysis of the loss landscape geometry. We focus on networks built using permutation representations, which we can view as a subset of unconstrained MLPs. Importantly, we show that the parameter symmetries of the unconstrained model has nontrivial effects on the loss landscape of the equivariant subspace and under certain conditions can provably prevent learning of the global minima. Further, we empirically demonstrate in such cases, relaxing to an unconstrained MLP can sometimes solve the issue. Interestingly, the weights eventually found via relaxation corresponds to a different choice of group representation in the hidden layer. From this, we draw 3 key takeaways. (1) By viewing the unconstrained version of an architecture, we can uncover hidden parameter symmetries which were broken by choice of constraint enforcement (2) Hidden symmetries give important insights on loss landscapes and can induce critical points and even minima (3) Hidden symmetry induced minima can sometimes be escaped by constraint relaxation and we observe the network jumps to a different choice of constraint enforcement. Effective equivariance relaxation may require rethinking the fixed choice of group representation in the hidden layers.
♻ ☆ Integer-only Quantized Transformers for Embedded FPGA-based Time-series Forecasting in AIoT
This paper presents the design of a hardware accelerator for Transformers, optimized for on-device time-series forecasting in AIoT systems. It integrates integer-only quantization and Quantization-Aware Training with optimized hardware designs to realize 6-bit and 4-bit quantized Transformer models, which achieved precision comparable to 8-bit quantized models from related research. Utilizing a complete implementation on an embedded FPGA (Xilinx Spartan-7 XC7S15), we examine the feasibility of deploying Transformer models on embedded IoT devices. This includes a thorough analysis of achievable precision, resource utilization, timing, power, and energy consumption for on-device inference. Our results indicate that while sufficient performance can be attained, the optimization process is not trivial. For instance, reducing the quantization bitwidth does not consistently result in decreased latency or energy consumption, underscoring the necessity of systematically exploring various optimization combinations. Compared to an 8-bit quantized Transformer model in related studies, our 4-bit quantized Transformer model increases test loss by only 0.63%, operates up to 132.33x faster, and consumes 48.19x less energy. Relevant source code is provided in the accompanying GitHub repository\footnote{https://github.com/tianheng-ling/TinyTransformer4TS}.
comment: Accepted by 2024 IEEE Annual Congress on Artificial Intelligence of Things (IEEE AIoT) and got best paper award. 7 pages, 3 figures, 4 tables
♻ ☆ Model Alignment Search
When can we say that two neural systems perform a task in the same way? What nuances do we miss when we fail to causally probe the representations of the systems, and how do we establish bidirectional causal relationships? In this work, we introduce a method that bidirectionally transfers neural activity between artificial neural networks and uses their resulting behavior as a measure of functional similarity. We first show that the method can be used to transfer the behavior from one frozen Neural Network (NN) to another in a manner similar to model stitching, and we show how the method can differ from correlative similarity measures like Representational Similarity Analysis. Next, we empirically and theoretically show how the method can be equivalent to model stitching when desired, or it can take a form that has a more restrictive focus to shared causal information; in both forms, it reduces the number of required matrices for a comparison of n models to be linear in n. We then present a case study on number-related tasks showing that the method can be used to examine specific subtypes of causal information demonstrating that numbers can be encoded differently in recurrent models depending on the task, and we present another case study showing that MAS can reveal misalignment in fine-tuned DeepSeek-r1-Qwen-1.5B models. Lastly, we augment the loss function with a counterfactual latent (CL) auxiliary objective to improve causal relevance when one of the two networks is causally inaccessible (as is often the case in comparisons with biological networks). We use our results to encourage the use of causal methods in neural similarity analyses and to suggest future explorations of network similarity methodology for model misalignment.
♻ ☆ Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning ICASSP 2025
Recent studies suggest that the deeper layers of Large Language Models (LLMs) contribute little to representation learning and can often be removed without significant performance loss. However, such claims are typically drawn from narrow evaluations and may overlook important aspects of model behavior. In this work, we present a systematic study of depth utilization across diverse dimensions, including evaluation protocols, task categories, and model architectures. Our analysis confirms that very deep layers are generally less effective than earlier ones, but their contributions vary substantially with the evaluation setting. Under likelihood-based metrics without generation, pruning most layers preserves performance, with only the initial few being critical. By contrast, generation-based evaluation uncovers indispensable roles for middle and deeper layers in enabling reasoning and maintaining long-range coherence. We further find that knowledge and retrieval are concentrated in shallow components, whereas reasoning accuracy relies heavily on deeper layers -- yet can be reshaped through distillation. These results highlight that depth usage in LLMs is highly heterogeneous and context-dependent, underscoring the need for task-, metric-, and model-aware perspectives in both interpreting and compressing large models.
comment: ICASSP 2025
♻ ☆ A Comprehensive Evaluation of Cognitive Biases in LLMs
We present a large-scale evaluation of 30 cognitive biases in 20 state-of-the-art large language models (LLMs) under various decision-making scenarios. Our contributions include a novel general-purpose test framework for reliable and large-scale generation of tests for LLMs, a benchmark dataset with 30,000 tests for detecting cognitive biases in LLMs, and a comprehensive assessment of the biases found in the 20 evaluated LLMs. Our work confirms and broadens previous findings suggesting the presence of cognitive biases in LLMs by reporting evidence of all 30 tested biases in at least some of the 20 LLMs. We publish our framework code to encourage future research on biases in LLMs: https://github.com/simonmalberg/cognitive-biases-in-llms
comment: Published in "Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities"
♻ ☆ Evaluating Simplification Algorithms for Interpretability of Time Series Classification
In this work, we introduce metrics to evaluate the use of simplified time series in the context of interpretability of a TSC - a Time Series Classifier. Such simplifications are important because time series data, in contrast to text and image data, are not intuitively under- standable to humans. These metrics are related to the complexity of the simplifications - how many segments they contain - and to their loyalty - how likely they are to maintain the classification of the original time series. We focus on simplifications that select a subset of the original data points, and show that these typically have high Shapley value, thereby aiding interpretability. We employ these metrics to experimentally evaluate four distinct simplification algorithms, across several TSC algorithms and across datasets of varying characteristics, from seasonal or stationary to short or long. We subsequently perform a human-grounded evaluation with forward simulation, that confirms also the practical utility of the introduced metrics to evaluate the use of simplifications in the context of inter- pretability of TSC. Our findings are summarized in a framework for deciding, for a given TSC, if the various simplifications are likely to aid in its interpretability.
♻ ☆ Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses EMNLP
Evaluating long-form answers in high-stakes domains such as law or medicine remains a fundamental challenge. Standard metrics like BLEU and ROUGE fail to capture semantic correctness, and current LLM-based evaluators often reduce nuanced aspects of answer quality into a single undifferentiated score. We introduce DeCE, a decomposed LLM evaluation framework that separates precision (factual accuracy and relevance) and recall (coverage of required concepts), using instance-specific criteria automatically extracted from gold answer requirements. DeCE is model-agnostic and domain-general, requiring no predefined taxonomies or handcrafted rubrics. We instantiate DeCE to evaluate different LLMs on a real-world legal QA task involving multi-jurisdictional reasoning and citation grounding. DeCE achieves substantially stronger correlation with expert judgments ($r=0.78$), compared to traditional metrics ($r=0.12$), pointwise LLM scoring ($r=0.35$), and modern multidimensional evaluators ($r=0.48$). It also reveals interpretable trade-offs: generalist models favor recall, while specialized models favor precision. Importantly, only 11.95% of LLM-generated criteria required expert revision, underscoring DeCE's scalability. DeCE offers an interpretable and actionable LLM evaluation framework in expert domains.
comment: Accepted by 2025 EMNLP industry track
Machine Learning 199
☆ Continuous Autoregressive Language Models
The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9\% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models. Code: https://github.com/shaochenze/calm. Project: https://shaochenze.github.io/blog/2025/CALM.
☆ PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
Recent advances in vision-language models (VLMs) have enabled impressive multimodal reasoning, yet most medical applications remain limited to 2D imaging. In this work, we extend VLMs to 3D positron emission tomography and computed tomography (PET/CT), a domain characterized by large volumetric data, small and dispersed lesions, and lengthy radiology reports. We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams, extracted via a hybrid rule-based and large language model (LLM) pipeline. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. PETAR bridges global contextual reasoning with fine-grained lesion awareness, producing clinically coherent and localized findings. Comprehensive automated and human evaluations demonstrate that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.
☆ Dark-Field X-Ray Imaging Significantly Improves Deep-Learning based Detection of Synthetic Early-Stage Lung Tumors in Preclinical Models
Low-dose computed tomography (LDCT) is the current standard for lung cancer screening, yet its adoption and accessibility remain limited. Many regions lack LDCT infrastructure, and even among those screened, early-stage cancer detection often yield false positives, as shown in the National Lung Screening Trial (NLST) with a sensitivity of 93.8 percent and a false-positive rate of 26.6 percent. We aim to investigate whether X-ray dark-field imaging (DFI) radiograph, a technique sensitive to small-angle scatter from alveolar microstructure and less susceptible to organ shadowing, can significantly improve early-stage lung tumor detection when coupled with deep-learning segmentation. Using paired attenuation (ATTN) and DFI radiograph images of euthanized mouse lungs, we generated realistic synthetic tumors with irregular boundaries and intensity profiles consistent with physical lung contrast. A U-Net segmentation network was trained on small patches using either ATTN, DFI, or a combination of ATTN and DFI channels.Results show that the DFI-only model achieved a true-positive detection rate of 83.7 percent, compared with 51 percent for ATTN-only, while maintaining comparable specificity (90.5 versus 92.9 percent). The combined ATTN and DFI input achieved 79.6 percent sensitivity and 97.6 percent specificity. In conclusion, DFI substantially improves early-tumor detectability in comparison to standard attenuation radiography and shows potential as an accessible, low-cost, low-dose alternative for pre-clinical or limited-resource screening where LDCT is unavailable.
☆ On Selecting Few-Shot Examples for LLM-based Code Vulnerability Detection
Large language models (LLMs) have demonstrated impressive capabilities for many coding tasks, including summarization, translation, completion, and code generation. However, detecting code vulnerabilities remains a challenging task for LLMs. An effective way to improve LLM performance is in-context learning (ICL) - providing few-shot examples similar to the query, along with correct answers, can improve an LLM's ability to generate correct solutions. However, choosing the few-shot examples appropriately is crucial to improving model performance. In this paper, we explore two criteria for choosing few-shot examples for ICL used in the code vulnerability detection task. The first criterion considers if the LLM (consistently) makes a mistake or not on a sample with the intuition that LLM performance on a sample is informative about its usefulness as a few-shot example. The other criterion considers similarity of the examples with the program under query and chooses few-shot examples based on the $k$-nearest neighbors to the given sample. We perform evaluations to determine the benefits of these criteria individually as well as under various combinations, using open-source models on multiple datasets.
☆ MolChord: Structure-Sequence Alignment for Protein-Guided Drug Design
Structure-based drug design (SBDD), which maps target proteins to candidate molecular ligands, is a fundamental task in drug discovery. Effectively aligning protein structural representations with molecular representations, and ensuring alignment between generated drugs and their pharmacological properties, remains a critical challenge. To address these challenges, we propose MolChord, which integrates two key techniques: (1) to align protein and molecule structures with their textual descriptions and sequential representations (e.g., FASTA for proteins and SMILES for molecules), we leverage NatureLM, an autoregressive model unifying text, small molecules, and proteins, as the molecule generator, alongside a diffusion-based structure encoder; and (2) to guide molecules toward desired properties, we curate a property-aware dataset by integrating preference data and refine the alignment process using Direct Preference Optimization (DPO). Experimental results on CrossDocked2020 demonstrate that our approach achieves state-of-the-art performance on key evaluation metrics, highlighting its potential as a practical tool for SBDD.
comment: 21 pages
☆ Bayesian model selection and misspecification testing in imaging inverse problems only from noisy and partial measurements
Modern imaging techniques heavily rely on Bayesian statistical models to address difficult image reconstruction and restoration tasks. This paper addresses the objective evaluation of such models in settings where ground truth is unavailable, with a focus on model selection and misspecification diagnosis. Existing unsupervised model evaluation methods are often unsuitable for computational imaging due to their high computational cost and incompatibility with modern image priors defined implicitly via machine learning models. We herein propose a general methodology for unsupervised model selection and misspecification detection in Bayesian imaging sciences, based on a novel combination of Bayesian cross-validation and data fission, a randomized measurement splitting technique. The approach is compatible with any Bayesian imaging sampler, including diffusion and plug-and-play samplers. We demonstrate the methodology through experiments involving various scoring rules and types of model misspecification, where we achieve excellent selection and detection accuracy with a low computational cost.
☆ Challenges in Credit Assignment for Multi-Agent Reinforcement Learning in Open Agent Systems
In the rapidly evolving field of multi-agent reinforcement learning (MARL), understanding the dynamics of open systems is crucial. Openness in MARL refers to the dynam-ic nature of agent populations, tasks, and agent types with-in a system. Specifically, there are three types of openness as reported in (Eck et al. 2023) [2]: agent openness, where agents can enter or leave the system at any time; task openness, where new tasks emerge, and existing ones evolve or disappear; and type openness, where the capabil-ities and behaviors of agents change over time. This report provides a conceptual and empirical review, focusing on the interplay between openness and the credit assignment problem (CAP). CAP involves determining the contribution of individual agents to the overall system performance, a task that becomes increasingly complex in open environ-ments. Traditional credit assignment (CA) methods often assume static agent populations, fixed and pre-defined tasks, and stationary types, making them inadequate for open systems. We first conduct a conceptual analysis, in-troducing new sub-categories of openness to detail how events like agent turnover or task cancellation break the assumptions of environmental stationarity and fixed team composition that underpin existing CAP methods. We then present an empirical study using representative temporal and structural algorithms in an open environment. The results demonstrate that openness directly causes credit misattribution, evidenced by unstable loss functions and significant performance degradation.
☆ Information-Theoretic Greedy Layer-wise Training for Traffic Sign Recognition
Modern deep neural networks (DNNs) are typically trained with a global cross-entropy loss in a supervised end-to-end manner: neurons need to store their outgoing weights; training alternates between a forward pass (computation) and a top-down backward pass (learning) which is biologically implausible. Alternatively, greedy layer-wise training eliminates the need for cross-entropy loss and backpropagation. By avoiding the computation of intermediate gradients and the storage of intermediate outputs, it reduces memory usage and helps mitigate issues such as vanishing or exploding gradients. However, most existing layer-wise training approaches have been evaluated only on relatively small datasets with simple deep architectures. In this paper, we first systematically analyze the training dynamics of popular convolutional neural networks (CNNs) trained by stochastic gradient descent (SGD) through an information-theoretic lens. Our findings reveal that networks converge layer-by-layer from bottom to top and that the flow of information adheres to a Markov information bottleneck principle. Building on these observations, we propose a novel layer-wise training approach based on the recently developed deterministic information bottleneck (DIB) and the matrix-based R\'enyi's $\alpha$-order entropy functional. Specifically, each layer is trained jointly with an auxiliary classifier that connects directly to the output layer, enabling the learning of minimal sufficient task-relevant representations. We empirically validate the effectiveness of our training procedure on CIFAR-10 and CIFAR-100 using modern deep CNNs and further demonstrate its applicability to a practical task involving traffic sign recognition. Our approach not only outperforms existing layer-wise training baselines but also achieves performance comparable to SGD.
☆ Imbalanced Classification through the Lens of Spurious Correlations
Class imbalance poses a fundamental challenge in machine learning, frequently leading to unreliable classification performance. While prior methods focus on data- or loss-reweighting schemes, we view imbalance as a data condition that amplifies Clever Hans (CH) effects by underspecification of minority classes. In a counterfactual explanations-based approach, we propose to leverage Explainable AI to jointly identify and eliminate CH effects emerging under imbalance. Our method achieves competitive classification performance on three datasets and demonstrates how CH effects emerge under imbalance, a perspective largely overlooked by existing approaches.
☆ Bayesian Optimization on Networks
This paper studies optimization on networks modeled as metric graphs. Motivated by applications where the objective function is expensive to evaluate or only available as a black box, we develop Bayesian optimization algorithms that sequentially update a Gaussian process surrogate model of the objective to guide the acquisition of query points. To ensure that the surrogates are tailored to the network's geometry, we adopt Whittle-Mat\'ern Gaussian process prior models defined via stochastic partial differential equations on metric graphs. In addition to establishing regret bounds for optimizing sufficiently smooth objective functions, we analyze the practical case in which the smoothness of the objective is unknown and the Whittle-Mat\'ern prior is represented using finite elements. Numerical results demonstrate the effectiveness of our algorithms for optimizing benchmark objective functions on a synthetic metric graph and for Bayesian inversion via maximum a posteriori estimation on a telecommunication network.
comment: 36 pages, 6 figures; includes appendices
☆ SpecAttn: Speculating Sparse Attention NeurIPS 2025
Large Language Models (LLMs) face significant computational bottlenecks during inference due to the quadratic complexity of self-attention mechanisms, particularly as context lengths increase. We introduce SpecAttn, a novel training-free approach that seamlessly integrates with existing speculative decoding techniques to enable efficient sparse attention in pre-trained transformers. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model, eliminating redundant computation while maintaining output quality. SpecAttn employs three core techniques: KL divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free algorithm for top-p token selection from draft attention patterns, and dynamic key-value cache pruning guided by these predictions. By leveraging the computational work already performed in standard speculative decoding pipelines, SpecAttn achieves over 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset, significantly outperforming existing sparse attention methods. Our approach demonstrates that speculative execution can be enhanced to provide approximate verification without significant performance degradation.
comment: Accepted to NeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling
☆ Enhancing software product lines with machine learning components
Modern software systems increasingly integrate machine learning (ML) due to its advancements and ability to enhance data-driven decision-making. However, this integration introduces significant challenges for software engineering, especially in software product lines (SPLs), where managing variability and reuse becomes more complex with the inclusion of ML components. Although existing approaches have addressed variability management in SPLs and the integration of ML components in isolated systems, few have explored the intersection of both domains. Specifically, there is limited support for modeling and managing variability in SPLs that incorporate ML components. To bridge this gap, this article proposes a structured framework designed to extend Software Product Line engineering, facilitating the integration of ML components. It facilitates the design of SPLs with ML capabilities by enabling systematic modeling of variability and reuse. The proposal has been partially implemented with the VariaMos tool.
comment: pp. 73-94, 2 figures
☆ Panprediction: Optimal Predictions for Any Downstream Task and Loss
Supervised learning is classically formulated as training a model to minimize a fixed loss function over a fixed distribution, or task. However, an emerging paradigm instead views model training as extracting enough information from data so that the model can be used to minimize many losses on many downstream tasks. We formalize a mathematical framework for this paradigm, which we call panprediction, and study its statistical complexity. Formally, panprediction generalizes omniprediction and sits upstream from multi-group learning, which respectively focus on predictions that generalize to many downstream losses or many downstream tasks, but not both. Concretely, we design algorithms that learn deterministic and randomized panpredictors with $\tilde{O}(1/\varepsilon^3)$ and $\tilde{O}(1/\varepsilon^2)$ samples, respectively. Our results demonstrate that under mild assumptions, simultaneously minimizing infinitely many losses on infinitely many tasks can be as statistically easy as minimizing one loss on one task. Along the way, we improve the best known sample complexity guarantee of deterministic omniprediction by a factor of $1/\varepsilon$, and match all other known sample complexity guarantees of omniprediction and multi-group learning. Our key technical ingredient is a nearly lossless reduction from panprediction to a statistically efficient notion of calibration, called step calibration.
comment: 25 pages
☆ ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling
Formulating optimization problems for industrial applications demands significant manual effort and domain expertise. While Large Language Models (LLMs) show promise in automating this process, evaluating their performance remains difficult due to the absence of robust metrics. Existing solver-based approaches often face inconsistency, infeasibility issues, and high computational costs. To address these issues, we propose ORGEval, a graph-theoretic evaluation framework for assessing LLMs' capabilities in formulating linear and mixed-integer linear programs. ORGEval represents optimization models as graphs, reducing equivalence detection to graph isomorphism testing. We identify and prove a sufficient condition, when the tested graphs are symmetric decomposable (SD), under which the Weisfeiler-Lehman (WL) test is guaranteed to correctly detect isomorphism. Building on this, ORGEval integrates a tailored variant of the WL-test with an SD detection algorithm to evaluate model equivalence. By focusing on structural equivalence rather than instance-level configurations, ORGEval is robust to numerical variations. Experimental results show that our method can successfully detect model equivalence and produce 100\% consistent results across random parameter configurations, while significantly outperforming solver-based methods in runtime, especially on difficult problems. Leveraging ORGEval, we construct the Bench4Opt dataset and benchmark state-of-the-art LLMs on optimization modeling. Our results reveal that although optimization modeling remains challenging for all LLMs, DeepSeek-V3 and Claude-Opus-4 achieve the highest accuracies under direct prompting, outperforming even leading reasoning models.
☆ Learned Static Function Data Structures
We consider the task of constructing a data structure for associating a static set of keys with values, while allowing arbitrary output values for queries involving keys outside the set. Compared to hash tables, these so-called static function data structures do not need to store the key set and thus use significantly less memory. Several techniques are known, with compressed static functions approaching the zero-order empirical entropy of the value sequence. In this paper, we introduce learned static functions, which use machine learning to capture correlations between keys and values. For each key, a model predicts a probability distribution over the values, from which we derive a key-specific prefix code to compactly encode the true value. The resulting codeword is stored in a classic static function data structure. This design allows learned static functions to break the zero-order entropy barrier while still supporting point queries. Our experiments show substantial space savings: up to one order of magnitude on real data, and up to three orders of magnitude on synthetic data.
☆ Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it particularly well-for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
☆ Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.
☆ Optimal Convergence Analysis of DDPM for General Distributions
Score-based diffusion models have achieved remarkable empirical success in generating high-quality samples from target data distributions. Among them, the Denoising Diffusion Probabilistic Model (DDPM) is one of the most widely used samplers, generating samples via estimated score functions. Despite its empirical success, a tight theoretical understanding of DDPM -- especially its convergence properties -- remains limited. In this paper, we provide a refined convergence analysis of the DDPM sampler and establish near-optimal convergence rates under general distributional assumptions. Specifically, we introduce a relaxed smoothness condition parameterized by a constant $L$, which is small for many practical distributions (e.g., Gaussian mixture models). We prove that the DDPM sampler with accurate score estimates achieves a convergence rate of $$\widetilde{O}\left(\frac{d\min\{d,L^2\}}{T^2}\right)~\text{in Kullback-Leibler divergence},$$ where $d$ is the data dimension, $T$ is the number of iterations, and $\widetilde{O}$ hides polylogarithmic factors in $T$. This result substantially improves upon the best-known $d^2/T^2$ rate when $L < \sqrt{d}$. By establishing a matching lower bound, we show that our convergence analysis is tight for a wide array of target distributions. Moreover, it reveals that DDPM and DDIM share the same dependence on $d$, raising an interesting question of why DDIM often appears empirically faster.
☆ Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs
This paper presents a framework that leverages pre-trained foundation models for robotic manipulation without domain-specific training. The framework integrates off-the-shelf models, combining multimodal perception from foundation models with a general-purpose reasoning model capable of robust task sequencing. Scene graphs, dynamically maintained within the framework, provide spatial awareness and enable consistent reasoning about the environment. The framework is evaluated through a series of tabletop robotic manipulation experiments, and the results highlight its potential for building robotic manipulation systems directly on top of off-the-shelf foundation models.
☆ AstuteRAG-FQA: Task-Aware Retrieval-Augmented Generation Framework for Proprietary Data Challenges in Financial Question Answering
Retrieval-Augmented Generation (RAG) shows significant promise in knowledge-intensive tasks by improving domain specificity, enhancing temporal relevance, and reducing hallucinations. However, applying RAG to finance encounters critical challenges: restricted access to proprietary datasets, limited retrieval accuracy, regulatory constraints, and sensitive data interpretation. We introduce AstuteRAG-FQA, an adaptive RAG framework tailored for Financial Question Answering (FQA), leveraging task-aware prompt engineering to address these challenges. The framework uses a hybrid retrieval strategy integrating both open-source and proprietary financial data while maintaining strict security protocols and regulatory compliance. A dynamic prompt framework adapts in real time to query complexity, improving precision and contextual relevance. To systematically address diverse financial queries, we propose a four-tier task classification: explicit factual, implicit factual, interpretable rationale, and hidden rationale involving implicit causal reasoning. For each category, we identify key challenges, datasets, and optimization techniques within the retrieval and generation process. The framework incorporates multi-layered security mechanisms including differential privacy, data anonymization, and role-based access controls to protect sensitive financial information. Additionally, AstuteRAG-FQA implements real-time compliance monitoring through automated regulatory validation systems that verify responses against industry standards and legal obligations. We evaluate three data integration techniques - contextual embedding, small model augmentation, and targeted fine-tuning - analyzing their efficiency and feasibility across varied financial environments.
☆ Representing Classical Compositions through Implication-Realization Temporal-Gestalt Graphs
Understanding the structural and cognitive underpinnings of musical compositions remains a key challenge in music theory and computational musicology. While traditional methods focus on harmony and rhythm, cognitive models such as the Implication-Realization (I-R) model and Temporal Gestalt theory offer insight into how listeners perceive and anticipate musical structure. This study presents a graph-based computational approach that operationalizes these models by segmenting melodies into perceptual units and annotating them with I-R patterns. These segments are compared using Dynamic Time Warping and organized into k-nearest neighbors graphs to model intra- and inter-segment relationships. Each segment is represented as a node in the graph, and nodes are further labeled with melodic expectancy values derived from Schellenberg's two-factor I-R model-quantifying pitch proximity and pitch reversal at the segment level. This labeling enables the graphs to encode both structural and cognitive information, reflecting how listeners experience musical tension and resolution. To evaluate the expressiveness of these graphs, we apply the Weisfeiler-Lehman graph kernel to measure similarity between and within compositions. Results reveal statistically significant distinctions between intra- and inter-graph structures. Segment-level analysis via multidimensional scaling confirms that structural similarity at the graph level reflects perceptual similarity at the segment level. Graph2vec embeddings and clustering demonstrate that these representations capture stylistic and structural features that extend beyond composer identity. These findings highlight the potential of graph-based methods as a structured, cognitively informed framework for computational music analysis, enabling a more nuanced understanding of musical structure and style through the lens of listener perception.
comment: 8 pages, 11 figures
☆ TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control
Large Language Models (LLMs) training is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers, 2) OsciReset, an algorithm to suppress weight oscillation, and 3) OutControl, an algorithm to retain outlier accuracy. TetraJet-v2 consistently outperforms prior FP4 training methods on pre-training LLMs across varying model sizes up to 370M and data sizes up to 200B tokens, reducing the performance gap to full-precision training by an average of 51.3%.
☆ Active transfer learning for structural health monitoring
Data for training structural health monitoring (SHM) systems are often expensive and/or impractical to obtain, particularly for labelled data. Population-based SHM (PBSHM) aims to address this limitation by leveraging data from multiple structures. However, data from different structures will follow distinct distributions, potentially leading to large generalisation errors for models learnt via conventional machine learning methods. To address this issue, transfer learning -- in the form of domain adaptation (DA) -- can be used to align the data distributions. Most previous approaches have only considered \emph{unsupervised} DA, where no labelled target data are available; they do not consider how to incorporate these technologies in an online framework -- updating as labels are obtained throughout the monitoring campaign. This paper proposes a Bayesian framework for DA in PBSHM, that can improve unsupervised DA mappings using a limited quantity of labelled target data. In addition, this model is integrated into an active sampling strategy to guide inspections to select the most informative observations to label -- leading to further reductions in the required labelled data to learn a target classifier. The effectiveness of this methodology is evaluated on a population of experimental bridges. Specifically, this population includes data corresponding to several damage states, as well as, a comprehensive set of environmental conditions. It is found that combining transfer learning and active learning can improve data efficiency when learning classification models in label-scarce scenarios. This result has implications for data-informed operation and maintenance of structures, suggesting a reduction in inspections over the operational lifetime of a structure -- and therefore a reduction in operational costs -- can be achieved.
☆ Leveraging Generic Time Series Foundation Models for EEG Classification
Foundation models for time series are emerging as powerful general-purpose backbones, yet their potential for domain-specific biomedical signals such as electroencephalography (EEG) remains rather unexplored. In this work, we investigate the applicability a recently proposed time series classification foundation model, to a different EEG tasks such as motor imagery classification and sleep stage prediction. We test two pretraining regimes: (a) pretraining on heterogeneous real-world time series from multiple domains, and (b) pretraining on purely synthetic data. We find that both variants yield strong performance, consistently outperforming EEGNet, a widely used convolutional baseline, and CBraMod, the most recent EEG-specific foundation model. These results suggest that generalist time series foundation models, even when pretrained on data of non-neural origin or on synthetic signals, can transfer effectively to EEG. Our findings highlight the promise of leveraging cross-domain pretrained models for brain signal analysis, suggesting that EEG may benefit from advances in the broader time series literature.
☆ Learning Sparse Approximate Inverse Preconditioners for Conjugate Gradient Solvers on GPUs NeurIPS 2025
The conjugate gradient solver (CG) is a prevalent method for solving symmetric and positive definite linear systems Ax=b, where effective preconditioners are crucial for fast convergence. Traditional preconditioners rely on prescribed algorithms to offer rigorous theoretical guarantees, while limiting their ability to exploit optimization from data. Existing learning-based methods often utilize Graph Neural Networks (GNNs) to improve the performance and speed up the construction. However, their reliance on incomplete factorization leads to significant challenges: the associated triangular solve hinders GPU parallelization in practice, and introduces long-range dependencies which are difficult for GNNs to model. To address these issues, we propose a learning-based method to generate GPU-friendly preconditioners, particularly using GNNs to construct Sparse Approximate Inverse (SPAI) preconditioners, which avoids triangular solves and requires only two matrix-vector products at each CG step. The locality of matrix-vector product is compatible with the local propagation mechanism of GNNs. The flexibility of GNNs also allows our approach to be applied in a wide range of scenarios. Furthermore, we introduce a statistics-based scale-invariant loss function. Its design matches CG's property that the convergence rate depends on the condition number, rather than the absolute scale of A, leading to improved performance of the learned preconditioner. Evaluations on three PDE-derived datasets and one synthetic dataset demonstrate that our method outperforms standard preconditioners (Diagonal, IC, and traditional SPAI) and previous learning-based preconditioners on GPUs. We reduce solution time on GPUs by 40%-53% (68%-113% faster), along with better condition numbers and superior generalization performance. Source code available at https://github.com/Adversarr/LearningSparsePreconditioner4GPU
comment: NeurIPS 2025, poster
☆ BiSparse-AAS: Bilinear Sparse Attention and Adaptive Spans Framework for Scalable and Efficient Text Summarization ICDM
Transformer-based architectures have advanced text summarization, yet their quadratic complexity limits scalability on long documents. This paper introduces BiSparse-AAS (Bilinear Sparse Attention with Adaptive Spans), a novel framework that combines sparse attention, adaptive spans, and bilinear attention to address these limitations. Sparse attention reduces computational costs by focusing on the most relevant parts of the input, while adaptive spans dynamically adjust the attention ranges. Bilinear attention complements both by modeling complex token interactions within this refined context. BiSparse-AAS consistently outperforms state-of-the-art baselines in both extractive and abstractive summarization tasks, achieving average ROUGE improvements of about 68.1% on CNN/DailyMail and 52.6% on XSum, while maintaining strong performance on OpenWebText and Gigaword datasets. By addressing efficiency, scalability, and long-sequence modeling, BiSparse-AAS provides a unified, practical solution for real-world text summarization applications.
comment: Accepted at the IEEE International Conference on Data Mining (ICDM) 2025, Washington, DC, USA
☆ Asynchronous Risk-Aware Multi-Agent Packet Routing for Ultra-Dense LEO Satellite Networks
The rise of ultra-dense LEO constellations creates a complex and asynchronous network environment, driven by their massive scale, dynamic topologies, and significant delays. This unique complexity demands an adaptive packet routing algorithm that is asynchronous, risk-aware, and capable of balancing diverse and often conflicting QoS objectives in a decentralized manner. However, existing methods fail to address this need, as they typically rely on impractical synchronous decision-making and/or risk-oblivious approaches. To tackle this gap, we introduce PRIMAL, an event-driven multi-agent routing framework designed specifically to allow each satellite to act independently on its own event-driven timeline, while managing the risk of worst-case performance degradation via a principled primal-dual approach. This is achieved by enabling agents to learn the full cost distribution of the targeted QoS objectives and constrain tail-end risks. Extensive simulations on a LEO constellation with 1584 satellites validate its superiority in effectively optimizing latency and balancing load. Compared to a recent risk-oblivious baseline, it reduces queuing delay by over 70%, and achieves a nearly 12 ms end-to-end delay reduction in loaded scenarios. This is accomplished by resolving the core conflict between naive shortest-path finding and congestion avoidance, highlighting such autonomous risk-awareness as a key to robust routing.
☆ DP-FedPGN: Finding Global Flat Minima for Differentially Private Federated Learning via Penalizing Gradient Norm
To prevent inference attacks in Federated Learning (FL) and reduce the leakage of sensitive information, Client-level Differentially Private Federated Learning (CL-DPFL) is widely used. However, current CL-DPFL methods usually result in sharper loss landscapes, which leads to a decrease in model generalization after differential privacy protection. By using Sharpness Aware Minimization (SAM), the current popular federated learning methods are to find a local flat minimum value to alleviate this problem. However, the local flatness may not reflect the global flatness in CL-DPFL. Therefore, to address this issue and seek global flat minima of models, we propose a new CL-DPFL algorithm, DP-FedPGN, in which we introduce a global gradient norm penalty to the local loss to find the global flat minimum. Moreover, by using our global gradient norm penalty, we not only find a flatter global minimum but also reduce the locally updated norm, which means that we further reduce the error of gradient clipping. From a theoretical perspective, we analyze how DP-FedPGN mitigates the performance degradation caused by DP. Meanwhile, the proposed DP-FedPGN algorithm eliminates the impact of data heterogeneity and achieves fast convergence. We also use R\'enyi DP to provide strict privacy guarantees and provide sensitivity analysis for local updates. Finally, we conduct effectiveness tests on both ResNet and Transformer models, and achieve significant improvements in six visual and natural language processing tasks compared to existing state-of-the-art algorithms. The code is available at https://github.com/junkangLiu0/DP-FedPGN
comment: 21 pages, 8 figures
☆ pDANSE: Particle-based Data-driven Nonlinear State Estimation from Nonlinear Measurements
We consider the problem of designing a data-driven nonlinear state estimation (DANSE) method that uses (noisy) nonlinear measurements of a process whose underlying state transition model (STM) is unknown. Such a process is referred to as a model-free process. A recurrent neural network (RNN) provides parameters of a Gaussian prior that characterize the state of the model-free process, using all previous measurements at a given time point. In the case of DANSE, the measurement system was linear, leading to a closed-form solution for the state posterior. However, the presence of a nonlinear measurement system renders a closed-form solution infeasible. Instead, the second-order statistics of the state posterior are computed using the nonlinear measurements observed at the time point. We address the nonlinear measurements using a reparameterization trick-based particle sampling approach, and estimate the second-order statistics of the state posterior. The proposed method is referred to as particle-based DANSE (pDANSE). The RNN of pDANSE uses sequential measurements efficiently and avoids the use of computationally intensive sequential Monte-Carlo (SMC) and/or ancestral sampling. We describe the semi-supervised learning method for pDANSE, which transitions to unsupervised learning in the absence of labeled data. Using a stochastic Lorenz-$63$ system as a benchmark process, we experimentally demonstrate the state estimation performance for four nonlinear measurement systems. We explore cubic nonlinearity and a camera-model nonlinearity where unsupervised learning is used; then we explore half-wave rectification nonlinearity and Cartesian-to-spherical nonlinearity where semi-supervised learning is used. The performance of state estimation is shown to be competitive vis-\`a-vis particle filters that have complete knowledge of the STM of the Lorenz-$63$ system.
comment: 11 pages, 10 figures, under review at IEEE Transactions on Signal Processing
☆ Minimax-Optimal Two-Sample Test with Sliced Wasserstein
We study the problem of nonparametric two-sample testing using the sliced Wasserstein (SW) distance. While prior theoretical and empirical work indicates that the SW distance offers a promising balance between strong statistical guarantees and computational efficiency, its theoretical foundations for hypothesis testing remain limited. We address this gap by proposing a permutation-based SW test and analyzing its performance. The test inherits finite-sample Type I error control from the permutation principle. Moreover, we establish non-asymptotic power bounds and show that the procedure achieves the minimax separation rate $n^{-1/2}$ over multinomial and bounded-support alternatives, matching the optimal guarantees of kernel-based tests while building on the geometric foundations of Wasserstein distances. Our analysis further quantifies the trade-off between the number of projections and statistical power. Finally, numerical experiments demonstrate that the test combines finite-sample validity with competitive power and scalability, and -- unlike kernel-based tests, which require careful kernel tuning -- it performs consistently well across all scenarios we consider.
☆ InertialAR: Autoregressive 3D Molecule Generation with Inertial Frames
Transformer-based autoregressive models have emerged as a unifying paradigm across modalities such as text and images, but their extension to 3D molecule generation remains underexplored. The gap stems from two fundamental challenges: (1) tokenizing molecules into a canonical 1D sequence of tokens that is invariant to both SE(3) transformations and atom index permutations, and (2) designing an architecture capable of modeling hybrid atom-based tokens that couple discrete atom types with continuous 3D coordinates. To address these challenges, we introduce InertialAR. InertialAR devises a canonical tokenization that aligns molecules to their inertial frames and reorders atoms to ensure SE(3) and permutation invariance. Moreover, InertialAR equips the attention mechanism with geometric awareness via geometric rotary positional encoding (GeoRoPE). In addition, it utilizes a hierarchical autoregressive paradigm to predict the next atom-based token, predicting the atom type first and then its 3D coordinates via Diffusion loss. Experimentally, InertialAR achieves state-of-the-art performance on 7 of the 10 evaluation metrics for unconditional molecule generation across QM9, GEOM-Drugs, and B3LYP. Moreover, it significantly outperforms strong baselines in controllable generation for targeted chemical functionality, attaining state-of-the-art results across all 5 metrics.
☆ FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models
AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high variance in the second-moment estimate $\boldsymbol{v}$; (2) the local overfitting of AdamW may cause client drift; and (3) Reinitializing moment estimates ($\boldsymbol{v}$, $\boldsymbol{m}$) at each round slows down convergence. To address these challenges, we propose the first \underline{Fed}erated \underline{AdamW} algorithm, called \texttt{FedAdamW}, for training and fine-tuning various large models. \texttt{FedAdamW} aligns local updates with the global update using both a \textbf{local correction mechanism} and decoupled weight decay to mitigate local overfitting. \texttt{FedAdamW} efficiently aggregates the \texttt{mean} of the second-moment estimates to reduce their variance and reinitialize them. Theoretically, we prove that \texttt{FedAdamW} achieves a linear speedup convergence rate of $\mathcal{O}(\sqrt{(L \Delta \sigma_l^2)/(S K R \epsilon^2)}+(L \Delta)/R)$ without \textbf{heterogeneity assumption}, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. We also employ PAC-Bayesian generalization analysis to explain the effectiveness of decoupled weight decay in local training. Empirically, we validate the effectiveness of \texttt{FedAdamW} on language and vision Transformer models. Compared to several baselines, \texttt{FedAdamW} significantly reduces communication rounds and improves test accuracy. The code is available in https://github.com/junkangLiu0/FedAdamW.
☆ Thought Branches: Interpreting LLM Reasoning Requires Resampling
Most work interpreting reasoning models studies only a single chain-of-thought (CoT), yet these models define distributions over many possible CoTs. We argue that studying a single sample is inadequate for understanding causal influence and the underlying computation. Though fully specifying this distribution is intractable, it can be understood by sampling. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In "agentic misalignment" scenarios, we resample specific sentences to measure their downstream effects. Self-preservation sentences have small causal impact, suggesting they do not meaningfully drive blackmail. Second, are artificial edits to CoT sufficient for steering reasoning? These are common in literature, yet take the model off-policy. Resampling and selecting a completion with the desired property is a principled on-policy alternative. We find off-policy interventions yield small and unstable effects compared to resampling in decision-making tasks. Third, how do we understand the effect of removing a reasoning step when the model may repeat it post-edit? We introduce a resilience metric that repeatedly resamples to prevent similar content from reappearing downstream. Critical planning statements resist removal but have large effects when eliminated. Fourth, since CoT is sometimes "unfaithful", can our methods teach us anything in these settings? Adapting causal mediation analysis, we find that hints that have a causal effect on the output without being explicitly mentioned exert a subtle and cumulative influence on the CoT that persists even if the hint is removed. Overall, studying distributions via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled CoT interventions.
comment: Uzay Macar and Paul C. Bogdan contributed equally to this work, and their listed order was determined by coinflip
☆ Simplex-to-Euclidean Bijections for Categorical Flow Matching
We propose a method for learning and sampling from probability distributions supported on the simplex. Our approach maps the open simplex to Euclidean space via smooth bijections, leveraging the Aitchison geometry to define the mappings, and supports modeling categorical data by a Dirichlet interpolation that dequantizes discrete observations into continuous ones. This enables density modeling in Euclidean space through the bijection while still allowing exact recovery of the original discrete distribution. Compared to previous methods that operate on the simplex using Riemannian geometry or custom noise processes, our approach works in Euclidean space while respecting the Aitchison geometry, and achieves competitive performance on both synthetic and real-world data sets.
☆ Spectral Neural Graph Sparsification
Graphs are central to modeling complex systems in domains such as social networks, molecular chemistry, and neuroscience. While Graph Neural Networks, particularly Graph Convolutional Networks, have become standard tools for graph learning, they remain constrained by reliance on fixed structures and susceptibility to over-smoothing. We propose the Spectral Preservation Network, a new framework for graph representation learning that generates reduced graphs serving as faithful proxies of the original, enabling downstream tasks such as community detection, influence propagation, and information diffusion at a reduced computational cost. The Spectral Preservation Network introduces two key components: the Joint Graph Evolution layer and the Spectral Concordance loss. The former jointly transforms both the graph topology and the node feature matrix, allowing the structure and attributes to evolve adaptively across layers and overcoming the rigidity of static neighborhood aggregation. The latter regularizes these transformations by enforcing consistency in both the spectral properties of the graph and the feature vectors of the nodes. We evaluate the effectiveness of Spectral Preservation Network on node-level sparsification by analyzing well-established metrics and benchmarking against state-of-the-art methods. The experimental results demonstrate the superior performance and clear advantages of our approach.
☆ MVeLMA: Multimodal Vegetation Loss Modeling Architecture for Predicting Post-fire Vegetation Loss SP
Understanding post-wildfire vegetation loss is critical for developing effective ecological recovery strategies and is often challenging due to the extended time and effort required to capture the evolving ecosystem features. Recent works in this area have not fully explored all the contributing factors, their modalities, and interactions with each other. Furthermore, most research in this domain is limited by a lack of interpretability in predictive modeling, making it less useful in real-world settings. In this work, we propose a novel end-to-end ML pipeline called MVeLMA (\textbf{M}ultimodal \textbf{Ve}getation \textbf{L}oss \textbf{M}odeling \textbf{A}rchitecture) to predict county-wise vegetation loss from fire events. MVeLMA uses a multimodal feature integration pipeline and a stacked ensemble-based architecture to capture different modalities while also incorporating uncertainty estimation through probabilistic modeling. Through comprehensive experiments, we show that our model outperforms several state-of-the-art (SOTA) and baseline models in predicting post-wildfire vegetation loss. Furthermore, we generate vegetation loss confidence maps to identify high-risk counties, thereby helping targeted recovery efforts. The findings of this work have the potential to inform future disaster relief planning, ecological policy development, and wildlife recovery management.
comment: Accepted for 2025 ACM SIGSPATIAL conference
☆ Atlas-Alignment: Making Interpretability Transferable Across Language Models
Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires costly training of model-specific sparse autoencoders, manual or semi-automated labeling of SAE components, and their subsequent validation. We introduce Atlas-Alignment, a framework for transferring interpretability across language models by aligning unknown latent spaces to a Concept Atlas - a labeled, human-interpretable latent space - using only shared inputs and lightweight representational alignment techniques. Once aligned, this enables two key capabilities in previously opaque models: (1) semantic feature search and retrieval, and (2) steering generation along human-interpretable atlas concepts. Through quantitative and qualitative evaluations, we show that simple representational alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept data. Atlas-Alignment thus amortizes the cost of explainable AI and mechanistic interpretability: by investing in one high-quality Concept Atlas, we can make many new models transparent and controllable at minimal marginal cost.
☆ Estimation of aboveground biomass in a tropical dry forest: An intercomparison of airborne, unmanned, and space laser scanning
According to the Paris Climate Change Agreement, all nations are required to submit reports on their greenhouse gas emissions and absorption every two years by 2024. Consequently, forests play a crucial role in reducing carbon emissions, which is essential for meeting these obligations. Recognizing the significance of forest conservation in the global battle against climate change, Article 5 of the Paris Agreement emphasizes the need for high-quality forest data. This study focuses on enhancing methods for mapping aboveground biomass in tropical dry forests. Tropical dry forests are considered one of the least understood tropical forest environments; therefore, there is a need for accurate approaches to estimate carbon pools. We employ a comparative analysis of AGB estimates, utilizing different discrete and full-waveform laser scanning datasets in conjunction with Ordinary Least Squares and Bayesian approaches SVM. Airborne Laser Scanning, Unmanned Laser Scanning, and Space Laser Scanning were used as independent variables for extracting forest metrics. Variable selection, SVM regression tuning, and cross-validation via a machine-learning approach were applied to account for overfitting and underfitting. The results indicate that six key variables primarily related to tree height: Elev.minimum, Elev.L3, lev.MAD.mode, Elev.mode, Elev.MAD.median, and Elev.skewness, are important for AGB estimation using ALSD and ULSD , while Leaf Area Index, canopy coverage and height, terrain elevation, and full-waveform signal energy emerged as the most vital variables. AGB values estimated from ten permanent tropical dry forest plots in Costa Rica Guanacaste province ranged from 26.02 Mg/ha to 175.43 Mg/ha . The SVM regressions demonstrated a 17.89 error across all laser scanning systems, with SLSF W exhibiting the lowest error 17.07 in estimating total biomass per plot.
comment: 32 pages, 17 figures, research paper
☆ FedMuon: Accelerating Federated Learning with Matrix Orthogonalization
The core bottleneck of Federated Learning (FL) lies in the communication rounds. That is, how to achieve more effective local updates is crucial for reducing communication rounds. Existing FL methods still primarily use element-wise local optimizers (Adam/SGD), neglecting the geometric structure of the weight matrices. This often leads to the amplification of pathological directions in the weights during local updates, leading deterioration in the condition number and slow convergence. Therefore, we introduce the Muon optimizer in local, which has matrix orthogonalization to optimize matrix-structured parameters. Experimental results show that, in IID setting, Local Muon significantly accelerates the convergence of FL and reduces communication rounds compared to Local SGD and Local AdamW. However, in non-IID setting, independent matrix orthogonalization based on the local distributions of each client induces strong client drift. Applying Muon in non-IID FL poses significant challenges: (1) client preconditioner leading to client drift; (2) moment reinitialization. To address these challenges, we propose a novel Federated Muon optimizer (FedMuon), which incorporates two key techniques: (1) momentum aggregation, where clients use the aggregated momentum for local initialization; (2) local-global alignment, where the local gradients are aligned with the global update direction to significantly reduce client drift. Theoretically, we prove that \texttt{FedMuon} achieves a linear speedup convergence rate without the heterogeneity assumption, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. Empirically, we validate the effectiveness of FedMuon on language and vision models. Compared to several baselines, FedMuon significantly reduces communication rounds and improves test accuracy.
☆ Interpretable Model-Aware Counterfactual Explanations for Random Forest
Despite their enormous predictive power, machine learning models are often unsuitable for applications in regulated industries such as finance, due to their limited capacity to provide explanations. While model-agnostic frameworks such as Shapley values have proved to be convenient and popular, they rarely align with the kinds of causal explanations that are typically sought after. Counterfactual case-based explanations, where an individual is informed of which circumstances would need to be different to cause a change in outcome, may be more intuitive and actionable. However, finding appropriate counterfactual cases is an open challenge, as is interpreting which features are most critical for the change in outcome. Here, we pose the question of counterfactual search and interpretation in terms of similarity learning, exploiting the representation learned by the random forest predictive model itself. Once a counterfactual is found, the feature importance of the explanation is computed as a function of which random forest partitions are crossed in order to reach it from the original instance. We demonstrate this method on both the MNIST hand-drawn digit dataset and the German credit dataset, finding that it generates explanations that are sparser and more useful than Shapley values.
comment: Presented at XAI-FIN-2025: International Joint Workshop on Explainable AI in Finance: Achieving Trustworthy Financial Decision-Making; November 15, 2025; Singapore
☆ Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds
Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
☆ On the Equivalence of Optimal Transport Problem and Action Matching with Optimal Vector Fields
Flow Matching (FM) method in generative modeling maps arbitrary probability distributions by constructing an interpolation between them and then learning the vector field that defines ODE for this interpolation. Recently, it was shown that FM can be modified to map distributions optimally in terms of the quadratic cost function for any initial interpolation. To achieve this, only specific optimal vector fields, which are typical for solutions of Optimal Transport (OT) problems, need to be considered during FM loss minimization. In this note, we show that considering only optimal vector fields can lead to OT in another approach: Action Matching (AM). Unlike FM, which learns a vector field for a manually chosen interpolation between given distributions, AM learns the vector field that defines ODE for an entire given sequence of distributions.
☆ Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity
Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model's external `working memory', a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.
comment: Project page at https://ajmeek.github.io/cot_monitorability_website/
☆ FPS: Feedforward-based Parameter Selection For Efficient Fine-Tuning
Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key strategy for adapting large-scale pre-trained models to downstream tasks, but existing approaches face notable limitations. Addition-based methods, such as Adapters [1], introduce inference latency and engineering complexity, while selection-based methods like Gradient-based Parameter Selection (GPS) [2] require a full backward pass, which results in the same peak memory usage as full fine-tuning. To address this dilemma, we propose Feedforward-based Parameter Selection (FPS), a gradient-free method that identifies an optimal parameter subset in a single forward pass. FPS ranks parameters by the product of their magnitudes and corresponding input activations, leveraging both pre-trained knowledge and downstream data. Evaluated on $24$ visual tasks from FGVC and VTAB-1k, FPS achieves performance comparable to state-of-the-art methods while reducing peak memory usage by nearly $9 \times$ and accelerating parameter selection by about $2 \times$, offering a genuinely memory-efficient and practical solution for fine-tuning large-scale pre-trained models.
☆ Pairwise and Attribute-Aware Decision Tree-Based Preference Elicitation for Cold-Start Recommendation
Recommender systems (RSs) are intelligent filtering methods that suggest items to users based on their inferred preferences, derived from their interaction history on the platform. Collaborative filtering-based RSs rely on users past interactions to generate recommendations. However, when a user is new to the platform, referred to as a cold-start user, there is no historical data available, making it difficult to provide personalized recommendations. To address this, rating elicitation techniques can be used to gather initial ratings or preferences on selected items, helping to build an early understanding of the user's tastes. Rating elicitation approaches are generally categorized into two types: non-personalized and personalized. Decision tree-based rating elicitation is a personalized method that queries users about their preferences at each node of the tree until sufficient information is gathered. In this paper, we propose an extension to the decision tree approach for rating elicitation in the context of music recommendation. Our method: (i) elicits not only item ratings but also preferences on attributes such as genres to better cluster users, and (ii) uses item pairs instead of single items at each node to more effectively learn user preferences. Experimental results demonstrate that both proposed enhancements lead to improved performance, particularly with a reduced number of queries.
☆ Reasoning Models Sometimes Output Illegible Chains of Thought
Language models trained via outcome-based reinforcement learning (RL) to reason using chain-of-thought (CoT) have shown remarkable performance. Monitoring such a model's CoT may allow us to understand its intentions and detect potential malicious behavior. However, to be effective, this requires that CoTs are legible and faithful. We study CoT legibility across 14 reasoning models, finding that RL often causes reasoning to become illegible to both humans and AI monitors, with reasoning models (except Claude) generating illegible CoTs while returning to perfectly readable final answers. We show that models use illegible reasoning to reach correct answers (accuracy dropping by 53\% when forced to use only legible portions), yet find no correlation between legibility and performance when resampling - suggesting the relationship is more nuanced. We also find that legibility degrades on harder questions. We discuss potential hypotheses for these results, including steganography, training artifacts, and vestigial tokens. These results suggest that without explicit optimization for legibility, outcome-based RL naturally produces models with increasingly opaque reasoning processes, potentially undermining monitoring approaches.
☆ When AI Trading Agents Compete: Adverse Selection of Meta-Orders by Reinforcement Learning-Based Market Making
We investigate the mechanisms by which medium-frequency trading agents are adversely selected by opportunistic high-frequency traders. We use reinforcement learning (RL) within a Hawkes Limit Order Book (LOB) model in order to replicate the behaviours of high-frequency market makers. In contrast to the classical models with exogenous price impact assumptions, the Hawkes model accounts for endogenous price impact and other key properties of the market (Jain et al. 2024a). Given the real-world impracticalities of the market maker updating strategies for every event in the LOB, we formulate the high-frequency market making agent via an impulse control reinforcement learning framework (Jain et al. 2025). The RL used in the simulation utilises Proximal Policy Optimisation (PPO) and self-imitation learning. To replicate the adverse selection phenomenon, we test the RL agent trading against a medium frequency trader (MFT) executing a meta-order and demonstrate that, with training against the MFT meta-order execution agent, the RL market making agent learns to capitalise on the price drift induced by the meta-order. Recent empirical studies have shown that medium-frequency traders are increasingly subject to adverse selection by high-frequency trading agents. As high-frequency trading continues to proliferate across financial markets, the slippage costs incurred by medium-frequency traders are likely to increase over time. However, we do not observe that increased profits for the market making RL agent necessarily cause significantly increased slippages for the MFT agent.
☆ MedM2T: A MultiModal Framework for Time-Aware Modeling with Electronic Health Record and Electrocardiogram Data
The inherent multimodality and heterogeneous temporal structures of medical data pose significant challenges for modeling. We propose MedM2T, a time-aware multimodal framework designed to address these complexities. MedM2T integrates: (i) Sparse Time Series Encoder to flexibly handle irregular and sparse time series, (ii) Hierarchical Time-Aware Fusion to capture both micro- and macro-temporal patterns from multiple dense time series, such as ECGs, and (iii) Bi-Modal Attention to extract cross-modal interactions, which can be extended to any number of modalities. To mitigate granularity gaps between modalities, MedM2T uses modality-specific pre-trained encoders and aligns resulting features within a shared encoder. We evaluated MedM2T on MIMIC-IV and MIMIC-IV-ECG datasets for three tasks that encompass chronic and acute disease dynamics: 90-day cardiovascular disease (CVD) prediction, in-hospital mortality prediction, and ICU length-of-stay (LOS) regression. MedM2T outperformed state-of-the-art multimodal learning frameworks and existing time series models, achieving an AUROC of 0.947 and an AUPRC of 0.706 for CVD prediction; an AUROC of 0.901 and an AUPRC of 0.558 for mortality prediction; and Mean Absolute Error (MAE) of 2.31 for LOS regression. These results highlight the robustness and broad applicability of MedM2T, positioning it as a promising tool in clinical prediction. We provide the implementation of MedM2T at https://github.com/DHLab-TSENG/MedM2T.
comment: This preprint version of the manuscript has been submitted to the IEEE Journal of Biomedical and Health Informatics (JBHI) for review. The implementation of MedM2T is available at https://github.com/DHLab-TSENG/MedM2T
☆ Un-Attributability: Computing Novelty From Retrieval & Semantic Similarity
Understanding how language-model outputs relate to the pretraining corpus is central to studying model behavior. Most training data attribution (TDA) methods ask which training examples causally influence a given output, often using leave-one-out tests. We invert the question: which outputs cannot be attributed to any pretraining example? We introduce un-attributability as an operational measure of semantic novelty: an output is novel if the pretraining corpus contains no semantically similar context. We approximate this with a simple two-stage retrieval pipeline: index the corpus with lightweight GIST embeddings, retrieve the top-n candidates, then rerank with ColBERTv2. If the nearest corpus item is less attributable than a human-generated text reference, we consider the output of the model as novel. We evaluate on SmolLM and SmolLM2 and report three findings: (1) models draw on pretraining data across much longer spans than previously reported; (2) some domains systematically promote or suppress novelty; and (3) instruction tuning not only alters style but also increases novelty. Reframing novelty assessment around un-attributability enables efficient analysis at pretraining scale. We release ~20 TB of corpus chunks and index artifacts to support replication and large-scale extension of our analysis at https://huggingface.co/datasets/stai-tuebingen/faiss-smollm
☆ Binary Anomaly Detection in Streaming IoT Traffic under Concept Drift
With the growing volume of Internet of Things (IoT) network traffic, machine learning (ML)-based anomaly detection is more relevant than ever. Traditional batch learning models face challenges such as high maintenance and poor adaptability to rapid anomaly changes, known as concept drift. In contrast, streaming learning integrates online and incremental learning, enabling seamless updates and concept drift detection to improve robustness. This study investigates anomaly detection in streaming IoT traffic as binary classification, comparing batch and streaming learning approaches while assessing the limitations of current IoT traffic datasets. We simulated heterogeneous network data streams by carefully mixing existing datasets and streaming the samples one by one. Our results highlight the failure of batch models to handle concept drift, but also reveal persisting limitations of current datasets to expose model limitations due to low traffic heterogeneity. We also investigated the competitiveness of tree-based ML algorithms, well-known in batch anomaly detection, and compared it to non-tree-based ones, confirming the advantages of the former. Adaptive Random Forest achieved F1-score of 0.990 $\pm$ 0.006 at one-third the computational cost of its batch counterpart. Hoeffding Adaptive Tree reached F1-score of 0.910 $\pm$ 0.007, reducing computational cost by four times, making it a viable choice for online applications despite a slight trade-off in stability.
comment: 6 pages, 11 figures, 3 tables
☆ Temporal Cardiovascular Dynamics for Improved PPG-Based Heart Rate Estimation
The oscillations of the human heart rate are inherently complex and non-linear -- they are best described by mathematical chaos, and they present a challenge when applied to the practical domain of cardiovascular health monitoring in everyday life. In this work, we study the non-linear chaotic behavior of heart rate through mutual information and introduce a novel approach for enhancing heart rate estimation in real-life conditions. Our proposed approach not only explains and handles the non-linear temporal complexity from a mathematical perspective but also improves the deep learning solutions when combined with them. We validate our proposed method on four established datasets from real-life scenarios and compare its performance with existing algorithms thoroughly with extensive ablation experiments. Our results demonstrate a substantial improvement, up to 40\%, of the proposed approach in estimating heart rate compared to traditional methods and existing machine-learning techniques while reducing the reliance on multiple sensing modalities and eliminating the need for post-processing steps.
comment: ArXiv version of the IEEE JBHI paper (https://ieeexplore.ieee.org/abstract/document/11193709)
☆ Can LLMs Help You at Work? A Sandbox for Evaluating LLM Agents in Enterprise Environments EMNLP 2025
Enterprise systems are crucial for enhancing productivity and decision-making among employees and customers. Integrating LLM based systems into enterprise systems enables intelligent automation, personalized experiences, and efficient information retrieval, driving operational efficiency and strategic growth. However, developing and evaluating such systems is challenging due to the inherent complexity of enterprise environments, where data is fragmented across multiple sources and governed by sophisticated access controls. We present EnterpriseBench, a comprehensive benchmark that simulates enterprise settings, featuring 500 diverse tasks across software engineering, HR, finance, and administrative domains. Our benchmark uniquely captures key enterprise characteristics including data source fragmentation, access control hierarchies, and cross-functional workflows. Additionally, we provide a novel data generation pipeline that creates internally consistent enterprise tasks from organizational metadata. Experiments with state-of-the-art LLM agents demonstrate that even the most capable models achieve only 41.8% task completion, highlighting significant opportunities for improvement in enterprise-focused AI systems.
comment: Accepted at EMNLP 2025 Main Track
☆ HiF-DTA: Hierarchical Feature Learning Network for Drug-Target Affinity Prediction
Accurate prediction of Drug-Target Affinity (DTA) is crucial for reducing experimental costs and accelerating early screening in computational drug discovery. While sequence-based deep learning methods avoid reliance on costly 3D structures, they still overlook simultaneous modeling of global sequence semantic features and local topological structural features within drugs and proteins, and represent drugs as flat sequences without atomic-level, substructural-level, and molecular-level multi-scale features. We propose HiF-DTA, a hierarchical network that adopts a dual-pathway strategy to extract both global sequence semantic and local topological features from drug and protein sequences, and models drugs multi-scale to learn atomic, substructural, and molecular representations fused via a multi-scale bilinear attention module. Experiments on Davis, KIBA, and Metz datasets show HiF-DTA outperforms state-of-the-art baselines, with ablations confirming the importance of global-local extraction and multi-scale fusion.
comment: Accepted by International Conference on Bioinformatics and Biomedicine (BIBM 25)
☆ FOCUS: Efficient Keyframe Selection for Long Video Understanding
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.
☆ Traceable Drug Recommendation over Medical Knowledge Graphs CIKM2025
Drug recommendation (DR) systems aim to support healthcare professionals in selecting appropriate medications based on patients' medical conditions. State-of-the-art approaches utilize deep learning techniques for improving DR, but fall short in providing any insights on the derivation process of recommendations -- a critical limitation in such high-stake applications. We propose TraceDR, a novel DR system operating over a medical knowledge graph (MKG), which ensures access to large-scale and high-quality information. TraceDR simultaneously predicts drug recommendations and related evidence within a multi-task learning framework, enabling traceability of medication recommendations. For covering a more diverse set of diseases and drugs than existing works, we devise a framework for automatically constructing patient health records and release DrugRec, a new large-scale testbed for DR.
comment: Accepted to MediKS@CIKM2025
☆ Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?
Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still suffer from a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have reduced this gap, its underlying causes remain largely unexplored. In this paper, we address this by showing that the multilingual reasoning gap largely stems from failures in language understanding-the model's inability to represent the multilingual input meaning into the dominant language (i.e., English) within its reasoning trace. This motivates us to examine whether understanding failures can be detected, as this ability could help mitigate the multilingual reasoning gap. To this end, we evaluate a range of detection methods and find that understanding failures can indeed be identified, with supervised approaches performing best. Building on this, we propose Selective Translation, a simple yet effective strategy that translates the multilingual input into English only when an understanding failure is detected. Experimental results show that Selective Translation bridges the multilingual reasoning gap, achieving near full-translation performance while using translation for only about 20% of inputs. Together, our work demonstrates that understanding failures are the primary cause of the multilingual reasoning gap and can be detected and selectively mitigated, providing key insight into its origin and a promising path toward more equitable multilingual reasoning. Our code and data are publicly available at https://github.com/deokhk/RLM_analysis.
☆ T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis
In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific characteristics, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T^3), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen-Shannon divergence between the two models' output distributions. T^3 dynamically preserves local precision when models agree and defers to generalist robustness under drift. To overcome the inference costs of sample-wise merging, we further propose a batch-wise extension, T^3_B, that computes a merging coefficient across a batch of samples, dramatically reducing computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruptions across four modalities. Empirically, T^3 sets new state-of-the-art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive MVLM deployment in clinical settings. Our code is available at https://github.com/Razaimam45/TCube.
comment: Main: 11 pages, Supplementary: 9 pages 10 tables, 10 figures
☆ ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction
Recently, there has been gradually more attention paid to Out-of-Distribution (OOD) performance prediction, whose goal is to predict the performance of trained models on unlabeled OOD test datasets, so that we could better leverage and deploy off-the-shelf trained models in risk-sensitive scenarios. Although progress has been made in this area, evaluation protocols in previous literature are inconsistent, and most works cover only a limited number of real-world OOD datasets and types of distribution shifts. To provide convenient and fair comparisons for various algorithms, we propose Out-of-Distribution Performance Prediction Benchmark (ODP-Bench), a comprehensive benchmark that includes most commonly used OOD datasets and existing practical performance prediction algorithms. We provide our trained models as a testbench for future researchers, thus guaranteeing the consistency of comparison and avoiding the burden of repeating the model training process. Furthermore, we also conduct in-depth experimental analyses to better understand their capability boundary.
☆ Higher-order Linear Attention
The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project Page: https://github.com/yifanzhang-pro/HLA.
comment: Project Page: https://github.com/yifanzhang-pro/HLA
☆ ECVL-ROUTER: Scenario-Aware Routing for Vision-Language Models
Vision-Language Models (VLMs) excel in diverse multimodal tasks. However, user requirements vary across scenarios, which can be categorized into fast response, high-quality output, and low energy consumption. Relying solely on large models deployed in the cloud for all queries often leads to high latency and energy cost, while small models deployed on edge devices are capable of handling simpler tasks with low latency and energy cost. To fully leverage the strengths of both large and small models, we propose ECVL-ROUTER, the first scenario-aware routing framework for VLMs. Our approach introduces a new routing strategy and evaluation metrics that dynamically select the appropriate model for each query based on user requirements, maximizing overall utility. We also construct a multimodal response-quality dataset tailored for router training and validate the approach through extensive experiments. Results show that our approach successfully routes over 80\% of queries to the small model while incurring less than 10\% drop in problem solving probability.
comment: 23 pages, 13 figures, 7 tables
☆ Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
Instruction-tuned Large Language Models (LLMs) underperform on low resource, non-Latin scripts due to tokenizer fragmentation and weak cross-lingual coupling. We present LLINK (Latent Language Injection for Non-English Knowledge), a compute efficient language-as-modality method that conditions an instruction-tuned decoder without changing the tokenizer or retraining the decoder. First, we align sentence embeddings from a frozen multilingual encoder to the decoder's latent embedding space at a reserved position via a lightweight contrastive projector. Second, the vector is expanded into K soft slots and trained with minimal adapters so the frozen decoder consumes the signal. LLINK substantially improves bilingual retrieval and achieves 81.3% preference over the base model and 63.6% over direct fine-tuning in LLM-judged Q&A evaluations. We further find that improvements can be attributed to reduced tokenization inflation and a stronger cross lingual alignment, despite the model having residual weaknesses in numeric fidelity. Treating low resource languages as a modality offers a practical path to stronger cross-lingual alignment in lightweight LLMs.
comment: 14 pages, 3 Figures
☆ Not All Instances Are Equally Valuable: Towards Influence-Weighted Dataset Distillation
Dataset distillation condenses large datasets into synthetic subsets, achieving performance comparable to training on the full dataset while substantially reducing storage and computation costs. Most existing dataset distillation methods assume that all real instances contribute equally to the process. In practice, real-world datasets contain both informative and redundant or even harmful instances, and directly distilling the full dataset without considering data quality can degrade model performance. In this work, we present Influence-Weighted Distillation IWD, a principled framework that leverages influence functions to explicitly account for data quality in the distillation process. IWD assigns adaptive weights to each instance based on its estimated impact on the distillation objective, prioritizing beneficial data while downweighting less useful or harmful ones. Owing to its modular design, IWD can be seamlessly integrated into diverse dataset distillation frameworks. Our empirical results suggest that integrating IWD tends to improve the quality of distilled datasets and enhance model performance, with accuracy gains of up to 7.8%.
☆ FedSM: Robust Semantics-Guided Feature Mixup for Bias Reduction in Federated Learning with Long-Tail Data
Federated Learning (FL) enables collaborative model training across decentralized clients without sharing private data. However, FL suffers from biased global models due to non-IID and long-tail data distributions. We propose \textbf{FedSM}, a novel client-centric framework that mitigates this bias through semantics-guided feature mixup and lightweight classifier retraining. FedSM uses a pretrained image-text-aligned model to compute category-level semantic relevance, guiding the category selection of local features to mix-up with global prototypes to generate class-consistent pseudo-features. These features correct classifier bias, especially when data are heavily skewed. To address the concern of potential domain shift between the pretrained model and the data, we propose probabilistic category selection, enhancing feature diversity to effectively mitigate biases. All computations are performed locally, requiring minimal server overhead. Extensive experiments on long-tail datasets with various imbalanced levels demonstrate that FedSM consistently outperforms state-of-the-art methods in accuracy, with high robustness to domain shift and computational efficiency.
☆ Soft Task-Aware Routing of Experts for Equivariant Representation Learning NeurIPS 2025
Equivariant representation learning aims to capture variations induced by input transformations in the representation space, whereas invariant representation learning encodes semantic information by disregarding such transformations. Recent studies have shown that jointly learning both types of representations is often beneficial for downstream tasks, typically by employing separate projection heads. However, this design overlooks information shared between invariant and equivariant learning, which leads to redundant feature learning and inefficient use of model capacity. To address this, we introduce Soft Task-Aware Routing (STAR), a routing strategy for projection heads that models them as experts. STAR induces the experts to specialize in capturing either shared or task-specific information, thereby reducing redundant feature learning. We validate this effect by observing lower canonical correlations between invariant and equivariant embeddings. Experimental results show consistent improvements across diverse transfer learning tasks. The code is available at https://github.com/YonseiML/star.
comment: NeurIPS 2025
☆ Privacy-Aware Continual Self-Supervised Learning on Multi-Window Chest Computed Tomography for Domain-Shift Robustness
We propose a novel continual self-supervised learning (CSSL) framework for simultaneously learning diverse features from multi-window-obtained chest computed tomography (CT) images and ensuring data privacy. Achieving a robust and highly generalizable model in medical image diagnosis is challenging, mainly because of issues, such as the scarcity of large-scale, accurately annotated datasets and domain shifts inherent to dynamic healthcare environments. Specifically, in chest CT, these domain shifts often arise from differences in window settings, which are optimized for distinct clinical purposes. Previous CSSL frameworks often mitigated domain shift by reusing past data, a typically impractical approach owing to privacy constraints. Our approach addresses these challenges by effectively capturing the relationship between previously learned knowledge and new information across different training stages through continual pretraining on unlabeled images. Specifically, by incorporating a latent replay-based mechanism into CSSL, our method mitigates catastrophic forgetting due to domain shifts during continual pretraining while ensuring data privacy. Additionally, we introduce a feature distillation technique that integrates Wasserstein distance-based knowledge distillation (WKD) and batch-knowledge ensemble (BKE), enhancing the ability of the model to learn meaningful, domain-shift-robust representations. Finally, we validate our approach using chest CT images obtained across two different window settings, demonstrating superior performance compared with other approaches.
☆ Feature-Function Curvature Analysis: A Geometric Framework for Explaining Differentiable Models
Explainable AI (XAI) is critical for building trust in complex machine learning models, yet mainstream attribution methods often provide an incomplete, static picture of a model's final state. By collapsing a feature's role into a single score, they are confounded by non-linearity and interactions. To address this, we introduce Feature-Function Curvature Analysis (FFCA), a novel framework that analyzes the geometry of a model's learned function. FFCA produces a 4-dimensional signature for each feature, quantifying its: (1) Impact, (2) Volatility, (3) Non-linearity, and (4) Interaction. Crucially, we extend this framework into Dynamic Archetype Analysis, which tracks the evolution of these signatures throughout the training process. This temporal view moves beyond explaining what a model learned to revealing how it learns. We provide the first direct, empirical evidence of hierarchical learning, showing that models consistently learn simple linear effects before complex interactions. Furthermore, this dynamic analysis provides novel, practical diagnostics for identifying insufficient model capacity and predicting the onset of overfitting. Our comprehensive experiments demonstrate that FFCA, through its static and dynamic components, provides the essential geometric context that transforms model explanation from simple quantification to a nuanced, trustworthy analysis of the entire learning process.
☆ MDAS-GNN: Multi-Dimensional Spatiotemporal GNN with Spatial Diffusion for Urban Traffic Risk Forecasting
Traffic accidents represent a critical public health challenge, claiming over 1.35 million lives annually worldwide. Traditional accident prediction models treat road segments independently, failing to capture complex spatial relationships and temporal dependencies in urban transportation networks. This study develops MDAS-GNN, a Multi-Dimensional Attention-based Spatial-diffusion Graph Neural Network integrating three core risk dimensions: traffic safety, infrastructure, and environmental risk. The framework employs feature-specific spatial diffusion mechanisms and multi-head temporal attention to capture dependencies across different time horizons. Evaluated on UK Department for Transport accident data across Central London, South Manchester, and SE Birmingham, MDASGNN achieves superior performance compared to established baseline methods. The model maintains consistently low prediction errors across short, medium, and long-term periods, with particular strength in long-term forecasting. Ablation studies confirm that integrated multi-dimensional features outperform singlefeature approaches, reducing prediction errors by up to 40%. This framework provides civil engineers and urban planners with advanced predictive capabilities for transportation infrastructure design, enabling data-driven decisions for road network optimization, infrastructure resource improvements, and strategic safety interventions in urban development projects.
☆ Sparse Model Inversion: Efficient Inversion of Vision Transformers for Data-Free Applications
Model inversion, which aims to reconstruct the original training data from pre-trained discriminative models, is especially useful when the original training data is unavailable due to privacy, usage rights, or size constraints. However, existing dense inversion methods attempt to reconstruct the entire image area, making them extremely inefficient when inverting high-resolution images from large-scale Vision Transformers (ViTs). We further identify two underlying causes of this inefficiency: the redundant inversion of noisy backgrounds and the unintended inversion of spurious correlations--a phenomenon we term "hallucination" in model inversion. To address these limitations, we propose a novel sparse model inversion strategy, as a plug-and-play extension to speed up existing dense inversion methods with no need for modifying their original loss functions. Specifically, we selectively invert semantic foregrounds while stopping the inversion of noisy backgrounds and potential spurious correlations. Through both theoretical and empirical studies, we validate the efficacy of our approach in achieving significant inversion acceleration (up to 3.79 faster) while maintaining comparable or even enhanced downstream performance in data-free model quantization and data-free knowledge transfer. Code is available at https://github.com/Egg-Hu/SMI.
☆ SERFLOW: A Cross-Service Cost Optimization Framework for SLO-Aware Dynamic ML Inference
Dynamic offloading of Machine Learning (ML) model partitions across different resource orchestration services, such as Function-as-a-Service (FaaS) and Infrastructure-as-a-Service (IaaS), can balance processing and transmission delays while minimizing costs of adaptive inference applications. However, prior work often overlooks real-world factors, such as Virtual Machine (VM) cold starts, requests under long-tail service time distributions, etc. To tackle these limitations, we model each ML query (request) as traversing an acyclic sequence of stages, wherein each stage constitutes a contiguous block of sparse model parameters ending in an internal or final classifier where requests may exit. Since input-dependent exit rates vary, no single resource configuration suits all query distributions. IaaS-based VMs become underutilized when many requests exit early, yet rapidly scaling to handle request bursts reaching deep layers is impractical. SERFLOW addresses this challenge by leveraging FaaS-based serverless functions (containers) and using stage-specific resource provisioning that accounts for the fraction of requests exiting at each stage. By integrating this provisioning with adaptive load balancing across VMs and serverless functions based on request ingestion, SERFLOW reduces cloud costs by over $23\%$ while efficiently adapting to dynamic workloads.
☆ A Polynomial-time Algorithm for Online Sparse Linear Regression with Improved Regret Bound under Weaker Conditions COLT 2025
In this paper, we study the problem of online sparse linear regression (OSLR) where the algorithms are restricted to accessing only $k$ out of $d$ attributes per instance for prediction, which was proved to be NP-hard. Previous work gave polynomial-time algorithms assuming the data matrix satisfies the linear independence of features, the compatibility condition, or the restricted isometry property. We introduce a new polynomial-time algorithm, which significantly improves previous regret bounds (Ito et al., 2017) under the compatibility condition that is weaker than the other two assumptions. The improvements benefit from a tighter convergence rate of the $\ell_1$-norm error of our estimators. Our algorithm leverages the well-studied Dantzig Selector, but importantly with several novel techniques, including an algorithm-dependent sampling scheme for estimating the covariance matrix, an adaptive parameter tuning scheme, and a batching online Newton step with careful initializations. We also give novel and non-trivial analyses, including an induction method for analyzing the $\ell_1$-norm error, careful analyses on the covariance of non-independent random variables, and a decomposition on the regret. We further extend our algorithm to OSLR with additional observations where the algorithms can observe additional $k_0$ attributes after each prediction, and improve previous regret bounds (Kale et al., 2017; Ito et al., 2017).
comment: A minor algorithmic error in our paper presented on COLT 2025 has been corrected in this arXiv update. We also have updated the pseudo-code of the algorithm. Our theoretical analyses, as well as all theoretical bounds, remain unaffected by those changes
☆ FMint-SDE: A Multimodal Foundation Model for Accelerating Numerical Simulation of SDEs via Error Correction
Fast and accurate simulation of dynamical systems is a fundamental challenge across scientific and engineering domains. Traditional numerical integrators often face a trade-off between accuracy and computational efficiency, while existing neural network-based approaches typically require training a separate model for each case. To overcome these limitations, we introduce a novel multi-modal foundation model for large-scale simulations of differential equations: FMint-SDE (Foundation Model based on Initialization for stochastic differential equations). Based on a decoder-only transformer with in-context learning, FMint-SDE leverages numerical and textual modalities to learn a universal error-correction scheme. It is trained using prompted sequences of coarse solutions generated by conventional solvers, enabling broad generalization across diverse systems. We evaluate our models on a suite of challenging SDE benchmarks spanning applications in molecular dynamics, mechanical systems, finance, and biology. Experimental results show that our approach achieves a superior accuracy-efficiency tradeoff compared to classical solvers, underscoring the potential of FMint-SDE as a general-purpose simulation tool for dynamical systems.
☆ Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler
Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Existing defense strategies preemptively build robustness via attack simulation but suffer from fundamental limitations: (i) the infeasibility of extending attack simulations beyond bounded threat models due to the inherent difficulty of anticipating unknown attacks, and (ii) limited adaptability to varying attack settings, as simulation fails to capture their variability and complexity. To address these challenges, we propose Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense strategy with no need for attack simulation. BDS formulates harmful fine-tuning defense as a Bayesian inference problem, learning the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets. The fine-tuning process is then constrained by weighting data with their safety attributes sampled from the posterior, thus mitigating the influence of harmful data. By leveraging the post hoc nature of Bayesian inference, the posterior is conditioned on the fine-tuning dataset, enabling BDS to tailor its defense to the specific dataset, thereby achieving adaptive defense. Furthermore, we introduce a neural scheduler based on amortized Bayesian learning, enabling efficient transfer to new data without retraining. Comprehensive results across diverse attack and defense settings demonstrate the state-of-the-art performance of our approach. Code is available at https://github.com/Egg-Hu/Bayesian-Data-Scheduler.
☆ Exploring Landscapes for Better Minima along Valleys
Finding lower and better-generalizing minima is crucial for deep learning. However, most existing optimizers stop searching the parameter space once they reach a local minimum. Given the complex geometric properties of the loss landscape, it is difficult to guarantee that such a point is the lowest or provides the best generalization. To address this, we propose an adaptor "E" for gradient-based optimizers. The adapted optimizer tends to continue exploring along landscape valleys (areas with low and nearly identical losses) in order to search for potentially better local minima even after reaching a local minimum. This approach increases the likelihood of finding a lower and flatter local minimum, which is often associated with better generalization. We also provide a proof of convergence for the adapted optimizers in both convex and non-convex scenarios for completeness. Finally, we demonstrate their effectiveness in an important but notoriously difficult training scenario, large-batch training, where Lamb is the benchmark optimizer. Our testing results show that the adapted Lamb, ALTO, increases the test accuracy (generalization) of the current state-of-the-art optimizer by an average of 2.5% across a variety of large-batch training tasks. This work potentially opens a new research direction in the design of optimization algorithms.
comment: Neurips 2025 poster
☆ SERVIMON: AI-Driven Predictive Maintenance and Real-Time Monitoring for Astronomical Observatories
Objective: ServiMon is designed to offer a scalable and intelligent pipeline for data collection and auditing to monitor distributed astronomical systems such as the ASTRI Mini-Array. The system enhances quality control, predictive maintenance, and real-time anomaly detection for telescope operations. Methods: ServiMon integrates cloud-native technologies-including Prometheus, Grafana, Cassandra, Kafka, and InfluxDB-for telemetry collection and processing. It employs machine learning algorithms, notably Isolation Forest, to detect anomalies in Cassandra performance metrics. Key indicators such as read/write latency, throughput, and memory usage are continuously monitored, stored as time-series data, and preprocessed for feature engineering. Anomalies detected by the model are logged in InfluxDB v2 and accessed via Flux for real-time monitoring and visualization. Results: AI-based anomaly detection increases system resilience by identifying performance degradation at an early stage, minimizing downtime, and optimizing telescope operations. Additionally, ServiMon supports astrostatistical analysis by correlating telemetry with observational data, thus enhancing scientific data quality. AI-generated alerts also improve real-time monitoring, enabling proactive system management. Conclusion: ServiMon's scalable framework proves effective for predictive maintenance and real-time monitoring of astronomical infrastructures. By leveraging cloud and edge computing, it is adaptable to future large-scale experiments, optimizing both performance and cost. The combination of machine learning and big data analytics makes ServiMon a robust and flexible solution for modern and next-generation observational astronomy.
comment: Accepted for publication in IAU Symposium 397: Exploring the Universe with Artificial Intelligence (UniversAI 2025), Cambridge University Press. Editors: C. Sterken, J. Hearnshaw & D. Valls-Gabaud
☆ Relation-Aware Bayesian Optimization of DBMS Configurations Guided by Affinity Scores
Database Management Systems (DBMSs) are fundamental for managing large-scale and heterogeneous data, and their performance is critically influenced by configuration parameters. Effective tuning of these parameters is essential for adapting to diverse workloads and maximizing throughput while minimizing latency. Recent research has focused on automated configuration optimization using machine learning; however, existing approaches still exhibit several key limitations. Most tuning frameworks disregard the dependencies among parameters, assuming that each operates independently. This simplification prevents optimizers from leveraging relational effects across parameters, limiting their capacity to capture performancesensitive interactions. Moreover, to reduce the complexity of the high-dimensional search space, prior work often selects only the top few parameters for optimization, overlooking others that contribute meaningfully to performance. Bayesian Optimization (BO), the most common method for automatic tuning, is also constrained by its reliance on surrogate models, which can lead to unstable predictions and inefficient exploration. To overcome these limitations, we propose RelTune, a novel framework that represents parameter dependencies as a Relational Graph and learns GNN-based latent embeddings that encode performancerelevant semantics. RelTune further introduces Hybrid-Score-Guided Bayesian Optimization (HBO), which combines surrogate predictions with an Affinity Score measuring proximity to previously high-performing configurations. Experimental results on multiple DBMSs and workloads demonstrate that RelTune achieves faster convergence and higher optimization efficiency than conventional BO-based methods, achieving state-of-the-art performance across all evaluated scenarios.
comment: 13 pages
☆ FairAD: Computationally Efficient Fair Graph Clustering via Algebraic Distance CIKM 2025
Due to the growing concern about unsavory behaviors of machine learning models toward certain demographic groups, the notion of 'fairness' has recently drawn much attention from the community, thereby motivating the study of fairness in graph clustering. Fair graph clustering aims to partition the set of nodes in a graph into $k$ disjoint clusters such that the proportion of each protected group within each cluster is consistent with the proportion of that group in the entire dataset. It is, however, computationally challenging to incorporate fairness constraints into existing graph clustering algorithms, particularly for large graphs. To address this problem, we propose FairAD, a computationally efficient fair graph clustering method. It first constructs a new affinity matrix based on the notion of algebraic distance such that fairness constraints are imposed. A graph coarsening process is then performed on this affinity matrix to find representative nodes that correspond to $k$ clusters. Finally, a constrained minimization problem is solved to obtain the solution of fair clustering. Experiment results on the modified stochastic block model and six public datasets show that FairAD can achieve fair clustering while being up to 40 times faster compared to state-of-the-art fair graph clustering algorithms.
comment: ACM CIKM 2025
☆ Exploring the Utilities of the Rationales from Large Language Models to Enhance Automated Essay Scoring
This study explored the utilities of rationales generated by GPT-4.1 and GPT-5 in automated scoring using Prompt 6 essays from the 2012 Kaggle ASAP data. Essay-based scoring was compared with rationale-based scoring. The study found in general essay-based scoring performed better than rationale-based scoring with higher Quadratic Weighted Kappa (QWK). However, rationale-based scoring led to higher scoring accuracy in terms of F1 scores for score 0 which had less representation due to class imbalance issues. The ensemble modeling of essay-based scoring models increased the scoring accuracy at both specific score levels and across all score levels. The ensemble modeling of essay-based scoring and each of the rationale-based scoring performed about the same. Further ensemble of essay-based scoring and both rationale-based scoring yielded the best scoring accuracy with QWK of 0.870 compared with 0.848 reported in literature.
comment: 12 pages, 3 figures
☆ AI Agents in Drug Discovery
Artificial intelligence (AI) agents are emerging as transformative tools in drug discovery, with the ability to autonomously reason, act, and learn through complicated research workflows. Building on large language models (LLMs) coupled with perception, computation, action, and memory tools, these agentic AI systems could integrate diverse biomedical data, execute tasks, carry out experiments via robotic platforms, and iteratively refine hypotheses in closed loops. We provide a conceptual and technical overview of agentic AI architectures, ranging from ReAct and Reflection to Supervisor and Swarm systems, and illustrate their applications across key stages of drug discovery, including literature synthesis, toxicity prediction, automated protocol generation, small-molecule synthesis, drug repurposing, and end-to-end decision-making. To our knowledge, this represents the first comprehensive work to present real-world implementations and quantifiable impacts of agentic AI systems deployed in operational drug discovery settings. Early implementations demonstrate substantial gains in speed, reproducibility, and scalability, compressing workflows that once took months into hours while maintaining scientific traceability. We discuss the current challenges related to data heterogeneity, system reliability, privacy, and benchmarking, and outline future directions towards technology in support of science and translation.
comment: 45 pages, 12 figures
☆ AURA: A Reinforcement Learning Framework for AI-Driven Adaptive Conversational Surveys
Conventional online surveys provide limited personalization, often resulting in low engagement and superficial responses. Although AI survey chatbots improve convenience, most are still reactive: they rely on fixed dialogue trees or static prompt templates and therefore cannot adapt within a session to fit individual users, which leads to generic follow-ups and weak response quality. We address these limitations with AURA (Adaptive Understanding through Reinforcement Learning for Assessment), a reinforcement learning framework for AI-driven adaptive conversational surveys. AURA quantifies response quality using a four-dimensional LSDE metric (Length, Self-disclosure, Emotion, and Specificity) and selects follow-up question types via an epsilon-greedy policy that updates the expected quality gain within each session. Initialized with priors extracted from 96 prior campus-climate conversations (467 total chatbot-user exchanges), the system balances exploration and exploitation across 10-15 dialogue exchanges, dynamically adapting to individual participants in real time. In controlled evaluations, AURA achieved a +0.12 mean gain in response quality and a statistically significant improvement over non-adaptive baselines (p=0.044, d=0.66), driven by a 63% reduction in specification prompts and a 10x increase in validation behavior. These results demonstrate that reinforcement learning can give survey chatbots improved adaptivity, transforming static questionnaires into interactive, self-improving assessment systems.
☆ Group-Sensitive Offline Contextual Bandits
Offline contextual bandits allow one to learn policies from historical/offline data without requiring online interaction. However, offline policy optimization that maximizes overall expected rewards can unintentionally amplify the reward disparities across groups. As a result, some groups might benefit more than others from the learned policy, raising concerns about fairness, especially when the resources are limited. In this paper, we study a group-sensitive fairness constraint in offline contextual bandits, reducing group-wise reward disparities that may arise during policy learning. We tackle the following common-parity requirements: the reward disparity is constrained within some user-defined threshold or the reward disparity should be minimized during policy optimization. We propose a constrained offline policy optimization framework by introducing group-wise reward disparity constraints into an off-policy gradient-based optimization procedure. To improve the estimation of the group-wise reward disparity during training, we employ a doubly robust estimator and further provide a convergence guarantee for policy optimization. Empirical results in synthetic and real-world datasets demonstrate that our method effectively reduces reward disparities while maintaining competitive overall performance.
☆ Learning Generalizable Visuomotor Policy through Dynamics-Alignment
Behavior cloning methods for robot learning suffer from poor generalization due to limited data support beyond expert demonstrations. Recent approaches leveraging video prediction models have shown promising results by learning rich spatiotemporal representations from large-scale datasets. However, these models learn action-agnostic dynamics that cannot distinguish between different control inputs, limiting their utility for precise manipulation tasks and requiring large pretraining datasets. We propose a Dynamics-Aligned Flow Matching Policy (DAP) that integrates dynamics prediction into policy learning. Our method introduces a novel architecture where policy and dynamics models provide mutual corrective feedback during action generation, enabling self-correction and improved generalization. Empirical validation demonstrates generalization performance superior to baseline methods on real-world robotic manipulation tasks, showing particular robustness in OOD scenarios including visual distractions and lighting variations.
comment: 9 pages, 6 figures
☆ Hierarchical Bayesian Model for Gene Deconvolution and Functional Analysis in Human Endometrium Across the Menstrual Cycle
Bulk tissue RNA sequencing of heterogeneous samples provides averaged gene expression profiles, obscuring cell type-specific dynamics. To address this, we present a probabilistic hierarchical Bayesian model that deconvolves bulk RNA-seq data into constituent cell-type expression profiles and proportions, leveraging a high-resolution single-cell reference. We apply our model to human endometrial tissue across the menstrual cycle, a context characterized by dramatic hormone-driven cellular composition changes. Our extended framework provides a principled inference of cell type proportions and cell-specific gene expression changes across cycle phases. We demonstrate the model's structure, priors, and inference strategy in detail, and we validate its performance with simulations and comparisons to existing methods. The results reveal dynamic shifts in epithelial, stromal, and immune cell fractions between menstrual phases, and identify cell-type-specific differential gene expression associated with endometrial function (e.g., decidualization markers in stromal cells during the secretory phase). We further conduct robustness tests and show that our Bayesian approach is resilient to reference mismatches and noise. Finally, we discuss the biological significance of our findings, potential clinical implications for fertility and endometrial disorders, and future directions, including integration of spatial transcriptomics.
☆ QiNN-QJ: A Quantum-inspired Neural Network with Quantum Jump for Multimodal Sentiment Analysis
Quantum theory provides non-classical principles, such as superposition and entanglement, that inspires promising paradigms in machine learning. However, most existing quantum-inspired fusion models rely solely on unitary or unitary-like transformations to generate quantum entanglement. While theoretically expressive, such approaches often suffer from training instability and limited generalizability. In this work, we propose a Quantum-inspired Neural Network with Quantum Jump (QiNN-QJ) for multimodal entanglement modelling. Each modality is firstly encoded as a quantum pure state, after which a differentiable module simulating the QJ operator transforms the separable product state into the entangled representation. By jointly learning Hamiltonian and Lindblad operators, QiNN-QJ generates controllable cross-modal entanglement among modalities with dissipative dynamics, where structured stochasticity and steady-state attractor properties serve to stabilize training and constrain entanglement shaping. The resulting entangled states are projected onto trainable measurement vectors to produce predictions. In addition to achieving superior performance over the state-of-the-art models on benchmark datasets, including CMU-MOSI, CMU-MOSEI, and CH-SIMS, QiNN-QJ facilitates enhanced post-hoc interpretability through von-Neumann entanglement entropy. This work establishes a principled framework for entangled multimodal fusion and paves the way for quantum-inspired approaches in modelling complex cross-modal correlations.
☆ Functional embeddings enable Aggregation of multi-area SEEG recordings over subjects and sessions ICLR 2026
Aggregating intracranial recordings across subjects is challenging since electrode count, placement, and covered regions vary widely. Spatial normalization methods like MNI coordinates offer a shared anatomical reference, but often fail to capture true functional similarity, particularly when localization is imprecise; even at matched anatomical coordinates, the targeted brain region and underlying neural dynamics can differ substantially between individuals. We propose a scalable representation-learning framework that (i) learns a subject-agnostic functional identity for each electrode from multi-region local field potentials using a Siamese encoder with contrastive objectives, inducing an embedding geometry that is locality-sensitive to region-specific neural signatures, and (ii) tokenizes these embeddings for a transformer that models inter-regional relationships with a variable number of channels. We evaluate this framework on a 20-subject dataset spanning basal ganglia-thalamic regions collected during flexible rest/movement recording sessions with heterogeneous electrode layouts. The learned functional space supports accurate within-subject discrimination and forms clear, region-consistent clusters; it transfers zero-shot to unseen channels. The transformer, operating on functional tokens without subject-specific heads or supervision, captures cross-region dependencies and enables reconstruction of masked channels, providing a subject-agnostic backbone for downstream decoding. Together, these results indicate a path toward large-scale, cross-subject aggregation and pretraining for intracranial neural data where strict task structure and uniform sensor placement are unavailable.
comment: Submitted to ICLR 2026
☆ Towards Understanding Self-play for LLM Reasoning
Recent advances in large language model (LLM) reasoning, led by reinforcement learning with verifiable rewards (RLVR), have inspired self-play post-training, where models improve by generating and solving their own problems. While self-play has shown strong in-domain and out-of-domain gains, the mechanisms behind these improvements remain poorly understood. In this work, we analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner, comparing it against RLVR and supervised fine-tuning (SFT). Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions. We further connect these dynamics to reasoning performance using pass@k evaluations. Together, our findings clarify how self-play differs from other post-training strategies, highlight its inherent limitations, and point toward future directions for improving LLM math reasoning through self-play.
☆ MLPerf Automotive
We present MLPerf Automotive, the first standardized public benchmark for evaluating Machine Learning systems that are deployed for AI acceleration in automotive systems. Developed through a collaborative partnership between MLCommons and the Autonomous Vehicle Computing Consortium, this benchmark addresses the need for standardized performance evaluation methodologies in automotive machine learning systems. Existing benchmark suites cannot be utilized for these systems since automotive workloads have unique constraints including safety and real-time processing that distinguish them from the domains that previously introduced benchmarks target. Our benchmarking framework provides latency and accuracy metrics along with evaluation protocols that enable consistent and reproducible performance comparisons across different hardware platforms and software implementations. The first iteration of the benchmark consists of automotive perception tasks in 2D object detection, 2D semantic segmentation, and 3D object detection. We describe the methodology behind the benchmark design including the task selection, reference models, and submission rules. We also discuss the first round of benchmark submissions and the challenges involved in acquiring the datasets and the engineering efforts to develop the reference implementations. Our benchmark code is available at https://github.com/mlcommons/mlperf_automotive.
comment: 16 pages, 5 figures, 6 tables
☆ Towards a Measure of Algorithm Similarity
Given two algorithms for the same problem, can we determine whether they are meaningfully different? In full generality, the question is uncomputable, and empirically it is muddied by competing notions of similarity. Yet, in many applications (such as clone detection or program synthesis) a pragmatic and consistent similarity metric is necessary. We review existing equivalence and similarity notions and introduce EMOC: An Evaluation-Memory-Operations-Complexity framework that embeds algorithm implementations into a feature space suitable for downstream tasks. We compile PACD, a curated dataset of verified Python implementations across three problems, and show that EMOC features support clustering and classification of algorithm types, detection of near-duplicates, and quantification of diversity in LLM-generated programs. Code, data, and utilities for computing EMOC embeddings are released to facilitate reproducibility and future work on algorithm similarity.
comment: 11 pages, many figures and images
☆ Consistency Training Helps Stop Sycophancy and Jailbreaks
An LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within special text (jailbreaking). We explore \emph{consistency training}, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Instead of teaching the model what exact response to give on a particular prompt, we aim to teach the model to behave identically across prompt data augmentations (like adding leading questions or jailbreak text). We try enforcing this invariance in two ways: over the model's external outputs (\emph{Bias-augmented Consistency Training} (BCT) from Chua et al. [2025]) and over its internal activations (\emph{Activation Consistency Training} (ACT), a method we introduce). Both methods reduce Gemini 2.5 Flash's susceptibility to irrelevant cues. Because consistency training uses responses from the model itself as training data, it avoids issues that arise from stale training data, such as degrading model capabilities or enforcing outdated response guidelines. While BCT and ACT reduce sycophancy equally well, BCT does better at jailbreak reduction. We think that BCT can simplify training pipelines by removing reliance on static datasets. We argue that some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.
comment: 19 pages
☆ Dark-Field X-Ray Imaging Significantly Improves Deep-Learning based Detection of Synthetic Early-Stage Lung Tumors in Preclinical Models
Low-dose computed tomography (LDCT) is the current standard for lung cancer screening, yet its adoption and accessibility remain limited. Many regions lack LDCT infrastructure, and even among those screened, early-stage cancer detection often yield false positives, as shown in the National Lung Screening Trial (NLST) with a sensitivity of 93.8 percent and a false-positive rate of 26.6 percent. We aim to investigate whether X-ray dark-field imaging (DFI) radiograph, a technique sensitive to small-angle scatter from alveolar microstructure and less susceptible to organ shadowing, can significantly improve early-stage lung tumor detection when coupled with deep-learning segmentation. Using paired attenuation (ATTN) and DFI radiograph images of euthanized mouse lungs, we generated realistic synthetic tumors with irregular boundaries and intensity profiles consistent with physical lung contrast. A U-Net segmentation network was trained on small patches using either ATTN, DFI, or a combination of ATTN and DFI channels. Results show that the DFI-only model achieved a true-positive detection rate of 83.7 percent, compared with 51 percent for ATTN-only, while maintaining comparable specificity (90.5 versus 92.9 percent). The combined ATTN and DFI input achieved 79.6 percent sensitivity and 97.6 percent specificity. In conclusion, DFI substantially improves early-tumor detectability in comparison to standard attenuation radiography and shows potential as an accessible, low-cost, low-dose alternative for pre-clinical or limited-resource screening where LDCT is unavailable.
♻ ☆ On Uncertainty Calibration for Equivariant Functions
Data-sparse settings such as robotic manipulation, molecular physics, and galaxy morphology classification are some of the hardest domains for deep learning. For these problems, equivariant networks can help improve modeling across undersampled parts of the input space, and uncertainty estimation can guard against overconfidence. However, until now, the relationships between equivariance and model confidence, and more generally equivariance and model calibration, has yet to be studied. Since traditional classification and regression error terms show up in the definitions of calibration error, it is natural to suspect that previous work can be used to help understand the relationship between equivariance and calibration error. In this work, we present a theory relating equivariance to uncertainty estimation. By proving lower and upper bounds on uncertainty calibration errors (ECE and ENCE) under various equivariance conditions, we elucidate the generalization limits of equivariant models and illustrate how symmetry mismatch can result in miscalibration in both classification and regression. We complement our theoretical framework with numerical experiments that clarify the relationship between equivariance and uncertainty using a variety of real and simulated datasets, and we comment on trends with symmetry mismatch, group size, and aleatoric and epistemic uncertainties.
comment: Under review at Transactions on Machine Learning Research (TMLR). Code is available at https://github.com/EdwardBerman/EquiUQ . Excited to share this paper, comments welcome :D. This version fixed a couple typos in the previous upload
♻ ☆ On the limitation of evaluating machine unlearning using only a single training seed
Machine unlearning (MU) aims to remove the influence of certain data points from a trained model without costly retraining. Most practical MU algorithms are only approximate and their performance can only be assessed empirically. Care must therefore be taken to make empirical comparisons as representative as possible. A common practice is to run the MU algorithm multiple times independently starting from the same trained model. In this work, we demonstrate that this practice can give highly non-representative results because -- even for the same architecture and same dataset -- some MU methods can be highly sensitive to the choice of random number seed used for model training. We therefore recommend that empirical comparisons of MU algorithms should also reflect the variability across different model training seeds.
comment: mini paper, 2 figures
♻ ☆ Conformal Object Detection by Sequential Risk Control
Recent advances in object detectors have led to their adoption for industrial uses. However, their deployment in safety-critical applications is hindered by the inherent lack of reliability of neural networks and the complex structure of object detection models. To address these challenges, we turn to Conformal Prediction, a post-hoc predictive uncertainty quantification procedure with statistical guarantees that are valid for any dataset size, without requiring prior knowledge on the model or data distribution. Our contribution is manifold. First, we formally define the problem of Conformal Object Detection (COD). We introduce a novel method, Sequential Conformal Risk Control (SeqCRC), that extends the statistical guarantees of Conformal Risk Control to two sequential tasks with two parameters, as required in the COD setting. Then, we present old and new loss functions and prediction sets suited to applying SeqCRC to different cases and certification requirements. Finally, we present a conformal toolkit for replication and further exploration of our method. Using this toolkit, we perform extensive experiments that validate our approach and emphasize trade-offs and other practical consequences.
comment: 29 pages, 12 figures
♻ ☆ Kernel conditional tests from learning-theoretic bounds NeurIPS 2025
We propose a framework for hypothesis testing on conditional probability distributions, which we then use to construct statistical tests of functionals of conditional distributions. These tests identify the inputs where the functionals differ with high probability, and include tests of conditional moments or two-sample tests. Our key idea is to transform confidence bounds of a learning method into a test of conditional expectations. We instantiate this principle for kernel ridge regression (KRR) with subgaussian noise. An intermediate data embedding then enables more general tests -- including conditional two-sample tests -- via kernel mean embeddings of distributions. To have guarantees in this setting, we generalize existing pointwise-in-time or time-uniform confidence bounds for KRR to previously-inaccessible yet essential cases such as infinite-dimensional outputs with non-trace-class kernels. These bounds also circumvent the need for independent data, allowing for instance online sampling. To make our tests readily applicable in practice, we introduce bootstrapping schemes leveraging the parametric form of testing thresholds identified in theory to avoid tuning inaccessible parameters. We illustrate the tests on examples, including one in process monitoring and comparison of dynamical systems. Overall, our results establish a comprehensive foundation for conditional testing on functionals, from theoretical guarantees to an algorithmic implementation, and advance the state of the art on confidence bounds for vector-valued least squares estimation.
comment: 46 pages, 8 figures, 9 tables. Accepted at NeurIPS 2025; to appear in the proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems. Reviews and discussion: https://openreview.net/forum?id=hJKDwf32Xu
♻ ☆ TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting
Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on challenging benchmarks. This paper presents TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, eliminating the need for windowing or summarization techniques while maintaining robust temporal state-tracking. Our comprehensive synthetic data pipeline unifies diverse generators, including stochastic differential equations, Gaussian processes, and audio synthesis, with novel augmentations. In zero-shot evaluations on the Gift-Eval benchmark, TempoPFN achieves top-tier competitive performance, outperforming all existing synthetic-only approaches and surpassing the vast majority of models trained on real-world data, while being more efficient than existing baselines by leveraging fully parallelizable training and inference. We open-source our complete data generation pipeline and training code, providing a reproducible foundation for future research.
comment: 30 pages, 18 figures, 13 tables
♻ ☆ Generative Adversarial Networks for High-Dimensional Item Factor Analysis: A Deep Adversarial Learning Algorithm
Advances in deep learning and representation learning have transformed item factor analysis (IFA) in the item response theory (IRT) literature by enabling more efficient and accurate parameter estimation. Variational Autoencoders (VAEs) have been one of the most impactful techniques in modeling high-dimensional latent variables in this context. However, the limited expressiveness of the inference model based on traditional VAEs can still hinder the estimation performance. We introduce Adversarial Variational Bayes (AVB) algorithms as an improvement to VAEs for IFA with improved flexibility and accuracy. By bridging the strengths of VAEs and Generative Adversarial Networks (GANs), AVB incorporates an auxiliary discriminator network to reframe the estimation process as a two-player adversarial game and removes the restrictive assumption of standard normal distributions in the inference model. Theoretically, AVB can achieve similar or higher likelihood compared to VAEs. A further enhanced algorithm, Importance-weighted Adversarial Variational Bayes (IWAVB) is proposed and compared with Importance-weighted Autoencoders (IWAE). In an exploratory analysis of empirical data, IWAVB demonstrated superior expressiveness by achieving a higher likelihood compared to IWAE. In confirmatory analysis with simulated data, IWAVB achieved similar mean-square error results to IWAE while consistently achieving higher likelihoods. When latent variables followed a multimodal distribution, IWAVB outperformed IWAE. With its innovative use of GANs, IWAVB is shown to have the potential to extend IFA to handle large-scale data, facilitating the potential integration of psychometrics and multimodal data analysis.
♻ ☆ Non-Convex Over-the-Air Heterogeneous Federated Learning: A Bias-Variance Trade-off
Over-the-air (OTA) federated learning (FL) has been well recognized as a scalable paradigm that exploits the waveform superposition of the wireless multiple-access channel to aggregate model updates in a single use. Existing OTA-FL designs largely enforce zero-bias model updates by either assuming \emph{homogeneous} wireless conditions (equal path loss across devices) or forcing zero-bias updates to guarantee convergence. Under \emph{heterogeneous} wireless scenarios, however, such designs are constrained by the weakest device and inflate the update variance. Moreover, prior analyses of biased OTA-FL largely address convex objectives, while most modern AI models are highly non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient descent (SGD) for general smooth non-convex objectives under wireless heterogeneity. We develop novel OTA-FL SGD updates that allow a structured, time-invariant model bias while facilitating reduced variance updates. We derive a finite-time stationarity bound (expected time average squared gradient norm) that explicitly reveals a bias-variance trade-off. To optimize this trade-off, we pose a non-convex joint OTA power-control design and develop an efficient successive convex approximation (SCA) algorithm that requires only statistical CSI at the base station. Experiments on a non-convex image classification task validate the approach: the SCA-based design accelerates convergence via an optimized bias and improves generalization over prior OTA-FL baselines.
♻ ☆ UdonCare: Hierarchy Pruning for Unseen Domain Discovery in Predictive Healthcare
Healthcare providers often divide patient populations into cohorts based on shared clinical factors, such as medical history, to deliver personalized healthcare services. This idea has also been adopted in clinical prediction models, where it presents a vital challenge: capturing both global and cohort-specific patterns while enabling model generalization to unseen domains. Addressing this challenge falls under the scope of domain generalization (DG). However, conventional DG approaches often struggle in clinical settings due to the absence of explicit domain labels and the inherent gap in medical knowledge. To address this, we propose UdonCare, a hierarchy-guided method that iteratively divides patients into latent domains and decomposes domain-invariant (label) information from patient data. Our method identifies patient domains by pruning medical ontologies (e.g. ICD-9-CM hierarchy). On two public datasets, MIMIC-III and MIMIC-IV, UdonCare shows superiority over eight baselines across four clinical prediction tasks with substantial domain gaps, highlighting the untapped potential of medical knowledge in guiding clinical domain generalization problems.
♻ ☆ I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models NeurIPS 2025
We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity, attribute range, and introducing perceptual uncertainty. Compared to LLMs, empirical results show that LRMs achieve improved productivity and systematicity on longer reasoning relations and wider attribute ranges, respectively. However, LRMs are still significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.
comment: Accepted at the 5th Workshop on Mathematical Reasoning and AI (MATH-AI), NeurIPS 2025
♻ ☆ Discrete Diffusion Models: Novel Analysis and New Sampler Guarantees
Discrete diffusion models have recently gained significant prominence in applications involving natural language and graph data. A key factor influencing their effectiveness is the efficiency of discretized samplers. Among these, $\tau$-leaping samplers have become particularly popular due to their theoretical and empirical success. However, existing theoretical analyses of $\tau$-leaping often rely on somewhat restrictive and difficult-to-verify regularity assumptions, and their convergence bounds contain quadratic dependence on the vocabulary size. In this work, we introduce a new analytical approach for discrete diffusion models that removes the need for such assumptions. For the standard $\tau$-leaping method, we establish convergence guarantees in KL divergence that scale linearly with vocabulary size, improving upon prior results with quadratic dependence. Our approach is also more broadly applicable: it provides the first convergence guarantees for other widely used samplers, including the Euler method and Tweedie $\tau$-leaping. Central to our approach is a novel technique based on differential inequalities, offering a more flexible alternative to the traditional Girsanov change-of-measure methods. This technique may also be of independent interest for the analysis of other stochastic processes.
♻ ☆ Absorb and Converge: Provable Convergence Guarantee for Absorbing Discrete Diffusion Models
Discrete state space diffusion models have shown significant advantages in applications involving discrete data, such as text and image generation. It has also been observed that their performance is highly sensitive to the choice of rate matrices, particularly between uniform and absorbing rate matrices. While empirical results suggest that absorbing rate matrices often yield better generation quality compared to uniform rate matrices, existing theoretical works have largely focused on the uniform rate matrices case. Notably, convergence guarantees and error analyses for absorbing diffusion models are still missing. In this work, we provide the first finite-time error bounds and convergence rate analysis for discrete diffusion models using absorbing rate matrices. We begin by deriving an upper bound on the KL divergence of the forward process, introducing a surrogate initialization distribution to address the challenge posed by the absorbing stationary distribution, which is a singleton and causes the KL divergence to be ill-defined. We then establish the first convergence guarantees for both the $\tau$-leaping and uniformization samplers under absorbing rate matrices, demonstrating improved rates over their counterparts using uniform rate matrices. Furthermore, under suitable assumptions, we provide convergence guarantees without early stopping. Our analysis introduces several new technical tools to address challenges unique to absorbing rate matrices. These include a Jensen-type argument for bounding forward process convergence, novel techniques for bounding absorbing score functions, and a non-divergent upper bound on the score near initialization that removes the need of early-stopping.
♻ ☆ Persistent Homology for Structural Characterization in Disordered Systems
We propose a unified framework based on persistent homology (PH) to characterize both local and global structures in disordered systems. It can simultaneously generate local and global descriptors using the same algorithm and data structure, and has shown to be highly effective and interpretable in predicting particle rearrangements and classifying global phases. We also demonstrated that using a single variable enables a linear SVM to achieve nearly perfect three-phase classification. Inspired by this discovery, we define a non-parametric metric, the Separation Index (SI), which not only achieves this classification without sacrificing significant performance but also establishes a connection between particle environments and the global phase structure. Our methods provide an effective framework for understanding and analyzing the properties of disordered materials, with broad potential applications in materials science and even wider studies of complex systems.
comment: 24 pages, 19 figures
♻ ☆ SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks EMNLP 2025
Direct alignment algorithms have proven an effective step for aligning language models to human-desired behaviors. Current variants of the Direct Preference Optimization objective have focused on a strict setting where all tokens are contributing signals of KL divergence and rewards to the loss function. However, human preference is not affected equally by each word in a sequence but is often dependent on specific words or phrases, e.g. existence of toxic terms leads to non-preferred responses. Based on this observation, we argue that not all tokens should be weighted equally during PO and propose a flexible objective termed SparsePO, that aims to automatically learn to weight the KL divergence and reward corresponding to each token during PO training. We propose two different variants of weight-masks that can either be derived from the reference model itself or learned on the fly. Notably, our method induces sparsity in the learned masks, allowing the model to learn how to best balance reward and KL divergence contributions at the token level, learning an optimal level of mask sparsity. Extensive experiments illustrate the effectiveness of our approach at aligning to preference proxies, including sentiment control, helpfulness and harmlessness, and summary quality. Our method obtains +10% and +3% win rate points in summarization and dialogue scenarios, respectively, without compromising model reasoning or the relevancy and faithfulness of the summary response.
comment: 27 pages, 9 figures, 5 tables. Accepted to EMNLP 2025
♻ ☆ Smooth Flow Matching
Functional data, i.e., smooth random functions observed over a continuous domain, are increasingly available in areas such as biomedical research, health informatics, and epidemiology. However, effective statistical analysis for functional data is often hindered by challenges such as privacy constraints, sparse and irregular sampling, infinite dimensionality, and non-Gaussian structures. To address these challenges, we introduce a novel framework named Smooth Flow Matching (SFM), tailored for generative modeling of functional data to enable statistical analysis without exposing sensitive real data. Built upon flow-matching ideas, SFM constructs a semiparametric copula flow to generate infinite-dimensional functional data, free from Gaussianity or low-rank assumptions. It is computationally efficient, handles irregular observations, and guarantees the smoothness of the generated functions, offering a practical and flexible solution in scenarios where existing deep generative methods are not applicable. Through extensive simulation studies, we demonstrate the advantages of SFM in terms of both synthetic data quality and computational efficiency. We then apply SFM to generate clinical trajectory data from the MIMIC-IV patient electronic health records (EHR) longitudinal database. Our analysis showcases the ability of SFM to produce high-quality surrogate data for downstream statistical tasks, highlighting its potential to boost the utility of EHR data for clinical applications.
comment: 66 pages, 8 figures
♻ ☆ HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Early-Exit Large Language Models (EE-LLMs) enable high throughput inference by allowing tokens to exit early at intermediate layers. However, their throughput is limited by the computational and memory savings. Existing EE-LLM frameworks rely on a single model and therefore, their token generation latencies are bottlenecked by tokens that do not exit early and traverse additional layers. Moreover, early exits are only known at runtime and depend on the request. Therefore, these frameworks load the weights of all model layers even though large portions remain unused when tokens exit early. The lack of memory savings limit us from scaling the batch sizes. We propose $\textit{HELIOS}$, a framework that improves both token generation latency and batch sizes to enable high-throughput in EE-LLMs. HELIOS exploits two insights. $\textit{First}$, early exits are often complimentary across models, tokens that do not exit early on one model often take an early-exit on another. HELIOS employs multiple models and dynamically switches between them to collectively maximize the number of tokens that exit early, and minimize token generation latencies. $\textit{Second}$, even when a predicted token does not exit early due to poor confidence, it often remains unchanged even after additional layer traversal. HELIOS greedily allows such tokens to exit early and only loads the weights of the most likely to be used layers, yielding memory savings which is then re-purposed to increase batch sizes. HELIOS employs real-time profiling to accurately identify the early-exit distributions, and adaptively switches between models by tracking tokens in real-time to minimize the performance degradation caused by greedy model loading and exiting. Our evaluations show that HELIOS achieves $1.48\times$ higher throughput and $15.14\times$ larger batch size compared to existing EE-LLM frameworks.
♻ ☆ Supervised Quadratic Feature Analysis: Information Geometry Approach for Dimensionality Reduction
Supervised dimensionality reduction maps labeled data into a low-dimensional feature space while preserving class discriminability. A common approach is to maximize a statistical measure of dissimilarity between classes in the feature space. Information geometry provides an alternative framework for measuring class dissimilarity, with the potential for improved insights and novel applications. Information geometry, which is grounded in Riemannian geometry, uses the Fisher information metric, a local measure of discriminability that induces the Fisher-Rao distance. Here, we present Supervised Quadratic Feature Analysis (SQFA), a linear dimensionality reduction method that maximizes Fisher-Rao distances between class-conditional distributions, under Gaussian assumptions. We motivate the Fisher-Rao distance as a good proxy for discriminability. We show that SQFA features support good classification performance with Quadratic Discriminant Analysis (QDA) on three real-world datasets. SQFA provides a novel framework for supervised dimensionality reduction, motivating future research in applying information geometry to machine learning and neuroscience.
comment: 24 pages, 15 figures
♻ ☆ Resource-Adaptive Successive Doubling for Hyperparameter Optimization with Large Datasets on High-Performance Computing Systems
On High-Performance Computing (HPC) systems, several hyperparameter configurations can be evaluated in parallel to speed up the Hyperparameter Optimization (HPO) process. State-of-the-art HPO methods follow a bandit-based approach and build on top of successive halving, where the final performance of a combination is estimated based on a lower than fully trained fidelity performance metric and more promising combinations are assigned more resources over time. Frequently, the number of epochs is treated as a resource, letting more promising combinations train longer. Another option is to use the number of workers as a resource and directly allocate more workers to more promising configurations via data-parallel training. This article proposes a novel Resource-Adaptive Successive Doubling Algorithm (RASDA), which combines a resource-adaptive successive doubling scheme with the plain Asynchronous Successive Halving Algorithm (ASHA). Scalability of this approach is shown on up to 1,024 Graphics Processing Units (GPUs) on modern HPC systems. It is applied to different types of Neural Networks (NNs) and trained on large datasets from the Computer Vision (CV), Computational Fluid Dynamics (CFD), and Additive Manufacturing (AM) domains, where performing more than one full training run is usually infeasible. Empirical results show that RASDA outperforms ASHA by a factor of up to 1.9 with respect to the runtime. At the same time, the solution quality of final ASHA models is maintained or even surpassed by the implicit batch size scheduling of RASDA. With RASDA, systematic HPO is applied to a terabyte-scale scientific dataset for the first time in the literature, enabling efficient optimization of complex models on massive scientific data. The implementation of RASDA is available on https://github.com/olympiquemarcel/rasda
♻ ☆ Data-Driven Stochastic Optimal Control in Reproducing Kernel Hilbert Spaces
This paper proposes a fully data-driven approach for optimal control of nonlinear control-affine systems represented by a stochastic diffusion. The focus is on the scenario where both the nonlinear dynamics and stage cost functions are unknown, while only a control penalty function and constraints are provided. To this end, we embed state probability densities into a reproducing kernel Hilbert space (RKHS) to leverage recent advances in operator regression, thereby identifying Markov transition operators associated with controlled diffusion processes. This operator learning approach integrates naturally with convex operator-theoretic Hamilton-Jacobi-Bellman recursions that scale linearly with state dimensionality, effectively solving a wide range of nonlinear optimal control problems. Numerical results demonstrate its ability to address diverse nonlinear control tasks, including the depth regulation of an autonomous underwater vehicle.
comment: author-submitted electronic preprint version: 19 pages, 5 figures, 3 tables
♻ ☆ A Regularized Newton Method for Nonconvex Optimization with Global and Local Complexity Guarantees NeurIPS 2025
Finding an $\epsilon$-stationary point of a nonconvex function with a Lipschitz continuous Hessian is a central problem in optimization. Regularized Newton methods are a classical tool and have been studied extensively, yet they still face a trade-off between global and local convergence. Whether a parameter-free algorithm of this type can simultaneously achieve optimal global complexity and quadratic local convergence remains an open question. To bridge this long-standing gap, we propose a new class of regularizers constructed from the current and previous gradients, and leverage the conjugate gradient approach with a negative curvature monitor to solve the regularized Newton equation. The proposed algorithm is adaptive, requiring no prior knowledge of the Hessian Lipschitz constant, and achieves a global complexity of $O(\epsilon^{-3/2})$ in terms of the second-order oracle calls, and $\tilde{O}(\epsilon^{-7/4})$ for Hessian-vector products, respectively. When the iterates converge to a point where the Hessian is positive definite, the method exhibits quadratic local convergence. Preliminary numerical results, including training the physics-informed neural networks, illustrate the competitiveness of our algorithm.
comment: NeurIPS 2025
♻ ☆ A Practical Introduction to Kernel Discrepancies: MMD, HSIC & KSD
This article provides a practical introduction to kernel discrepancies, focusing on the Maximum Mean Discrepancy (MMD), the Hilbert-Schmidt Independence Criterion (HSIC), and the Kernel Stein Discrepancy (KSD). Various estimators for these discrepancies are presented, including the commonly-used V-statistics and U-statistics, as well as several forms of the more computationally-efficient incomplete U-statistics. The importance of the choice of kernel bandwidth is stressed, showing how it affects the behaviour of the discrepancy estimation. Adaptive estimators are introduced, which combine multiple estimators with various kernels, addressing the problem of kernel selection.
comment: 26 pages, 7 figures, v2 fixes Equation 16, v3 fixes cross-covariance notation p3
♻ ☆ Challenges learning from imbalanced data using tree-based models: Prevalence estimates systematically depend on hyperparameters and can be upwardly biased
When using machine learning for imbalanced binary classification problems, it is common to subsample the majority class to create a (more) balanced training dataset. This biases the model's predictions because the model learns from data whose data generating process differs from new data. One way of accounting for this bias is analytically mapping the resulting predictions to new values based on the sampling rate for the majority class. We show that calibrating a random forest this way has negative consequences, including prevalence estimates that depend on both the number of predictors considered at each split in the random forest and the sampling rate used. We explain the former using known properties of random forests and analytical calibration. Through investigating the latter issue, we made a surprising discovery - contrary to the widespread belief that decision trees are biased towards the majority class, they actually can be biased towards the minority class.
♻ ☆ RObotic MAnipulation Network (ROMAN) -- Hybrid Hierarchical Learning for Solving Complex Sequential Tasks
Solving long sequential tasks poses a significant challenge in embodied artificial intelligence. Enabling a robotic system to perform diverse sequential tasks with a broad range of manipulation skills is an active area of research. In this work, we present a Hybrid Hierarchical Learning framework, the Robotic Manipulation Network (ROMAN), to address the challenge of solving multiple complex tasks over long time horizons in robotic manipulation. ROMAN achieves task versatility and robust failure recovery by integrating behavioural cloning, imitation learning, and reinforcement learning. It consists of a central manipulation network that coordinates an ensemble of various neural networks, each specialising in distinct re-combinable sub-tasks to generate their correct in-sequence actions for solving complex long-horizon manipulation tasks. Experimental results show that by orchestrating and activating these specialised manipulation experts, ROMAN generates correct sequential activations for accomplishing long sequences of sophisticated manipulation tasks and achieving adaptive behaviours beyond demonstrations, while exhibiting robustness to various sensory noises. These results demonstrate the significance and versatility of ROMAN's dynamic adaptability featuring autonomous failure recovery capabilities, and highlight its potential for various autonomous manipulation tasks that demand adaptive motor skills.
comment: To appear in Nature Machine Intelligence. Includes the main and supplementary manuscript. Total of 70 pages, with a total of 9 Figures and 17 Tables
♻ ☆ DO-IQS: Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping with Unknown Gain Functions
We consider the Inverse Optimal Stopping (IOS) problem where, based on stopped expert trajectories, one aims to recover the optimal stopping region through the continuation and stopping gain functions approximation. The uniqueness of the stopping region allows the use of IOS in real-world applications with safety concerns. Although current state-of-the-art inverse reinforcement learning methods recover both a Q-function and the corresponding optimal policy, they fail to account for specific challenges posed by optimal stopping problems. These include data sparsity near the stopping region, the non-Markovian nature of the continuation gain, a proper treatment of boundary conditions, the need for a stable offline approach for risk-sensitive applications, and a lack of a quality evaluation metric. These challenges are addressed with the proposed Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping (DO-IQS), which incorporates temporal information by approximating the cumulative continuation gain together with the world dynamics and the Q-function without querying to the environment. In addition, a confidence-based oversampling approach is proposed to treat the data sparsity problem. We demonstrate the performance of our models on real and artificial data including an optimal intervention for the critical events problem.
♻ ☆ PoLAR: Polar-Decomposed Low-Rank Adapter Representation
We show that low-rank adaptation of large-scale models suffers from a low stable rank that is well below the linear algebraic rank of the subspace, degrading fine-tuning performance. To mitigate the underutilization of the allocated subspace, we propose PoLAR, a parameterization inspired by the polar decomposition that factorizes the low-rank update into two direction matrices constrained to Stiefel manifolds and an unconstrained scale matrix. Our theory shows that PoLAR yields an exponentially faster convergence rate on a canonical low-rank adaptation problem. Pairing the parameterization with Riemannian optimization leads to consistent gains on three different benchmarks testing general language understanding, commonsense reasoning, and mathematical problem solving with base model sizes ranging from 350M to 27B.
♻ ☆ A tutorial on discovering and quantifying the effect of latent causal sources of multimodal EHR data
We provide an accessible description of a peer-reviewed generalizable causal machine learning pipeline to (i) discover latent causal sources of large-scale electronic health records observations, and (ii) quantify the source causal effects on clinical outcomes. We illustrate how imperfect multimodal clinical data can be processed, decomposed into probabilistic independent latent sources, and used to train taskspecific causal models from which individual causal effects can be estimated. We summarize the findings of the two real-world applications of the approach to date as a demonstration of its versatility and utility for medical discovery at scale.
comment: Accepted at the 1st Multimodal Representation Learning for Healthcare EurIPS 2025 Workshop (https://multimodal-rep-learning-for-health.github.io/)
♻ ☆ NaviAgent: Bilevel Planning on Tool Navigation Graph for Large-Scale Orchestration
Large language models (LLMs) have recently demonstrated the ability to act as function call agents by invoking external tools, enabling them to solve tasks beyond their static knowledge. However, existing agents typically call tools step by step at a time without a global view of task structure. As tools depend on each other, this leads to error accumulation and limited scalability, particularly when scaling to thousands of tools. To address these limitations, we propose NaviAgent, a novel bilevel architecture that decouples task planning from tool execution through graph-based modeling of the tool ecosystem. At the task-planning level, the LLM-based agent decides whether to respond directly, clarify user intent, invoke a toolchain, or execute tool outputs, ensuring broad coverage of interaction scenarios independent of inter-tool complexity. At the execution level, a continuously evolving Tool World Navigation Model (TWNM) encodes structural and behavioral relations among tools, guiding the agent to generate scalable and robust invocation sequences. By incorporating feedback from real tool interactions, NaviAgent supports closed-loop optimization of planning and execution, moving beyond tool calling toward adaptive navigation of large-scale tool ecosystems. Experiments show that NaviAgent achieves the best task success rates across models and tasks, and integrating TWMN further boosts performance by up to 17 points on complex tasks, underscoring its key role in toolchain orchestration.
♻ ☆ Byzantine Resilient Federated Multi-Task Representation Learning
In this paper, we propose BR-MTRL, a Byzantine-resilient multi-task representation learning framework that handles faulty or malicious agents. Our approach leverages representation learning through a shared neural network model, where all clients share fixed layers, except for a client-specific final layer. This structure captures shared features among clients while enabling individual adaptation, making it a promising approach for leveraging client data and computational power in heterogeneous federated settings to learn personalized models. To learn the model, we employ an alternating gradient descent strategy: each client optimizes its local model, updates its final layer, and sends estimates of the shared representation to a central server for aggregation. To defend against Byzantine agents, we employ two robust aggregation methods for client-server communication, Geometric Median and Krum. Our method enables personalized learning while maintaining resilience in distributed settings. We implemented the proposed algorithm in a federated testbed built using Amazon Web Services (AWS) platform and compared its performance with various benchmark algorithms and their variations. Through experiments using real-world datasets, including CIFAR-10 and FEMNIST, we demonstrated the effectiveness and robustness of our approach and its transferability to new unseen clients with limited data, even in the presence of Byzantine adversaries.
♻ ☆ Convergence of continuous-time stochastic gradient descent with applications to deep neural networks
We study a continuous-time approximation of the stochastic gradient descent process for minimizing the population expected loss in learning problems. The main results establish general sufficient conditions for the convergence, extending the results of Chatterjee (2022) established for (nonstochastic) gradient descent. We show how the main result can be applied to the case of overparametrized neural network training.
♻ ☆ ESTformer: Transformer utilising spatiotemporal dependencies for electroencephalogram super-resolution
Towards practical applications of Electroencephalography (EEG), lightweight acquisition devices garner significant attention. However, EEG channel selection methods are commonly data-sensitive and cannot establish a unified sound paradigm for EEG acquisition devices. Through reverse conceptualisation, we formulated EEG applications in an EEG super-resolution (SR) manner, but suffered from high computation costs, extra interpolation bias, and few insights into spatiotemporal dependency modelling. To this end, we propose ESTformer, an EEG SR framework that utilises spatiotemporal dependencies based on the transformer. ESTformer applies positional encoding methods and a multihead self-attention mechanism to the space and time dimensions, which can learn spatial structural correlations and temporal functional variations. ESTformer, with the fixed mask strategy, adopts a mask token to upsample low-resolution (LR) EEG data in the case of disturbance from mathematical interpolation methods. On this basis, we designed various transformer blocks to construct a spatial interpolation module (SIM) and a temporal reconstruction module (TRM). Finally, ESTformer cascades the SIM and TRM to capture and model the spatiotemporal dependencies for EEG SR with fidelity. Extensive experimental results on two EEG datasets show the effectiveness of ESTformer against previous state-of-the-art methods, demonstrating the versatility of the Transformer for EEG SR tasks. The superiority of the SR data was verified in an EEG-based person identification and emotion recognition task, achieving a 2% to 38% improvement compared with the LR data at different sampling scales.
comment: Accepted by Knowledge-Based Systems
♻ ☆ Generating Auxiliary Tasks with Reinforcement Learning
Auxiliary Learning (AL) is a form of multi-task learning in which a model trains on auxiliary tasks to boost performance on a primary objective. While AL has improved generalization across domains such as navigation, image classification, and NLP, it often depends on human-labeled auxiliary tasks that are costly to design and require domain expertise. Meta-learning approaches mitigate this by learning to generate auxiliary tasks, but typically rely on gradient based bi-level optimization, adding substantial computational and implementation overhead. We propose RL-AUX, a reinforcement-learning (RL) framework that dynamically creates auxiliary tasks by assigning auxiliary labels to each training example, rewarding the agent whenever its selections improve the performance on the primary task. We also explore learning per-example weights for the auxiliary loss. On CIFAR-100 grouped into 20 superclasses, our RL method outperforms human-labeled auxiliary tasks and matches the performance of a prominent bi-level optimization baseline. We present similarly strong results on other classification datasets. These results suggest RL is a viable path to generating effective auxiliary tasks.
♻ ☆ Accelerated Rates between Stochastic and Adversarial Online Convex Optimization NeurIPS 2022
Stochastic and adversarial data are two widely studied settings in online learning. But many optimization tasks are neither i.i.d. nor fully adversarial, which makes it of fundamental interest to get a better theoretical understanding of the world between these extremes. In this work we establish novel regret bounds for online convex optimization in a setting that interpolates between stochastic i.i.d. and fully adversarial losses. By exploiting smoothness of the expected losses, these bounds replace a dependence on the maximum gradient length by the variance of the gradients, which was previously known only for linear losses. In addition, they weaken the i.i.d. assumption by allowing, for example, adversarially poisoned rounds, which were previously considered in the related expert and bandit settings. In the fully i.i.d. case, our regret bounds match the rates one would expect from results in stochastic acceleration, and we also recover the optimal stochastically accelerated rates via online-to-batch conversion. In the fully adversarial case our bounds gracefully deteriorate to match the minimax regret. We further provide lower bounds showing that our regret upper bounds are tight for all intermediate regimes in terms of the stochastic variance and the adversarial variation of the loss gradients.
comment: There is an unfixable mistake in the proof of Lemma 15, as kindly brought to our attention by Peng Zhao. The mistake is not present in the earlier NeurIPS 2022 conference version of the paper, which does not include this result
♻ ☆ Token Distillation: Attention-aware Input Embeddings For New Tokens
Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary. New tokens can be added to solve this problem, when coupled with a good initialization for their new embeddings. However, existing embedding initialization methods require expensive further training or pretraining of additional modules. In this paper, we propose Token Distillation and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens. Experimental results with a wide range of open-weight models show that Token Distillation outperforms even strong baselines.
comment: Additional experiments + clearer method name compared to the May 2025 version
♻ ☆ Fast Adversarial Training against Sparse Attacks Requires Loss Smoothing
This paper studies fast adversarial training against sparse adversarial perturbations bounded by $l_0$ norm. We demonstrate the challenges of employing $1$-step attacks on $l_0$ bounded perturbations for fast adversarial training, including degraded performance and the occurrence of catastrophic overfitting (CO). We highlight that CO in $l_0$ adversarial training is caused by sub-optimal perturbation locations of $1$-step attack. Theoretical and empirical analyses reveal that the loss landscape of $l_0$ adversarial training is more craggy compared to its $l_\infty$, $l_2$ and $l_1$ counterparts. Moreover, we corroborate that the craggy loss landscape can aggravate CO. To address these issues, we propose Fast-LS-$l_0$ that incorporates soft labels and the trade-off loss function to smooth the adversarial loss landscape. Extensive experiments demonstrate our method can overcome the challenge of catastrophic overfitting, achieve state-of-the-art performance, and narrow down the performance gap between $1$-step and multi-step adversarial training against sparse attacks.
♻ ☆ Fair Play for Individuals, Foul Play for Groups? Auditing Anonymization's Impact on ML Fairness ECAI 2025
Machine learning (ML) algorithms are heavily based on the availability of training data, which, depending on the domain, often includes sensitive information about data providers. This raises critical privacy concerns. Anonymization techniques have emerged as a practical solution to address these issues by generalizing features or suppressing data to make it more difficult to accurately identify individuals. Although recent studies have shown that privacy-enhancing technologies can influence ML predictions across different subgroups, thus affecting fair decision-making, the specific effects of anonymization techniques, such as $k$-anonymity, $\ell$-diversity, and $t$-closeness, on ML fairness remain largely unexplored. In this work, we systematically audit the impact of anonymization techniques on ML fairness, evaluating both individual and group fairness. Our quantitative study reveals that anonymization can degrade group fairness metrics by up to fourfold. Conversely, similarity-based individual fairness metrics tend to improve under stronger anonymization, largely as a result of increased input homogeneity. By analyzing varying levels of anonymization across diverse privacy settings and data distributions, this study provides critical insights into the trade-offs between privacy, fairness, and utility, offering actionable guidelines for responsible AI development. Our code is publicly available at: https://github.com/hharcolezi/anonymity-impact-fairness.
comment: ECAI 2025
♻ ☆ Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
Numerous applications of large language models (LLMs) rely on their ability to perform step-by-step reasoning. However, the reasoning behavior of LLMs remains poorly understood, posing challenges to research, development, and safety. To address this gap, we introduce landscape of thoughts (LoT), the first landscape visualization tool to inspect the reasoning trajectories with certain reasoning methods on any multi-choice dataset. We represent the textual states in a trajectory as numerical features that quantify the states' distances to the answer choices. These features are then visualized in two-dimensional plots using t-SNE. Qualitative and quantitative analysis with the landscape of thoughts effectively distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty. Additionally, users can adapt LoT to a model that predicts the property they observe. We showcase this advantage by adapting LoT to a lightweight verifier that evaluates the correctness of trajectories. Empirically, this verifier boosts the reasoning accuracy and the test-time scaling effect. The code is publicly available at: https://github.com/tmlr-group/landscape-of-thoughts.
♻ ☆ Eliciting Secret Knowledge from Language Models
We study secret elicitation: discovering knowledge that an AI possesses but does not explicitly verbalize. As a testbed, we train three families of large language models (LLMs) to possess specific knowledge that they apply downstream but deny knowing when asked directly. For example, in one setting, we train an LLM to generate replies that are consistent with knowing the user is female, while denying this knowledge when asked directly. We then design various black-box and white-box secret elicitation techniques and evaluate them based on whether they can help an LLM auditor successfully guess the secret knowledge. Many of our techniques improve on simple baselines. Our most effective techniques (performing best in all settings) are based on prefill attacks, a black-box technique where the LLM reveals secret knowledge when generating a completion from a predefined prefix. Our white-box techniques based on logit lens and sparse autoencoders (SAEs) also consistently increase the success rate of the LLM auditor, but are less effective. We release our models and code, establishing a public benchmark for evaluating secret elicitation methods.
♻ ☆ RTNinja: a generalized machine learning framework for analyzing random telegraph noise signals in nanoelectronic devices
Random telegraph noise is a prevalent variability phenomenon in nanoelectronic devices, arising from stochastic carrier exchange at defect sites and critically impacting device reliability and performance. Conventional analysis techniques often rely on restrictive assumptions or manual interventions, limiting their applicability to complex, noisy datasets. Here, we introduce RTNinja, a generalized, fully automated machine learning framework for the unsupervised analysis of random telegraph noise signals. RTNinja deconvolves complex signals to identify the number and characteristics of hidden individual sources, without requiring prior knowledge of the system. The framework comprises two modular components: LevelsExtractor, which uses Bayesian inference and model selection to denoise and discretize the signal; and SourcesMapper, which infers source configurations through probabilistic clustering and optimization. To evaluate performance, we developed a Monte Carlo simulator that generates labeled datasets spanning broad signal-to-noise ratios and source complexities; across 7000 such datasets, RTNinja consistently demonstrated high-fidelity signal reconstruction and accurate extraction of source amplitudes and activity patterns. Our results demonstrate that RTNinja offers a robust, scalable, and device-agnostic tool for random telegraph noise characterization, enabling large-scale statistical benchmarking, reliability-centric technology qualification, predictive failure modeling, and device physics exploration in next-generation nanoelectronics.
♻ ☆ DualOptim: Enhancing Efficacy and Stability in Machine Unlearning with Dual Optimizers
Existing machine unlearning (MU) approaches exhibit significant sensitivity to hyperparameters, requiring meticulous tuning that limits practical deployment. In this work, we first empirically demonstrate the instability and suboptimal performance of existing popular MU methods when deployed in different scenarios. To address this issue, we propose Dual Optimizer (DualOptim), which incorporates adaptive learning rate and decoupled momentum factors. Empirical and theoretical evidence demonstrates that DualOptim contributes to effective and stable unlearning. Through extensive experiments, we show that DualOptim can significantly boost MU efficacy and stability across diverse tasks, including image classification, image generation, and large language models, making it a versatile approach to empower existing MU algorithms.
♻ ☆ An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning
Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-$K$ discards structural information and performs poorly in practice, while Top-$K$ preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-$K$, an {All-Reduce}-Compatible Top-$K$ compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-$K$ is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions. Empirically, ARC-Top-$K$ matches the accuracy of Top-$K$ while reducing wall-clock training time by up to 60.7\%, offering an efficient and scalable solution that combines the robustness of Rand-$K$ with the strong performance of Top-$K$.
comment: 8 pages, 2 figures
♻ ☆ Qini Curve Estimation under Clustered Network Interference
Qini curves are a widely used tool for assessing treatment policies under allocation constraints as they visualize the incremental gain of a new treatment policy versus the cost of its implementation. Standard Qini curve estimation assumes no interference between units: that is, that treating one unit does not influence the outcome of any other unit. In many real-life applications such as public policy or marketing, however, the presence of interference is common. Ignoring interference in these scenarios can lead to systematically biased Qini curves that over- or under-estimate a treatment policy's cost-effectiveness. In this paper, we address the problem of Qini curve estimation under clustered network interference, where interfering units form independent clusters. We propose a formal description of the problem setting with an experimental study design under which we can account for clustered network interference. Within this framework, we describe three estimation strategies, each suited to different conditions, and provide guidance for selecting the most appropriate approach by highlighting the inherent bias-variance trade-offs. To complement our theoretical analysis, we introduce a marketplace simulator that replicates clustered network interference in a typical e-commerce environment, allowing us to evaluate and compare the proposed strategies in practice.
♻ ☆ Geometry-Aware Edge Pooling for Graph Neural Networks NeurIPS
Graph Neural Networks (GNNs) have shown significant success for graph-based tasks. Motivated by the prevalence of large datasets in real-world applications, pooling layers are crucial components of GNNs. By reducing the size of input graphs, pooling enables faster training and potentially better generalisation. However, existing pooling operations often optimise for the learning task at the expense of discarding fundamental graph structures, thus reducing interpretability. This leads to unreliable performance across dataset types, downstream tasks and pooling ratios. Addressing these concerns, we propose novel graph pooling layers for structure-aware pooling via edge collapses. Our methods leverage diffusion geometry and iteratively reduce a graph's size while preserving both its metric structure and its structural diversity. We guide pooling using magnitude, an isometry-invariant diversity measure, which permits us to control the fidelity of the pooling process. Further, we use the spread of a metric space as a faster and more stable alternative ensuring computational efficiency. Empirical results demonstrate that our methods (i) achieve top performance compared to alternative pooling layers across a range of diverse graph classification tasks, (ii) preserve key spectral properties of the input graphs, and (iii) retain high accuracy across varying pooling ratios.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS) 2025. Our code is available at https://github.com/aidos-lab/mag_edge_pool
♻ ☆ Uncertainty-Based Smooth Policy Regularisation for Reinforcement Learning with Few Demonstrations
In reinforcement learning with sparse rewards, demonstrations can accelerate learning, but determining when to imitate them remains challenging. We propose Smooth Policy Regularisation from Demonstrations (SPReD), a framework that addresses the fundamental question: when should an agent imitate a demonstration versus follow its own policy? SPReD uses ensemble methods to explicitly model Q-value distributions for both demonstration and policy actions, quantifying uncertainty for comparisons. We develop two complementary uncertainty-aware methods: a probabilistic approach estimating the likelihood of demonstration superiority, and an advantage-based approach scaling imitation by statistical significance. Unlike prevailing methods (e.g. Q-filter) that make binary imitation decisions, SPReD applies continuous, uncertainty-proportional regularisation weights, reducing gradient variance during training. Despite its computational simplicity, SPReD achieves remarkable gains in experiments across eight robotics tasks, outperforming existing approaches by up to a factor of 14 in complex tasks while maintaining robustness to demonstration quality and quantity. Our code is available at https://github.com/YujieZhu7/SPReD.
♻ ☆ Khiops: An End-to-End, Frugal AutoML and XAI Machine Learning Solution for Large, Multi-Table Databases
Khiops is an open source machine learning tool designed for mining large multi-table databases. Khiops is based on a unique Bayesian approach that has attracted academic interest with more than 20 publications on topics such as variable selection, classification, decision trees and co-clustering. It provides a predictive measure of variable importance using discretisation models for numerical data and value clustering for categorical data. The proposed classification/regression model is a naive Bayesian classifier incorporating variable selection and weight learning. In the case of multi-table databases, it provides propositionalisation by automatically constructing aggregates. Khiops is adapted to the analysis of large databases with millions of individuals, tens of thousands of variables and hundreds of millions of records in secondary tables. It is available on many environments, both from a Python library and via a user interface.
♻ ☆ Towards User-Focused Research in Training Data Attribution for Human-Centered Explainable AI
Explainable AI (XAI) aims to make AI systems more transparent, yet many practices emphasise mathematical rigour over practical user needs. We propose an alternative to this model-centric approach by following a design thinking process for the emerging XAI field of training data attribution (TDA), which risks repeating solutionist patterns seen in other subfields. However, because TDA is in its early stages, there is a valuable opportunity to shape its direction through user-centred practices. We engage directly with machine learning developers via a needfinding interview study (N=6) and a scenario-based interactive user study (N=31) to ground explanations in real workflows. Our exploration of the TDA design space reveals novel tasks for data-centric explanations useful to developers, such as grouping training samples behind specific model behaviours or identifying undersampled data. We invite the TDA, XAI, and HCI communities to engage with these tasks to strengthen their research's practical relevance and human impact.
♻ ☆ SC-LoRA: Balancing Efficient Fine-tuning and Knowledge Preservation via Subspace-Constrained LoRA
Parameter-Efficient Fine-Tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), are indispensable for efficiently customizing Large Language Models (LLMs). However, vanilla LoRA suffers from slow convergence speed and knowledge forgetting problems. Recent studies have leveraged the power of designed LoRA initialization, to enhance the fine-tuning efficiency, or to preserve knowledge in the pre-trained LLM. However, none of these works can address the two cases at the same time. To this end, we introduce Subspace-Constrained LoRA (SC-LoRA), a novel LoRA initialization framework engineered to navigate the trade-off between efficient fine-tuning and knowledge preservation. We achieve this by constraining the output of trainable LoRA adapters in a low-rank subspace, where the context information of fine-tuning data is most preserved while the context information of preserved knowledge is least retained, in a balanced way. Such constraint enables the trainable weights to primarily focus on the main features of fine-tuning data while avoiding damaging the preserved knowledge features. We provide theoretical analysis on our method, and conduct extensive experiments including safety preservation and world knowledge preservation, on various downstream tasks. In our experiments, SC-LoRA succeeds in delivering superior fine-tuning performance while markedly diminishing knowledge forgetting, surpassing contemporary LoRA initialization methods.
♻ ☆ Manifold Learning for Hyperspectral Images
Traditional feature extraction and projection techniques, such as Principal Component Analysis, struggle to adequately represent X-Ray Transmission (XRT) Multi-Energy (ME) images, limiting the performance of neural networks in decision-making processes. To address this issue, we propose a method that approximates the dataset topology by constructing adjacency graphs using the Uniform Manifold Approximation and Projection. This approach captures nonlinear correlations within the data, significantly improving the performance of machine learning algorithms, particularly in processing Hyperspectral Images (HSI) from X-ray transmission spectroscopy. This technique not only preserves the global structure of the data but also enhances feature separability, leading to more accurate and robust classification results.
♻ ☆ Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates
Clinical trials are a systematic endeavor to assess the safety and efficacy of new drugs or treatments. Conducting such trials typically demands significant financial investment and meticulous planning, highlighting the need for accurate predictions of trial outcomes. Accurately predicting patient enrollment, a key factor in trial success, is one of the primary challenges during the planning phase. In this work, we propose a novel deep learning-based method to address this critical challenge. Our method, implemented as a neural network model, leverages pre-trained language models (PLMs) to capture the complexities and nuances of clinical documents, transforming them into expressive representations. These representations are then combined with encoded tabular features via an attention mechanism. To account for uncertainties in enrollment prediction, we enhance the model with a probabilistic layer based on the Gamma distribution, which enables range estimation. We apply the proposed model to predict clinical trial duration, assuming site-level enrollment follows a Poisson-Gamma process. We carry out extensive experiments on real-world clinical trial data, and show that the proposed method can effectively predict the number of patients enrolled at a number of sites for a given clinical trial, outperforming established baseline models.
♻ ☆ Towards a Generalizable AI for Materials Discovery: Validation through Immersion Coolant Screening
Artificial intelligence (AI) has emerged as a powerful accelerator of materials discovery, yet most existing models remain problem-specific, requiring additional data collection and retraining for each new property. Here we introduce and validate GATE (Geometrically Aligned Transfer Encoder) -- a generalizable AI framework that jointly learns 34 physicochemical properties spanning thermal, electrical, mechanical, and optical domains. By aligning these properties within a shared geometric space, GATE captures cross-property correlations that reduce disjoint-property bias -- a key factor causing false positives in multi-criteria screening. To demonstrate its generalizable utility, GATE -- without any problem-specific model reconfiguration -- applied to the discovery of immersion cooling fluids for data centers, a stringent real-world challenge defined by the Open Compute Project (OCP). Screening billions of candidates, GATE identified 92,861 molecules as promising for practical deployment. Four were experimentally or literarily validated, showing strong agreement with wet-lab measurements and performance comparable to or exceeding a commercial coolant. These results establish GATE as a generalizable AI platform readily applicable across diverse materials discovery tasks.
comment: 16 pages, 4 figures
♻ ☆ Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies
Deep reinforcement learning has achieved impressive success in control tasks. However, its policies, represented as opaque neural networks, are often difficult for humans to understand, verify, and debug, which undermines trust and hinders real-world deployment. This work addresses this challenge by introducing a novel approach for programmatic control policy discovery, called Multimodal Large Language Model-assisted Evolutionary Search (MLES). MLES utilizes multimodal large language models as programmatic policy generators, combining them with evolutionary search to automate policy generation. It integrates visual feedback-driven behavior analysis within the policy generation process to identify failure patterns and guide targeted improvements, thereby enhancing policy discovery efficiency and producing adaptable, human-aligned policies. Experimental results demonstrate that MLES achieves performance comparable to Proximal Policy Optimization (PPO) across two standard control tasks while providing transparent control logic and traceable design processes. This approach also overcomes the limitations of predefined domain-specific languages, facilitates knowledge transfer and reuse, and is scalable across various tasks, showing promise as a new paradigm for developing transparent and verifiable control policies.
♻ ☆ Offline Clustering of Preference Learning with Active-data Augmentation
Preference learning from pairwise feedback is a widely adopted framework in applications such as reinforcement learning with human feedback and recommendations. In many practical settings, however, user interactions are limited or costly, making offline preference learning necessary. Moreover, real-world preference learning often involves users with different preferences. For example, annotators from different backgrounds may rank the same responses differently. This setting presents two central challenges: (1) identifying similarity across users to effectively aggregate data, especially under scenarios where offline data is imbalanced across dimensions, and (2) handling the imbalanced offline data where some preference dimensions are underrepresented. To address these challenges, we study the Offline Clustering of Preference Learning problem, where the learner has access to fixed datasets from multiple users with potentially different preferences and aims to maximize utility for a test user. To tackle the first challenge, we first propose Off-C$^2$PL for the pure offline setting, where the learner relies solely on offline data. Our theoretical analysis provides a suboptimality bound that explicitly captures the tradeoff between sample noise and bias. To address the second challenge of inbalanced data, we extend our framework to the setting with active-data augmentation where the learner is allowed to select a limited number of additional active-data for the test user based on the cluster structure learned by Off-C$^2$PL. In this setting, our second algorithm, A$^2$-Off-C$^2$PL, actively selects samples that target the least-informative dimensions of the test user's preference. We prove that these actively collected samples contribute more effectively than offline ones. Finally, we validate our theoretical results through simulations on synthetic and real-world datasets.
♻ ☆ Aeolus: A Multi-structural Flight Delay Dataset
We introduce Aeolus, a large-scale Multi-modal Flight Delay Dataset designed to advance research on flight delay prediction and support the development of foundation models for tabular data. Existing datasets in this domain are typically limited to flat tabular structures and fail to capture the spatiotemporal dynamics inherent in delay propagation. Aeolus addresses this limitation by providing three aligned modalities: (i) a tabular dataset with rich operational, meteorological, and airportlevel features for over 50 million flights; (ii) a flight chain module that models delay propagation along sequential flight legs, capturing upstream and downstream dependencies; and (iii) a flight network graph that encodes shared aircraft, crew, and airport resource connections, enabling cross-flight relational reasoning. The dataset is carefully constructed with temporal splits, comprehensive features, and strict leakage prevention to support realistic and reproducible machine learning evaluation. Aeolus supports a broad range of tasks, including regression, classification, temporal structure modeling, and graph learning, serving as a unified benchmark across tabular, sequential, and graph modalities. We release baseline experiments and preprocessing tools to facilitate adoption. Aeolus fills a key gap for both domain-specific modeling and general-purpose structured data research.Our source code and data can be accessed at https://github.com/Flnny/Delay-data
♻ ☆ LLM Based Long Code Translation using Identifier Replacement
In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often struggle with long source codes that don't fit into the context window, which produces inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting user-given long identifiers with generalized placeholders during translation, our method allows the LLM to focus on the logical structure of the code, by reducing token count and memory usage, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translation results with reduced tokens.
♻ ☆ Kernel Mean Embedding Topology: Weak and Strong Forms for Stochastic Kernels and Implications for Model Learning
We introduce a novel topology, called Kernel Mean Embedding Topology, for stochastic kernels, in a weak and strong form. This topology, defined on the spaces of Bochner integrable functions from a signal space to a space of probability measures endowed with a Hilbert space structure, allows for a versatile formulation. This construction allows one to obtain both a strong and weak formulation. (i) For its weak formulation, we highlight the utility on relaxed policy spaces, and investigate connections with the Young narrow topology and Borkar (or \( w^* \))-topology, and establish equivalence properties. We report that, while both the \( w^* \)-topology and kernel mean embedding topology are relatively compact, they are not closed. Conversely, while the Young narrow topology is closed, it lacks relative compactness. (ii) We show that the strong form provides an appropriate formulation for placing topologies on spaces of models characterized by stochastic kernels with explicit robustness and learning theoretic implications on optimal stochastic control under discounted or average cost criteria. (iii) We thus show that this topology possesses several properties making it ideal to study optimality and approximations (under the weak formulation) and robustness (under the strong formulation) for many applications.
comment: 37 pages
♻ ☆ LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models
Mixture of experts (MoE) architectures have become a cornerstone for scaling up and are a key component in most large language models such as GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE remains severely constrained by the prohibitive computational costs of training and evaluation, restricting large-scale studies accessible to most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research that supports both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analytical tools for probing routing and expert dynamics. Leveraging this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, demonstrating how subtle changes in router initialization shape early expert utilization; and (iii) training regime differences, revealing how sparse upcycling and full pretraining exhibit distinct routing patterns and stability profiles. By lowering the barrier to entry and standardizing evaluation, along with our comprehensive analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovations. Project page: https://fsoft-aic.github.io/fsoft-LibMoE.github.io.
comment: 15 pages, 9 figures
♻ ☆ A Robust and Non-Iterative Tensor Decomposition Method with Automatic Thresholding
Recent advances in IoT and biometric sensing technologies have led to the generation of massive and high-dimensional tensor data, yet achieving accurate and efficient low-rank approximation remains a major challenge. Most existing tensor decomposition methods require predefined ranks and iterative optimization, resulting in high computational costs and dependence on analyst expertise. This study proposes a novel tensor low-rank approximation method that eliminates both prior rank specification and iterative optimization. The method applies statistical singular value hard thresholding to each mode-wise unfolded matrix to automatically extract statistically significant components, effectively reducing noise while preserving the intrinsic structure. Theoretically, the optimal thresholds for each mode are derived from the asymptotic properties of the Marcenko-Pastur distribution. Simulation experiments demonstrate that the proposed method outperforms conventional approaches (HOSVD, HOOI, and Tucker-L2E) in both estimation accuracy and computational efficiency. These results indicate that the proposed approach provides a theoretically grounded, fully automatic, and non-iterative framework for tensor decomposition.
comment: 26 pages, 3 figures
♻ ☆ Accelerating Data Generation for Nonlinear temporal PDEs via homologous perturbation in solution space
Data-driven deep learning methods like neural operators have advanced in solving nonlinear temporal partial differential equations (PDEs). However, these methods require large quantities of solution pairs\u2014the solution functions and right-hand sides (RHS) of the equations. These pairs are typically generated via traditional numerical methods, which need thousands of time steps iterations far more than the dozens required for training, creating heavy computational and temporal overheads. To address these challenges, we propose a novel data generation algorithm, called HOmologous Perturbation in Solution Space (HOPSS), which directly generates training datasets with fewer time steps rather than following the traditional approach of generating large time steps datasets. This algorithm simultaneously accelerates dataset generation and preserves the approximate precision required for model training. Specifically, we first obtain a set of base solution functions from a reliable solver, usually with thousands of time steps, and then align them in time steps with training datasets by downsampling. Subsequently, we propose a "homologous perturbation" approach: by combining two solution functions (one as the primary function, the other as a homologous perturbation term scaled by a small scalar) with random noise, we efficiently generate comparable-precision PDE data points. Finally, using these data points, we compute the variation in the original equation's RHS to form new solution pairs. Theoretical and experimental results show HOPSS lowers time complexity. For example, on the Navier-Stokes equation, it generates 10,000 samples in approximately 10% of traditional methods' time, with comparable model training performance.
♻ ☆ Hybrid Dual-Batch and Cyclic Progressive Learning for Efficient Distributed Training
Distributed machine learning is critical for training deep learning models on large datasets with numerous parameters. Current research primarily focuses on leveraging additional hardware resources and powerful computing units to accelerate the training process. As a result, larger batch sizes are often employed to speed up training. However, training with large batch sizes can lead to lower accuracy due to poor generalization. To address this issue, we propose the dual-batch learning scheme, a distributed training method built on the parameter server framework. This approach maximizes training efficiency by utilizing the largest batch size that the hardware can support while incorporating a smaller batch size to enhance model generalization. By using two different batch sizes simultaneously, this method improves accuracy with minimal additional training time. Additionally, to mitigate the time overhead caused by dual-batch learning, we propose the cyclic progressive learning scheme. This technique repeatedly and gradually increases image resolution from low to high during training, thereby reducing training time. By combining cyclic progressive learning with dual-batch learning, our hybrid approach improves both model generalization and training efficiency. Experimental results with ResNet-18 demonstrate that, compared to conventional training methods, our approach improves accuracy by 3.3% while reducing training time by 10.1% on CIFAR-100, and further achieves a 34.8% reduction in training time on ImageNet.
♻ ☆ Querying functional and structural niches on spatial transcriptomics data
Cells in multicellular organisms coordinate to form functional and structural niches. With spatial transcriptomics enabling gene expression profiling in spatial contexts, it has been revealed that spatial niches serve as cohesive and recurrent units in physiological and pathological processes. These observations suggest universal tissue organization principles encoded by conserved niche patterns, and call for a query-based niche analytical paradigm beyond current computational tools. In this work, we defined the Niche Query Task, which is to identify similar niches across ST samples given a niche of interest (NOI). We further developed QueST, a specialized method for solving this task. QueST models each niche as a subgraph, uses contrastive learning to learn discriminative niche embeddings, and incorporates adversarial training to mitigate batch effects. In simulations and benchmark datasets, QueST outperformed existing methods repurposed for niche querying, accurately capturing niche structures in heterogeneous environments and demonstrating strong generalizability across diverse sequencing platforms. Applied to tertiary lymphoid structures in renal and lung cancers, QueST revealed functionally distinct niches associated with patient prognosis and uncovered conserved and divergent spatial architectures across cancer types. These results demonstrate that QueST enables systematic, quantitative profiling of spatial niches across samples, providing a powerful tool to dissect spatial tissue architecture in health and disease.
♻ ☆ A Framework for Objective-Driven Dynamical Stochastic Fields
Fields offer a versatile approach for describing complex systems composed of interacting and dynamic components. In particular, some of these dynamical and stochastic systems may exhibit goal-directed behaviors aimed at achieving specific objectives, which we refer to as $\textit{intelligent fields}$. However, due to their inherent complexity, it remains challenging to develop a formal theoretical description of such systems and to effectively translate these descriptions into practical applications. In this paper, we propose three fundamental principles to establish a theoretical framework for understanding intelligent fields: complete configuration, locality, and purposefulness. Moreover, we explore methodologies for designing such fields from the perspective of artificial intelligence applications. This initial investigation aims to lay the groundwork for future theoretical developments and practical advances in understanding and harnessing the potential of such objective-driven dynamical stochastic fields.
♻ ☆ MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
comment: Code available at: https://github.com/HKUST-MINSys-Lab/MMEdge. Accepted by SenSys 2026
♻ ☆ Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers
Recent advances in transformer architectures deeply enhanced long-context language modeling. Among them, HyperAttention achieves competitive efficiency by combining a single-level LSH-based clustering with uniform residual sampling. However, HyperAttention fails to find all significant keys, which in turn raises the overall perplexity. We propose a pre-scoring mechanism that prioritizes significant keys before applying HyperAttention. We introduce three scoring methods: $k$-means and kernel $k$-means clustering, $k$-median clustering, and leverage score-based ranking (inspired by LevAttention) to filter keys effectively. We further replace HyperAttention's original uniform residual sampling, relying exclusively on our pre-scoring mechanism. Experiments on ChatGLM2 (131k token context) reduce perplexity from 12 to 8.3, which outperforms standard HyperAttention. Moreover, when running on the Vision-Transformer (ViT), our method shows that it can guarantee similar accuracy compared with LevAttention, and will surpass LevAttention given specific parameters. Although this method introduces some computational overhead, its combination with HyperAttention achieves up to 20 times faster than FlashAttention, providing a balanced trade-off between speed and modeling accuracy. Our results highlight the effectiveness of integrating pre-scoring into hierarchical attention mechanisms, significantly improving transformer efficiency.
♻ ☆ Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
Reinforcement learning with verifiable rewards (RLVR) has recently enhanced the reasoning capabilities of large language models (LLMs), particularly for mathematical problem solving. However, a fundamental limitation remains: as the sampling budget increases, the advantage of RLVR-trained models over their pretrained bases often diminishes or even vanishes, revealing a strong dependence on the base model's restricted search space. We attribute this phenomenon to the widespread use of the reverse Kullback-Leibler (KL) divergence regularizer, whose mode-seeking behavior keeps the policy trapped inside the base model's support region and hampers wider exploration. To address this issue, we propose RAPO (Rewards-Aware Policy Optimization), an algorithm to promote broader yet focused exploration. Our method (i) utilizes the forward KL penalty to replace the reverse KL penalty for out-of-distribution exploration, and (ii) reweights the reference policy to facilitate adaptive in-distribution exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset, without supervised fine-tuning, and evaluate them on AIME2024 and AIME2025. Results show that RAPO consistently improves problem-solving performance. Notably, RAPO enables models to surpass the base model's performance ceiling and solves previously intractable problems, advancing the frontier of RLVR for challenging reasoning tasks.
♻ ☆ Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion
Masked diffusion models have shown promising performance in generating high-quality samples in a wide range of domains, but accelerating their sampling process remains relatively underexplored. To investigate efficient samplers for masked diffusion, this paper theoretically analyzes the MaskGIT sampler for image modeling, revealing its implicit temperature sampling mechanism. Through this analysis, we introduce the "moment sampler," an asymptotically equivalent but more tractable and interpretable alternative to MaskGIT, which employs a "choose-then-sample" approach by selecting unmasking positions before sampling tokens. In addition, we improve the efficiency of choose-then-sample algorithms through two key innovations: a partial caching technique for transformers that approximates longer sampling trajectories without proportional computational cost, and a hybrid approach formalizing the exploration-exploitation trade-off in adaptive unmasking. Experiments in image and text domains demonstrate our theory as well as the efficiency of our proposed methods, advancing both theoretical understanding and practical implementation of masked diffusion samplers.
comment: 23 pages, fixed cleveref-related issue
♻ ☆ Mixture-of-Experts Operator Transformer for Large-Scale PDE Pre-Training
Pre-training has proven effective in addressing data scarcity and performance limitations in solving PDE problems with neural operators. However, challenges remain due to the heterogeneity of PDE datasets in equation types, which leads to high errors in mixed training. Additionally, dense pre-training models that scale parameters by increasing network width or depth incur significant inference costs. To tackle these challenges, we propose a novel Mixture-of-Experts Pre-training Operator Transformer (MoE-POT), a sparse-activated architecture that scales parameters efficiently while controlling inference costs. Specifically, our model adopts a layer-wise router-gating network to dynamically select 4 routed experts from 16 expert networks during inference, enabling the model to focus on equation-specific features. Meanwhile, we also integrate 2 shared experts, aiming to capture common properties of PDE and reduce redundancy among routed experts. The final output is computed as the weighted average of the results from all activated experts. We pre-train models with parameters from 30M to 0.5B on 6 public PDE datasets. Our model with 90M activated parameters achieves up to a 40% reduction in zero-shot error compared with existing models with 120M activated parameters. Additionally, we conduct interpretability analysis, showing that dataset types can be inferred from router-gating network decisions, which validates the rationality and effectiveness of the MoE architecture.
♻ ☆ Deciphering Invariant Feature Decoupling in Source-free Time Series Forecasting with Proxy Denoising
The proliferation of mobile devices generates a massive volume of time series across various domains, where effective time series forecasting enables a variety of real-world applications. This study focuses on a new problem of source-free domain adaptation for time series forecasting. It aims to adapt a pretrained model from sufficient source time series to the sparse target time series domain without access to the source data, embracing data protection regulations. To achieve this, we propose TimePD, the first source-free time series forecasting framework with proxy denoising, where large language models (LLMs) are employed to benefit from their generalization capabilities. Specifically, TimePD consists of three key components: (1) dual-branch invariant disentangled feature learning that enforces representation- and gradient-wise invariance by means of season-trend decomposition; (2) lightweight, parameter-free proxy denoising that dynamically calibrates systematic biases of LLMs; and (3) knowledge distillation that bidirectionally aligns the denoised prediction and the original target prediction. Extensive experiments on real-world datasets offer insight into the effectiveness of the proposed TimePD, outperforming SOTA baselines by 9.3% on average.
♻ ☆ AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy NeurIPS 2025
Large Language Models (LLMs) are being explored for applications in scientific research, including their capabilities to synthesize literature, answer research questions, generate research ideas, and even conduct computational experiments. Ultimately, our goal is for these to help scientists derive novel scientific insights. In many areas of science, such insights often arise from processing and visualizing data to understand its patterns. However, evaluating whether an LLM-mediated scientific workflow produces outputs conveying the correct scientific insights is challenging to evaluate and has not been addressed in past work. We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. AstroVisBench judges a language model's ability to both (1) create astronomy-specific workflows to process and analyze data and (2) visualize the results of these workflows through complex plots. Our evaluation of visualizations uses a novel LLM-as-a-judge workflow, which is validated against annotation by five professional astronomers. Using AstroVisBench we present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants. This evaluation provides a strong end-to-end evaluation for AI scientists that offers a path forward for the development of visualization-based workflows, which are central to a broad range of domains from physics to biology.
comment: Accepted at NeurIPS 2025 Datasets & Benchmarks Track
♻ ☆ DrivAerStar: An Industrial-Grade CFD Dataset for Vehicle Aerodynamic Optimization
Vehicle aerodynamics optimization has become critical for automotive electrification, where drag reduction directly determines electric vehicle range and energy efficiency. Traditional approaches face an intractable trade-off: computationally expensive Computational Fluid Dynamics (CFD) simulations requiring weeks per design iteration, or simplified models that sacrifice production-grade accuracy. While machine learning offers transformative potential, existing datasets exhibit fundamental limitations -- inadequate mesh resolution, missing vehicle components, and validation errors exceeding 5% -- preventing deployment in industrial workflows. We present DrivAerStar, comprising 12,000 industrial-grade automotive CFD simulations generated using STAR-CCM+${}^\unicode{xAE}$ software. The dataset systematically explores three vehicle configurations through 20 Computer Aided Design (CAD) parameters via Free Form Deformation (FFD) algorithms, including complete engine compartments and cooling systems with realistic internal airflow. DrivAerStar achieves wind tunnel validation accuracy below 1.04% -- a five-fold improvement over existing datasets -- through refined mesh strategies with strict wall $y^+$ control. Benchmarks demonstrate that models trained on this data achieve production-ready accuracy while reducing computational costs from weeks to minutes. This represents the first dataset bridging academic machine learning research and industrial CFD practice, establishing a new standard for data-driven aerodynamic optimization in automotive development. Beyond automotive applications, DrivAerStar demonstrates a paradigm for integrating high-fidelity physics simulations with Artificial Intelligence (AI) across engineering disciplines where computational constraints currently limit innovation.
♻ ☆ Don't throw the baby out with the bathwater: How and why deep learning for ARC
The Abstraction and Reasoning Corpus (ARC-AGI) presents a formidable challenge for AI systems. Despite the typically low performance on ARC, the deep learning paradigm remains the most effective known strategy for generating skillful (state-of-the-art) neural networks (NN) across varied modalities and tasks in vision, language etc. The deep learning paradigm has proven to be able to train these skillful neural networks and learn the abstractions needed in these diverse domains. Our work doubles down on that and continues to leverage this paradigm by incorporating on-the-fly NN training at test time. We demonstrate that fully committing to deep learning's capacity to acquire novel abstractions yields state-of-the-art performance on ARC. Specifically, we treat both the neural network and the optimizer (rather than just a pre-trained network) as integral components of the inference process, fostering generalization to unseen tasks. Concretely, we propose a methodology for training on ARC, starting from pretrained LLMs, and enhancing their ARC reasoning. We also propose Test-Time Fine-Tuning (TTFT) and the Augment Inference Reverse-Augmentation and Vote (AIRV) as effective test-time techniques. We are the first to propose and show deep learning can be used effectively for ARC, showing boosts of up to 260% in accuracy with AIRV and a further 300% boost with TTFT. An early version of this approach secured first place in the 2023 ARCathon competition, while the final version achieved the current best score on the ARC private test-set (58%). Our findings highlight the key ingredients of a robust reasoning system in unfamiliar domains, underscoring the central mechanisms that improve broad perceptual reasoning.
comment: 13 pages, 6 figures
♻ ☆ Training a Generally Curious Agent ICML 2025
Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present Paprika, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, Paprika teaches models to explore and adapt their behavior on a new task based on environment feedback in-context without more gradient updates. Experimental results show that models fine-tuned with Paprika can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. Unlike traditional training, our approach's primary bottleneck lies in sampling useful interaction data instead of model updates. To improve sample efficiency, we propose a curriculum learning strategy that prioritizes sampling trajectories from tasks with high learning potential. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interactions with the external world.
comment: ICML 2025. Project Website: https://paprika-llm.github.io
♻ ☆ Scaling Diffusion Transformers Efficiently via $μ$P NeurIPS 2025
Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including U-ViT, DiT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-$\alpha$ and 3% of consumption by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.
comment: Accepted by NeurIPS 2025, 38 pages, 10 figures, 17 tables
♻ ☆ CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices NeurIPS 2025
Normalizing flows are deep generative models that enable efficient likelihood estimation and sampling through invertible transformations. A key challenge is to design linear layers that enhance expressiveness while maintaining efficient computation of the Jacobian determinant and inverse. We introduce a novel invertible linear layer based on the product of circulant and diagonal matrices. This decomposition reduces parameter complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$ using $m$ diagonal matrices and $m-1$ circulant matrices while still approximating general linear transformations. By leveraging the Fast Fourier Transform, our approach reduces the time complexity of matrix inversion from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn\log n)$ and that of computing the log-determinant from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn)$, where $n$ is the input dimension. We build upon this layer to develop Circulant-Diagonal Flow (CDFlow), which achieves strong density estimation on natural image datasets and effectively models data with inherent periodic structure. Furthermore, CDFlow significantly accelerates key operations in normalizing flows, providing practical benefits for scalable generative modeling.
comment: Accepted at NeurIPS 2025. Camera-ready version. 10 pages, 12 figures, 2 tables
♻ ☆ SVTime: Small Time Series Forecasting Models Informed by "Physics" of Large Vision Model Forecasters
Time series AI is crucial for analyzing dynamic web content, driving a surge of pre-trained large models known for their strong knowledge encoding and transfer capabilities across diverse tasks. However, given their energy-intensive training, inference, and hardware demands, using large models as a one-fits-all solution raises serious concerns about carbon footprint and sustainability. For a specific task, a compact yet specialized, high-performing model may be more practical and affordable, especially for resource-constrained users such as small businesses. This motivates the question: Can we build cost-effective lightweight models with large-model-like performance on core tasks such as forecasting? This paper addresses this question by introducing SVTime, a novel Small model inspired by large Vision model (LVM) forecasters for long-term Time series forecasting (LTSF). Recently, LVMs have been shown as powerful tools for LTSF. We identify a set of key inductive biases of LVM forecasters -- analogous to the "physics" governing their behaviors in LTSF -- and design small models that encode these biases through meticulously crafted linear layers and constraint functions. Across 21 baselines spanning lightweight, complex, and pre-trained large models on 8 benchmark datasets, SVTime outperforms state-of-the-art (SOTA) lightweight models and rivals large models with 10^3 fewer parameters than LVMs, while enabling efficient training and inference in low-resource settings.
♻ ☆ SmartMixed: A Two-Phase Training Strategy for Adaptive Activation Function Learning in Neural Networks
The choice of activation function plays a critical role in neural networks, yet most architectures still rely on fixed, uniform activation functions across all neurons. We introduce SmartMixed, a two-phase training strategy that allows networks to learn optimal per-neuron activation functions while preserving computational efficiency at inference. In the first phase, neurons adaptively select from a pool of candidate activation functions (ReLU, Sigmoid, Tanh, Leaky ReLU, ELU, SELU) using a differentiable hard-mixture mechanism. In the second phase, each neuron's activation function is fixed according to the learned selection, resulting in a computationally efficient network that supports continued training with optimized vectorized operations. We evaluate SmartMixed on the MNIST dataset using feedforward neural networks of varying depths. The analysis shows that neurons in different layers exhibit distinct preferences for activation functions, providing insights into the functional diversity within neural architectures.
♻ ☆ SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code is available at https://github.com/thu-ml/SageAttention.
♻ ☆ Benchmarking Ultra-Low-Power $μ$NPUs
Efficient on-device neural network (NN) inference offers predictable latency, improved privacy and reliability, and lower operating costs for vendors than cloud-based inference. This has sparked recent development of microcontroller-scale NN accelerators, also known as neural processing units ($\mu$NPUs), designed specifically for ultra-low-power applications. We present the first comparative evaluation of a number of commercially-available $\mu$NPUs, including the first independent benchmarks for multiple platforms. To ensure fairness, we develop and open-source a model compilation pipeline supporting consistent benchmarking of quantized models across diverse microcontroller hardware. Our resulting analysis uncovers both expected performance trends as well as surprising disparities between hardware specifications and actual performance, including certain $\mu$NPUs exhibiting unexpected scaling behaviors with model complexity. This work provides a foundation for ongoing evaluation of $\mu$NPU platforms, alongside offering practical insights for both hardware and software developers in this rapidly evolving space.
♻ ☆ On-device Computation of Single-lead ECG Parameters for Real-time Remote Cardiac Health Assessment: A Real-world Validation Study
Accurate, continuous out-of-hospital electrocardiogram (ECG) parameter measurement is vital for real-time cardiac health monitoring and telemedicine. On-device computation of single-lead ECG parameters enables timely assessment without reliance on centralized data processing, advancing personalized, ubiquitous cardiac care-yet comprehensive validation across heterogeneous real-world populations remains limited. This study validated the on-device algorithm FeatureDB (https://github.com/PKUDigitalHealth/FeatureDB) using two datasets: HeartVoice-ECG-lite (369 participants with single-lead ECGs annotated by two physicians) and PTB-XL/PTB-XL+ (21,354 patients with 12-lead ECGs and physicians' diagnostic annotations). FeatureDB computed PR, QT, and QTc intervals, with accuracy evaluated against physician annotations via mean absolute error (MAE), correlation analysis, and Bland-Altman analysis. Diagnostic performance for first-degree atrioventricular block (AVBI, PR-based) and long QT syndrome (LQT, QTc-based) was benchmarked against commercial 12-lead systems (12SL, Uni-G) and open-source algorithm Deli, using AUC, accuracy, sensitivity, and specificity. Results showed high concordance with expert annotations (Pearson correlations: 0.836-0.960), MAEs matching inter-observer variability, and minimal bias. AVBI AUC reached 0.787 (12SL: 0.859; Uni-G: 0.812; Deli: 0.501); LQT AUC was 0.684 (12SL: 0.716; Uni-G: 0.605; Deli: 0.569)-comparable to commercial tools and superior to open-source alternatives. FeatureDB delivers physician-level parameter accuracy and commercial-grade abnormality detection via single-lead devices, supporting scalable telemedicine, decentralized cardiac screening, and continuous monitoring in community and outpatient settings.
♻ ☆ A data free neural operator enabling fast inference of 2D and 3D Navier Stokes equations
Ensemble simulations of high-dimensional flow models (e.g., Navier Stokes type PDEs) are computationally prohibitive for real time applications. Neural operators enable fast inference but are limited by costly data requirements and poor generalization to 3D flows. We present a data-free operator network for the Navier Stokes equations that eliminates the need for paired solution data and enables robust, real time inference for large ensemble forecasting. The physics-grounded architecture takes initial and boundary conditions as well as forcing functions, yielding solutions robust to high variability and perturbations. Across 2D benchmarks and 3D test cases, the method surpasses prior neural operators in accuracy and, for ensembles, achieves greater efficiency than conventional numerical solvers. Notably, it delivers accurate solutions of the three dimensional Navier Stokes equations, a regime not previously demonstrated for data free neural operators. By uniting a numerically grounded architecture with the scalability of machine learning, this approach establishes a practical pathway toward data free, high fidelity PDE surrogates for end to end scientific simulation and prediction.
♻ ☆ Adversarially robust clustering with optimality guarantees
We consider the problem of clustering data points coming from sub-Gaussian mixtures. Existing methods that provably achieve the optimal mislabeling error, such as the Lloyd algorithm, are usually vulnerable to outliers. In contrast, clustering methods seemingly robust to adversarial perturbations are not known to satisfy the optimal statistical guarantees. We propose a simple robust algorithm based on the coordinatewise median that obtains the optimal mislabeling rate even when we allow adversarial outliers to be present. Our algorithm achieves the optimal error rate in constant iterations when a weak initialization condition is satisfied. In the absence of outliers, in fixed dimensions, our theoretical guarantees are similar to that of the Lloyd algorithm. Extensive experiments on various simulated and public datasets are conducted to support the theoretical guarantees of our method.
comment: 38 pages, 10 figures. Version accepted at the IEEE Transactions on Information Theory
♻ ☆ Transformers as Implicit State Estimators: In-Context Learning in Dynamical Systems
Predicting the behavior of a dynamical system from noisy observations of its past outputs is a classical problem encountered across engineering and science. For linear systems with Gaussian inputs, the Kalman filter -- the best linear minimum mean-square error estimator of the state trajectory -- is optimal in the Bayesian sense. For nonlinear systems, Bayesian filtering is typically approached using suboptimal heuristics such as the Extended Kalman Filter (EKF), or numerical methods such as particle filtering (PF). In this work, we show that transformers, employed in an in-context learning (ICL) setting, can implicitly infer hidden states in order to predict the outputs of a wide family of dynamical systems, without test-time gradient updates or explicit knowledge of the system model. Specifically, when provided with a short context of past input-output pairs and, optionally, system parameters, a frozen transformer accurately predicts the current output. In linear-Gaussian regimes, its predictions closely match those of the Kalman filter; in nonlinear regimes, its performance approaches that of EKF and PF. Moreover, prediction accuracy degrades gracefully when key parameters, such as the state-transition matrix, are withheld from the context, demonstrating robustness and implicit parameter inference. These findings suggest that transformer in-context learning provides a flexible, non-parametric alternative for output prediction in dynamical systems, grounded in implicit latent-state estimation.
♻ ☆ Action Chunking and Exploratory Data Collection Yield Exponential Improvements in Behavior Cloning for Continuous Control
This paper presents a theoretical analysis of two of the most impactful interventions in modern learning from demonstration in robotics and continuous control: the practice of action-chunking (predicting sequences of actions in open-loop) and exploratory augmentation of expert demonstrations. Though recent results show that learning from demonstration, also known as imitation learning (IL), can suffer errors that compound exponentially with task horizon in continuous settings, we demonstrate that action chunking and exploratory data collection circumvent exponential compounding errors in different regimes. Our results identify control-theoretic stability as the key mechanism underlying the benefits of these interventions. On the empirical side, we validate our predictions and the role of control-theoretic stability through experimentation on popular robot learning benchmarks. On the theoretical side, we demonstrate that the control-theoretic lens provides fine-grained insights into how compounding error arises, leading to tighter statistical guarantees on imitation learning error when these interventions are applied than previous techniques based on information-theoretic considerations alone.
comment: Updated manuscript. Added new experiments, figures, and exposition
♻ ☆ Multi-Modal View Enhanced Large Vision Models for Long-Term Time Series Forecasting
Time series, typically represented as numerical sequences, can also be transformed into images and texts, offering multi-modal views (MMVs) of the same underlying signal. These MMVs can reveal complementary patterns and enable the use of powerful pre-trained large models, such as large vision models (LVMs), for long-term time series forecasting (LTSF). However, as we identified in this work, the state-of-the-art (SOTA) LVM-based forecaster poses an inductive bias towards "forecasting periods". To harness this bias, we propose DMMV, a novel decomposition-based multi-modal view framework that leverages trend-seasonal decomposition and a novel backcast-residual based adaptive decomposition to integrate MMVs for LTSF. Comparative evaluations against 14 SOTA models across diverse datasets show that DMMV outperforms single-view and existing multi-modal baselines, achieving the best mean squared error (MSE) on 6 out of 8 benchmark datasets. The code for this paper is available at: https://github.com/D2I-Group/dmmv.
♻ ☆ Detecting Anomalies in Machine Learning Infrastructure via Hardware Telemetry
Modern machine learning (ML) has grown into a tightly coupled, full-stack ecosystem that combines hardware, software, network, and applications. Many users rely on cloud providers for elastic, isolated, and cost-efficient resources. Unfortunately, these platforms as a service use virtualization, which means operators have little insight into the users' workloads. This hinders resource optimizations by the operator, which is essential to ensure cost efficiency and minimize execution time. In this paper, we argue that workload knowledge is unnecessary for system-level optimization. We propose Reveal, which takes a hardware-centric approach, relying only on hardware signals - fully accessible by operators. Using low-level signals collected from the system, Reveal detects anomalies through an unsupervised learning pipeline. The pipeline is developed by analyzing over 30 popular ML models on various hardware platforms, ensuring adaptability to emerging workloads and unknown deployment patterns. Using Reveal, we successfully identified both network and system configuration issues, accelerating the DeepSeek model by 5.97%.
comment: 12 pages, 9 figures, submitted to nsdi 26
♻ ☆ Faithful and Fast Influence Function via Advanced Sampling
How can we explain the influence of training data on black-box models? Influence functions (IFs) offer a post-hoc solution by utilizing gradients and Hessians. However, computing the Hessian for an entire dataset is resource-intensive, necessitating a feasible alternative. A common approach involves randomly sampling a small subset of the training data, but this method often results in highly inconsistent IF estimates due to the high variance in sample configurations. To address this, we propose two advanced sampling techniques based on features and logits. These samplers select a small yet representative subset of the entire dataset by considering the stochastic distribution of features or logits, thereby enhancing the accuracy of IF estimations. We validate our approach through class removal experiments, a typical application of IFs, using the F1-score to measure how effectively the model forgets the removed class while maintaining inference consistency on the remaining classes. Our method reduces computation time by 30.1% and memory usage by 42.2%, or improves the F1-score by 2.5% compared to the baseline.
♻ ☆ Algorithmic Collusion of Pricing and Advertising on E-commerce Platforms
When online sellers use AI learning algorithms to automatically compete on e-commerce platforms, there is concern that they will learn to coordinate on higher than competitive prices. However, this concern was primarily raised in single-dimension price competition. We investigate whether this prediction holds when sellers make pricing and advertising decisions together, i.e., two-dimensional decisions. We analyze competition in multi-agent reinforcement learning, and use a large-scale dataset from Amazon.com to provide empirical evidence. We show that when consumers have high search costs, learning algorithms can coordinate on prices lower than competitive prices, facilitating a win-win-win for consumers, sellers, and platforms. This occurs because algorithms learn to coordinate on lower advertising bids, which lower advertising costs, leading to lower prices and enlarging demand on the platform. We also show that our results generalize to any learning algorithm that uses exploration of price and advertising bids. Consistent with our predictions, an empirical analysis shows that price levels exhibit a negative interaction between estimated consumer search costs and algorithm usage index. We analyze the platform's strategic response and find that reserve price adjustments will not increase platform profits, but commission adjustments will, while maintaining the beneficial outcomes for both sellers and consumers.
♻ ☆ Towards Automated Semantic Interpretability in Reinforcement Learning via Vision-Language Models
Semantic interpretability in Reinforcement Learning (RL) enables transparency and verifiability of decision-making. Achieving semantic interpretability in reinforcement learning requires (1) a feature space composed of human-understandable concepts and (2) a policy that is interpretable and verifiable. However, constructing such a feature space has traditionally relied on manual human specification, which often fails to generalize to unseen environments. Moreover, even when interpretable features are available, most reinforcement learning algorithms employ black-box models as policies, thereby hindering transparency. We introduce interpretable Tree-based Reinforcement learning via Automated Concept Extraction (iTRACE), an automated framework that leverages pre-trained vision-language models (VLM) for semantic feature extraction and train a interpretable tree-based model via RL. To address the impracticality of running VLMs in RL loops, we distill their outputs into a lightweight model. By leveraging Vision-Language Models (VLMs) to automate tree-based reinforcement learning, iTRACE loosens the reliance the need for human annotation that is traditionally required by interpretable models. In addition, it addresses key limitations of VLMs alone, such as their lack of grounding in action spaces and their inability to directly optimize policies. We evaluate iTRACE across three domains: Atari games, grid-world navigation, and driving. The results show that iTRACE outperforms other interpretable policy baselines and matches the performance of black-box policies on the same interpretable feature space.
♻ ☆ From Memorization to Reasoning in the Spectrum of Loss Curvature
We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on the loss landscape curvature. This insight is based on prior theoretical and empirical work showing that the curvature for memorized training points is much sharper than non memorized, meaning ordering weight components from high to low curvature can reveal a distinction without explicit labels. This motivates a weight editing procedure that suppresses far more recitation of untargeted memorized data more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the basis of curvature has a natural interpretation for shared structure in model weights, we analyze the editing procedure extensively on its effect on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently negatively affected, even though open book fact retrieval and general logical reasoning is conserved. We posit these tasks rely heavily on specialized directions in weight space rather than general purpose mechanisms, regardless of whether those individual datapoints are memorized. We support this by showing a correspondence between task data's activation strength with low curvature components that we edit out, and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly-used structures involved in solving tasks like math and fact retrieval.
♻ ☆ Swing-by Dynamics in Concept Learning and Compositional Generalization
Prior work has shown that text-conditioned diffusion models can learn to identify and manipulate primitive concepts underlying a compositional data-generating process, enabling generalization to entirely novel, out-of-distribution compositions. Beyond performance evaluations, these studies develop a rich empirical phenomenology of learning dynamics, showing that models generalize sequentially, respecting the compositional hierarchy of the data-generating process. Moreover, concept-centric structures within the data significantly influence a model's speed of learning the ability to manipulate a concept. In this paper, we aim to better characterize these empirical results from a theoretical standpoint. Specifically, we propose an abstraction of prior work's compositional generalization problem by introducing a structured identity mapping (SIM) task, where a model is trained to learn the identity mapping on a Gaussian mixture with structurally organized centroids. We mathematically analyze the learning dynamics of neural networks trained on this SIM task and show that, despite its simplicity, SIM's learning dynamics capture and help explain key empirical observations on compositional generalization with diffusion models identified in prior work. Our theory also offers several new insights -- e.g., we find a novel mechanism for non-monotonic learning dynamics of test loss in early phases of training. We validate our new predictions by training a text-conditioned diffusion model, bridging our simplified framework and complex generative models. Overall, this work establishes the SIM task as a meaningful theoretical abstraction of concept learning dynamics in modern generative models.
♻ ☆ RaanA: A Fast, Flexible, and Data-Efficient Post-Training Quantization Algorithm
Post-training Quantization (PTQ) has become a widely used technique for improving inference efficiency of large language models (LLMs). However, existing PTQ methods generally suffer from crucial limitations such as heavy calibration data requirements and inflexible choice of target number of bits. In this paper, we propose RaanA, a unified PTQ framework that overcomes these challenges by introducing two novel components: 1) RaBitQ-H, a variant of a randomized vector quantization method RaBitQ, designed for fast, accurate, and highly efficient quantization; and 2) AllocateBits, an algorithm that optimally allocates bit-widths across layers based on their quantization sensitivity. RaanA achieves competitive performance with state-of-the-art quantization methods while being extremely fast, requiring minimal calibration data, and enabling flexible bit allocation. Extensive experiments demonstrate RaanA's efficacy in balancing efficiency and accuracy. The code is publicly available at https://github.com/FFTYYY/RaanA .
♻ ☆ PPDiff: Diffusing in Hybrid Sequence-Structure Space for Protein-Protein Complex Design
Designing protein-binding proteins with high affinity is critical in biomedical research and biotechnology. Despite recent advancements targeting specific proteins, the ability to create high-affinity binders for arbitrary protein targets on demand, without extensive rounds of wet-lab testing, remains a significant challenge. Here, we introduce PPDiff, a diffusion model to jointly design the sequence and structure of binders for arbitrary protein targets in a non-autoregressive manner. PPDiffbuilds upon our developed Sequence Structure Interleaving Network with Causal attention layers (SSINC), which integrates interleaved self-attention layers to capture global amino acid correlations, k-nearest neighbor (kNN) equivariant graph layers to model local interactions in three-dimensional (3D) space, and causal attention layers to simplify the intricate interdependencies within the protein sequence. To assess PPDiff, we curate PPBench, a general protein-protein complex dataset comprising 706,360 complexes from the Protein Data Bank (PDB). The model is pretrained on PPBenchand finetuned on two real-world applications: target-protein mini-binder complex design and antigen-antibody complex design. PPDiffconsistently surpasses baseline methods, achieving success rates of 50.00%, 23.16%, and 16.89% for the pretraining task and the two downstream applications, respectively. The code, data and models are available at https://github.com/JocelynSong/PPDiff.
♻ ☆ SpikeFit: Towards Optimal Deployment of Spiking Networks on Neuromorphic Hardware
This paper introduces SpikeFit, a novel training method for Spiking Neural Networks (SNNs) that enables efficient inference on neuromorphic hardware, considering all its stringent requirements: the number of neurons and synapses that can fit on a single device, and lower bit-width representations (e.g., 4-bit, 8-bit). Unlike conventional compressing approaches that address only a subset of these requirements (limited numerical precision and limited number of neurons in the network), SpikeFit treats the allowed weights' discrete values themselves as learnable parameters co-optimized with the model, allowing for optimal Clusterization-Aware Training (CAT) of the model's weights at low precision (2-, 4-, or 8-bit) which results in higher network compression efficiency, as well as limiting the number of unique synaptic connections to a value required by neuromorphic processor. This joint optimization allows SpikeFit to find a discrete weight set aligned with hardware constraints, enabling the most complete deployment across a broader range of neuromorphic processors than existing methods of SNN compression support. Moreover, SpikeFit introduces a new hardware-friendly Fisher Spike Contribution (FSC) pruning method showing the state-of-the-art performance. We demonstrate that for spiking neural networks constrained to only four unique synaptic weight values (M = 4), our SpikeFit method not only outperforms state-of-the-art SNNs compression methods and conventional baselines combining extreme quantization schemes and clustering algorithms, but also meets a wider range of neuromorphic hardware requirements and provides the lowest energy use in experiments.
comment: Published at the European Conference on Neural Information Processing Systems (EurIPS 2025). 13 pages, 2 figures
♻ ☆ Context Tuning for In-Context Optimization ICML 2025
We introduce Context Tuning, a simple and effective method to significantly enhance few-shot adaptation of language models (LLMs) without fine-tuning model parameters. While prompt-based adaptation techniques have demonstrated the effectiveness of lightweight adaptation methods for LLMs, they typically initialize a trainable prompt or prefix with irrelevant tokens for the task at hand. In contrast, Context Tuning initializes the trainable prompt or prefix with task-specific demonstration examples, leveraging the model's inherent In-Context Learning (ICL) ability to extract relevant information for improved few-shot learning performance. Extensive evaluations on benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC demonstrate that Context Tuning outperforms traditional prompt-based adaptation methods and achieves competitive accuracy to Test-Time Training with significantly higher training efficiency.
comment: A short version of this paper was accepted at ICML 2025 Workshop on Test-Time Adaptation
♻ ☆ High Resolution Seismic Waveform Generation using Denoising Diffusion
Accurate prediction and synthesis of seismic waveforms are crucial for seismic-hazard assessment and earthquake-resistant infrastructure design. Existing prediction methods, such as ground-motion models and physics-based wave-field simulations, often fail to capture the full complexity of seismic wavefields, particularly at higher frequencies. This study introduces HighFEM, a novel, computationally efficient, and scalable (i.e., capable of generating many seismograms simultaneously) generative model for high-frequency seismic-waveform generation. Our approach leverages a spectrogram representation of the seismic-waveform data, which is reduced to a lower-dimensional manifold via an autoencoder. A state-of-the-art diffusion model is trained to generate this latent representation conditioned on key input parameters: earthquake magnitude, recording distance, site conditions, hypocenter depth, and azimuthal gap. The model generates waveforms with frequency content up to 50 Hz. Any scalar ground-motion statistic, such as peak ground-motion amplitudes and spectral accelerations, can be readily derived from the synthesized waveforms. We validate our model using commonly employed seismological metrics and performance metrics from image-generation studies. Our results demonstrate that the openly available model can generate realistic high-frequency seismic waveforms across a wide range of input parameters, even in data-sparse regions. For the scalar ground-motion statistics commonly used in seismic-hazard and earthquake-engineering studies, we show that our model accurately reproduces both the median trends of the real data and their variability. To evaluate and compare the growing number of these and similar Generative Waveform Models (GWMs), we argue that they should be openly available and included in community ground-motion-model evaluation efforts.
♻ ☆ LPAC: Learnable Perception-Action-Communication Loops with Applications to Coverage Control
Coverage control is the problem of navigating a robot swarm to collaboratively monitor features or a phenomenon of interest not known a priori. The problem is challenging in decentralized settings with robots that have limited communication and sensing capabilities. We propose a learnable Perception-Action-Communication (LPAC) architecture for the problem, wherein a convolutional neural network (CNN) processes localized perception; a graph neural network (GNN) facilitates robot communications; finally, a shallow multi-layer perceptron (MLP) computes robot actions. The GNN enables collaboration in the robot swarm by computing what information to communicate with nearby robots and how to incorporate received information. Evaluations show that the LPAC models -- trained using imitation learning -- outperform standard decentralized and centralized coverage control algorithms. The learned policy generalizes to environments different from the training dataset, transfers to larger environments with more robots, and is robust to noisy position estimates. The results indicate the suitability of LPAC architectures for decentralized navigation in robot swarms to achieve collaborative behavior.
comment: 20 Pages, 20 figures,
♻ ☆ LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments
This paper introduces an efficient Vision-Language Model (VLM) pipeline specifically optimized for deployment on embedded devices, such as those used in robotics and autonomous driving. The pipeline significantly reduces the computational overhead by jointly leveraging patch selection to filter irrelevant camera views, a token selection module to reduce input sequence length for the LLM, and speculative decoding to accelerate token generation. Evaluation on the NVIDIA DRIVE Thor platform for automonous driving application, our pipeline achieves $2.5\times$ end-to-end latency reduction without compromising task accuracy. The speed-up further increases to $3.2\times$ when applying FP8 post-training quantization. These results demonstrate our pipeline as a viable solution for enabling real-time VLM deployment in resource-constrained environments.
♻ ☆ Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy and Research NeurIPS 2025
"Machine unlearning" is a popular proposed solution for mitigating the existence of content in an AI model that is problematic for legal or moral reasons, including privacy, copyright, safety, and more. For example, unlearning is often invoked as a solution for removing the effects of specific information from a generative-AI model's parameters, e.g., a particular individual's personal data or the inclusion of copyrighted content in the model's training data. Unlearning is also proposed as a way to prevent a model from generating targeted types of information in its outputs, e.g., generations that closely resemble a particular individual's data or reflect the concept of "Spiderman." Both of these goals--the targeted removal of information from a model and the targeted suppression of information from a model's outputs--present various technical and substantive challenges. We provide a framework for ML researchers and policymakers to think rigorously about these challenges, identifying several mismatches between the goals of unlearning and feasible implementations. These mismatches explain why unlearning is not a general-purpose solution for circumscribing generative-AI model behavior in service of broader positive impact.
comment: NeurIPS 2025 (Oral)
♻ ☆ A Tale of Two Symmetries: Exploring the Loss Landscape of Equivariant Models
Equivariant neural networks have proven to be effective for tasks with known underlying symmetries. However, optimizing equivariant networks can be tricky and best training practices are less established than for standard networks. In particular, recent works have found small training benefits from relaxing equivariance constraints. This raises the question: do equivariance constraints introduce fundamental obstacles to optimization? Or do they simply require different hyperparameter tuning? In this work, we investigate this question through a theoretical analysis of the loss landscape geometry. We focus on networks built using permutation representations, which we can view as a subset of unconstrained MLPs. Importantly, we show that the parameter symmetries of the unconstrained model has nontrivial effects on the loss landscape of the equivariant subspace and under certain conditions can provably prevent learning of the global minima. Further, we empirically demonstrate in such cases, relaxing to an unconstrained MLP can sometimes solve the issue. Interestingly, the weights eventually found via relaxation corresponds to a different choice of group representation in the hidden layer. From this, we draw 3 key takeaways. (1) By viewing the unconstrained version of an architecture, we can uncover hidden parameter symmetries which were broken by choice of constraint enforcement (2) Hidden symmetries give important insights on loss landscapes and can induce critical points and even minima (3) Hidden symmetry induced minima can sometimes be escaped by constraint relaxation and we observe the network jumps to a different choice of constraint enforcement. Effective equivariance relaxation may require rethinking the fixed choice of group representation in the hidden layers.
♻ ☆ Integer-only Quantized Transformers for Embedded FPGA-based Time-series Forecasting in AIoT
This paper presents the design of a hardware accelerator for Transformers, optimized for on-device time-series forecasting in AIoT systems. It integrates integer-only quantization and Quantization-Aware Training with optimized hardware designs to realize 6-bit and 4-bit quantized Transformer models, which achieved precision comparable to 8-bit quantized models from related research. Utilizing a complete implementation on an embedded FPGA (Xilinx Spartan-7 XC7S15), we examine the feasibility of deploying Transformer models on embedded IoT devices. This includes a thorough analysis of achievable precision, resource utilization, timing, power, and energy consumption for on-device inference. Our results indicate that while sufficient performance can be attained, the optimization process is not trivial. For instance, reducing the quantization bitwidth does not consistently result in decreased latency or energy consumption, underscoring the necessity of systematically exploring various optimization combinations. Compared to an 8-bit quantized Transformer model in related studies, our 4-bit quantized Transformer model increases test loss by only 0.63%, operates up to 132.33x faster, and consumes 48.19x less energy. Relevant source code is provided in the accompanying GitHub repository\footnote{https://github.com/tianheng-ling/TinyTransformer4TS}.
comment: Accepted by 2024 IEEE Annual Congress on Artificial Intelligence of Things (IEEE AIoT) and got best paper award. 7 pages, 3 figures, 4 tables
♻ ☆ Lost in Latent Space: An Empirical Study of Latent Diffusion Models for Physics Emulation
The steep computational cost of diffusion models at inference hinders their use as fast physics emulators. In the context of image and video generation, this computational drawback has been addressed by generating in the latent space of an autoencoder instead of the pixel space. In this work, we investigate whether a similar strategy can be effectively applied to the emulation of dynamical systems and at what cost. We find that the accuracy of latent-space emulation is surprisingly robust to a wide range of compression rates (up to 1000x). We also show that diffusion-based emulators are consistently more accurate than non-generative counterparts and compensate for uncertainty in their predictions with greater diversity. Finally, we cover practical design choices, spanning from architectures to optimizers, that we found critical to train latent-space emulators.
♻ ☆ Learning Diffusion Priors from Observations by Expectation Maximization
Diffusion models recently proved to be remarkable priors for Bayesian inverse problems. However, training these models typically requires access to large amounts of clean data, which could prove difficult in some settings. In this work, we present DiEM, a novel method based on the expectation-maximization algorithm for training diffusion models from incomplete and noisy observations only. Unlike previous works, DiEM leads to proper diffusion models, which is crucial for downstream tasks. As part of our methods, we propose and motivate an improved posterior sampling scheme for unconditional diffusion models. We present empirical evidence supporting the effectiveness of our approach.
♻ ☆ Reconciling Communication Compression and Byzantine-Robustness in Distributed Learning
Distributed learning enables scalable model training over decentralized data, but remains hindered by Byzantine faults and high communication costs. While both challenges have been studied extensively in isolation, their interplay has received limited attention. Prior work has shown that naively combining communication compression with Byzantine-robust aggregation can severely weaken resilience to faulty nodes. The current state-of-the-art, Byz-DASHA-PAGE, leverages a momentum-based variance reduction scheme to counteract the negative effect of compression noise on Byzantine robustness. In this work, we introduce RoSDHB, a new algorithm that integrates classical Polyak momentum with a coordinated compression strategy. Theoretically, RoSDHB matches the convergence guarantees of Byz-DASHA-PAGE under the standard $(G,B)$-gradient dissimilarity model, while relying on milder assumptions and requiring less memory and communication per client. Empirically, RoSDHB demonstrates stronger robustness while achieving substantial communication savings compared to Byz-DASHA-PAGE.
comment: 32 pages
♻ ☆ Model Alignment Search
When can we say that two neural systems perform a task in the same way? What nuances do we miss when we fail to causally probe the representations of the systems, and how do we establish bidirectional causal relationships? In this work, we introduce a method that bidirectionally transfers neural activity between artificial neural networks and uses their resulting behavior as a measure of functional similarity. We first show that the method can be used to transfer the behavior from one frozen Neural Network (NN) to another in a manner similar to model stitching, and we show how the method can differ from correlative similarity measures like Representational Similarity Analysis. Next, we empirically and theoretically show how the method can be equivalent to model stitching when desired, or it can take a form that has a more restrictive focus to shared causal information; in both forms, it reduces the number of required matrices for a comparison of n models to be linear in n. We then present a case study on number-related tasks showing that the method can be used to examine specific subtypes of causal information demonstrating that numbers can be encoded differently in recurrent models depending on the task, and we present another case study showing that MAS can reveal misalignment in fine-tuned DeepSeek-r1-Qwen-1.5B models. Lastly, we augment the loss function with a counterfactual latent (CL) auxiliary objective to improve causal relevance when one of the two networks is causally inaccessible (as is often the case in comparisons with biological networks). We use our results to encourage the use of causal methods in neural similarity analyses and to suggest future explorations of network similarity methodology for model misalignment.
♻ ☆ Efficient Neural SDE Training using Wiener-Space Cubature
A neural stochastic differential equation (SDE) is an SDE with drift and diffusion terms parametrized by neural networks. The training procedure for neural SDEs consists of optimizing the SDE vector field (neural network) parameters to minimize the expected value of an objective functional on infinite-dimensional path-space. Existing training techniques focus on methods to efficiently compute path-wise gradients of the objective functional with respect to these parameters, then pair this with Monte-Carlo simulation to estimate the gradient expectation. In this work we introduce a novel training technique which bypasses and improves upon this Monte-Carlo simulation; we extend results in the theory of Wiener space cubature to approximate the expected objective functional value by a weighted sum of functional evaluations of deterministic ODE solutions. Our main mathematical contribution enabling this approximation is an extension of cubature bounds to the setting of Lipschitz-nonlinear functionals acting on path-space. Our resulting constructive algorithm allows for more computationally efficient training along several lines. First, it circumvents Brownian motion simulation and enables the use of efficient parallel ODE solvers, thus decreasing the complexity of path-functional evaluation. Furthermore, and more surprisingly, we show that the number of paths required to achieve a given (expected loss functional oracle value) approximation can be reduced in this deterministic cubature regime. Specifically, we show that under reasonable regularity assumptions we can observe a O(1/n) convergence rate, where n is the number of path evaluations; in contrast with the standard O(1/sqrt(n)) rate of naive Monte-Carlo or the O(log(n)^d /n) rate of quasi-Monte-Carlo.
♻ ☆ Evaluating Simplification Algorithms for Interpretability of Time Series Classification
In this work, we introduce metrics to evaluate the use of simplified time series in the context of interpretability of a TSC - a Time Series Classifier. Such simplifications are important because time series data, in contrast to text and image data, are not intuitively under- standable to humans. These metrics are related to the complexity of the simplifications - how many segments they contain - and to their loyalty - how likely they are to maintain the classification of the original time series. We focus on simplifications that select a subset of the original data points, and show that these typically have high Shapley value, thereby aiding interpretability. We employ these metrics to experimentally evaluate four distinct simplification algorithms, across several TSC algorithms and across datasets of varying characteristics, from seasonal or stationary to short or long. We subsequently perform a human-grounded evaluation with forward simulation, that confirms also the practical utility of the introduced metrics to evaluate the use of simplifications in the context of inter- pretability of TSC. Our findings are summarized in a framework for deciding, for a given TSC, if the various simplifications are likely to aid in its interpretability.
♻ ☆ Machine Learning Algorithms for Improving Exact Classical Solvers in Mixed Integer Continuous Optimization
Integer and mixed-integer nonlinear programming (INLP, MINLP) are central to logistics, energy, and scheduling, but remain computationally challenging. This survey examines how machine learning and reinforcement learning can enhance exact optimization methods-particularly branch-and-bound (BB)-without compromising global optimality. We cover discrete, continuous, and mixed-integer formulations, and highlight applications such as vehicle routing, hydropower planning, and crew scheduling. We introduce a unified BB framework that embeds learning-based strategies into branching, cut selection, node ordering, and parameter control. Classical algorithms are augmented using supervised, imitation, and reinforcement learning models to accelerate convergence while maintaining correctness. We conclude with a taxonomy of learning methods by solver class and learning paradigm, and outline open challenges in generalization, hybridization, and scaling intelligent solvers.
♻ ☆ ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning NeurIPS 2025
We propose ReinFlow, a simple yet effective online reinforcement learning (RL) framework that fine-tunes a family of flow matching policies for continuous robotic control. Derived from rigorous RL theory, ReinFlow injects learnable noise into a flow policy's deterministic path, converting the flow into a discrete-time Markov Process for exact and straightforward likelihood computation. This conversion facilitates exploration and ensures training stability, enabling ReinFlow to fine-tune diverse flow model variants, including Rectified Flow [35] and Shortcut Models [19], particularly at very few or even one denoising step. We benchmark ReinFlow in representative locomotion and manipulation tasks, including long-horizon planning with visual input and sparse reward. The episode reward of Rectified Flow policies obtained an average net growth of 135.36% after fine-tuning in challenging legged locomotion tasks while saving denoising steps and 82.63% of wall time compared to state-of-the-art diffusion RL fine-tuning method DPPO [43]. The success rate of the Shortcut Model policies in state and visual manipulation tasks achieved an average net increase of 40.34% after fine-tuning with ReinFlow at four or even one denoising step, whose performance is comparable to fine-tuned DDIM policies while saving computation time for an average of 23.20%. Project webpage: https://reinflow.github.io/
comment: NeurIPS 2025
♻ ☆ AI-Driven Detection and Analysis of Handwriting on Seized Ivory: A Tool to Uncover Criminal Networks in the Illicit Wildlife Trade
The transnational ivory trade continues to drive the decline of elephant populations across Africa, and trafficking networks remain difficult to disrupt. Tusks seized by law enforcement officials carry forensic information on the traffickers responsible for their export, including DNA evidence and handwritten markings made by traffickers. For 20 years, analyses of tusk DNA have identified where elephants were poached and established connections among shipments of ivory. While the links established using genetic evidence are extremely conclusive, genetic data is expensive and sometimes impossible to obtain. But though handwritten markings are easy to photograph, they are rarely documented or analyzed. Here, we present an AI-driven pipeline for extracting and analyzing handwritten markings on seized elephant tusks, offering a novel, scalable, and low-cost source of forensic evidence. Having collected 6,085 photographs from eight large seizures of ivory over a 6-year period (2014-2019), we used an object detection model to extract over 17,000 individual markings, which were then labeled and described using state-of-the-art AI tools. We identified 184 recurring "signature markings" that connect the tusks on which they appear. 20 signature markings were observed in multiple seizures, establishing forensic links between these seizures through traffickers involved in both shipments. This work complements other investigative techniques by filling in gaps where other data sources are unavailable. The study demonstrates the transformative potential of AI in wildlife forensics and highlights practical steps for integrating handwriting analysis into efforts to disrupt organized wildlife crime.
comment: Submitted. 5 pages, 3 figures, 1 table
♻ ☆ AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many realworld workloads such as classification, question answering, recommendation, and text embedding rely solely on the prefill stage of inference, where the model encodes input sequences without performing autoregressive decoding. In these prefill only scenarios, the self-attention computation becomes the primary performance bottleneck due to its quadratic complexity with respect to sequence length. In this paper, we observe that semantically different sentences often produce similar attention maps across layers and heads. Building on this insight, we propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. Based on an attention map memorization database, AttnCache employs efficient caching and similarity search techniques to identify and reuse pre-cached attention maps during inference, thereby reducing the computational overhead of self-attention. Experimental results show that AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.
comment: 10 pages, 6 figures, submitted to Ninth Annual Conference on Machine Learning and Systems (MLSys'26)
♻ ☆ BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation
The rapid progress of graph generation has raised new security concerns, particularly regarding backdoor vulnerabilities. While prior work has explored backdoor attacks in image diffusion and unconditional graph generation, conditional, especially text-guided graph generation remains largely unexamined. This paper proposes BadGraph, a backdoor attack method against latent diffusion models for text-guided graph generation. BadGraph leverages textual triggers to poison training data, covertly implanting backdoors that induce attacker-specified subgraphs during inference when triggers appear, while preserving normal performance on clean inputs. Extensive experiments on four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the effectiveness and stealth of the attack: less than 10% poisoning rate can achieves 50% attack success rate, while 24% suffices for over 80% success rate, with negligible performance degradation on benign samples. Ablation studies further reveal that the backdoor is implanted during VAE and diffusion training rather than pretraining. These findings reveal the security vulnerabilities in latent diffusion models of text-guided graph generation, highlight the serious risks in models' applications such as drug discovery and underscore the need for robust defenses against the backdoor attack in such diffusion models.
♻ ☆ Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences
Measuring alignment between language and vision is a fundamental challenge, especially as multimodal data becomes increasingly detailed and complex. Existing methods often rely on collecting human or AI preferences, which can be costly and time-intensive. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image and generated text, we map the text back to image space using a text-to-image model and compute the similarity between the original image and its reconstruction. Analogously, for text-to-image generation, we measure the textual similarity between an input caption and its reconstruction through the cycle. We use the cycle consistency score to rank candidates and construct a preference dataset of 866K comparison pairs. The reward model trained on our dataset, CycleReward, outperforms state-of-the-art alignment metrics on detailed captioning, with superior inference-time scalability when used as a verifier for Best-of-N sampling, while maintaining speed and differentiability. Furthermore, performing DPO and Diffusion DPO using our dataset enhances performance across a wide range of vision-language tasks and text-to-image generation. Our dataset, model, and code are publicly released at https://cyclereward.github.io.
♻ ☆ A Genealogy of Foundation Models in Remote Sensing
Foundation models have garnered increasing attention for representation learning in remote sensing. Many such foundation models adopt approaches that have demonstrated success in computer vision with minimal domain-specific modification. However, the development and application of foundation models in this field are still burgeoning, as there are a variety of competing approaches for how to most effectively leverage remotely sensed data. This paper examines these approaches, along with their roots in the computer vision field. This is done to characterize potential advantages and pitfalls, while outlining future directions to further improve remote sensing-specific foundation models. We discuss the quality of the learned representations and methods to alleviate the need for massive compute resources. We first examine single-sensor remote foundation models to introduce concepts and provide context, and then place emphasis on incorporating the multi-sensor aspect of Earth observations into foundation models. In particular, we explore the extent to which existing approaches leverage multiple sensors in training foundation models in relation to multi-modal foundation models. Finally, we identify opportunities for further harnessing the vast amounts of unlabeled, seasonal, and multi-sensor remote sensing observations.
comment: 28 pages, submitted to ACM SigSpatial, currently under peer review. This version is revised from the original 20 page manuscript
Graphics 3
☆ Deep Neural Watermarking for Robust Copyright Protection in 3D Point Clouds
The protection of intellectual property has become critical due to the rapid growth of three-dimensional content in digital media. Unlike traditional images or videos, 3D point clouds present unique challenges for copyright enforcement, as they are especially vulnerable to a range of geometric and non-geometric attacks that can easily degrade or remove conventional watermark signals. In this paper, we address these challenges by proposing a robust deep neural watermarking framework for 3D point cloud copyright protection and ownership verification. Our approach embeds binary watermarks into the singular values of 3D point cloud blocks using spectral decomposition, i.e. Singular Value Decomposition (SVD), and leverages the extraction capabilities of Deep Learning using PointNet++ neural network architecture. The network is trained to reliably extract watermarks even after the data undergoes various attacks such as rotation, scaling, noise, cropping and signal distortions. We validated our method using the publicly available ModelNet40 dataset, demonstrating that deep learning-based extraction significantly outperforms traditional SVD-based techniques under challenging conditions. Our experimental evaluation demonstrates that the deep learning-based extraction approach significantly outperforms existing SVD-based methods with deep learning achieving bitwise accuracy up to 0.83 and Intersection over Union (IoU) of 0.80, compared to SVD achieving a bitwise accuracy of 0.58 and IoU of 0.26 for the Crop (70%) attack, which is the most severe geometric distortion in our experiment. This demonstrates our method's ability to achieve superior watermark recovery and maintain high fidelity even under severe distortions.
☆ Object-Aware 4D Human Motion Generation
Recent advances in video diffusion models have enabled the generation of high-quality videos. However, these videos still suffer from unrealistic deformations, semantic violations, and physical inconsistencies that are largely rooted in the absence of 3D physical priors. To address these challenges, we propose an object-aware 4D human motion generation framework grounded in 3D Gaussian representations and motion diffusion priors. With pre-generated 3D humans and objects, our method, Motion Score Distilled Interaction (MSDI), employs the spatial and prompt semantic information in large language models (LLMs) and motion priors through the proposed Motion Diffusion Score Distillation Sampling (MSDS). The combination of MSDS and LLMs enables our spatial-aware motion optimization, which distills score gradients from pre-trained motion diffusion models, to refine human motion while respecting object and semantic constraints. Unlike prior methods requiring joint training on limited interaction datasets, our zero-shot approach avoids retraining and generalizes to out-of-distribution object aware human motions. Experiments demonstrate that our framework produces natural and physically plausible human motions that respect 3D spatial context, offering a scalable solution for realistic 4D generation.
♻ ☆ C3Editor: Achieving Controllable Consistency in 2D Model for 3D Editing ICCV 2025
Existing 2D-lifting-based 3D editing methods often encounter challenges related to inconsistency, stemming from the lack of view-consistent 2D editing models and the difficulty of ensuring consistent editing across multiple views. To address these issues, we propose C3Editor, a controllable and consistent 2D-lifting-based 3D editing framework. Given an original 3D representation and a text-based editing prompt, our method selectively establishes a view-consistent 2D editing model to achieve superior 3D editing results. The process begins with the controlled selection of a ground truth (GT) view and its corresponding edited image as the optimization target, allowing for user-defined manual edits. Next, we fine-tune the 2D editing model within the GT view and across multiple views to align with the GT-edited image while ensuring multi-view consistency. To meet the distinct requirements of GT view fitting and multi-view consistency, we introduce separate LoRA modules for targeted fine-tuning. Our approach delivers more consistent and controllable 2D and 3D editing results than existing 2D-lifting-based methods, outperforming them in both qualitative and quantitative evaluations.
comment: ICCV 2025 Workshop Wild3D
Robotics 57
☆ Running VLAs at Real-time Speed
In this paper, we show how to run pi0-level multi-view VLA at 30Hz frame rate and at most 480Hz trajectory frequency using a single consumer GPU. This enables dynamic and real-time tasks that were previously believed to be unattainable by large VLA models. To achieve it, we introduce a bag of strategies to eliminate the overheads in model inference. The real-world experiment shows that the pi0 policy with our strategy achieves a 100% success rate in grasping a falling pen task. Based on the results, we further propose a full streaming inference framework for real-time robot control of VLA. Code is available at https://github.com/Dexmal/realtime-vla.
comment: Code is available at https://github.com/Dexmal/realtime-vla
☆ Hybrid Consistency Policy: Decoupling Multi-Modal Diversity and Real-Time Efficiency in Robotic Manipulation
In visuomotor policy learning, diffusion-based imitation learning has become widely adopted for its ability to capture diverse behaviors. However, approaches built on ordinary and stochastic denoising processes struggle to jointly achieve fast sampling and strong multi-modality. To address these challenges, we propose the Hybrid Consistency Policy (HCP). HCP runs a short stochastic prefix up to an adaptive switch time, and then applies a one-step consistency jump to produce the final action. To align this one-jump generation, HCP performs time-varying consistency distillation that combines a trajectory-consistency objective to keep neighboring predictions coherent and a denoising-matching objective to improve local fidelity. In both simulation and on a real robot, HCP with 25 SDE steps plus one jump approaches the 80-step DDPM teacher in accuracy and mode coverage while significantly reducing latency. These results show that multi-modality does not require slow inference, and a switch time decouples mode retention from speed. It yields a practical accuracy efficiency trade-off for robot policies.
☆ Heuristic Adaptation of Potentially Misspecified Domain Support for Likelihood-Free Inference in Stochastic Dynamical Systems
In robotics, likelihood-free inference (LFI) can provide the domain distribution that adapts a learnt agent in a parametric set of deployment conditions. LFI assumes an arbitrary support for sampling, which remains constant as the initial generic prior is iteratively refined to more descriptive posteriors. However, a potentially misspecified support can lead to suboptimal, yet falsely certain, posteriors. To address this issue, we propose three heuristic LFI variants: EDGE, MODE, and CENTRE. Each interprets the posterior mode shift over inference steps in its own way and, when integrated into an LFI step, adapts the support alongside posterior inference. We first expose the support misspecification issue and evaluate our heuristics using stochastic dynamical benchmarks. We then evaluate the impact of heuristic support adaptation on parameter inference and policy learning for a dynamic deformable linear object (DLO) manipulation task. Inference results in a finer length and stiffness classification for a parametric set of DLOs. When the resulting posteriors are used as domain distributions for sim-based policy learning, they lead to more robust object-centric agent performance.
☆ Hybrid DQN-TD3 Reinforcement Learning for Autonomous Navigation in Dynamic Environments
This paper presents a hierarchical path-planning and control framework that combines a high-level Deep Q-Network (DQN) for discrete sub-goal selection with a low-level Twin Delayed Deep Deterministic Policy Gradient (TD3) controller for continuous actuation. The high-level module selects behaviors and sub-goals; the low-level module executes smooth velocity commands. We design a practical reward shaping scheme (direction, distance, obstacle avoidance, action smoothness, collision penalty, time penalty, and progress), together with a LiDAR-based safety gate that prevents unsafe motions. The system is implemented in ROS + Gazebo (TurtleBot3) and evaluated with PathBench metrics, including success rate, collision rate, path efficiency, and re-planning efficiency, in dynamic and partially observable environments. Experiments show improved success rate and sample efficiency over single-algorithm baselines (DQN or TD3 alone) and rule-based planners, with better generalization to unseen obstacle configurations and reduced abrupt control changes. Code and evaluation scripts are available at the project repository.
comment: 6 pages, 5 figures; ROS+Gazebo (TurtleBot3) implementation; evaluation with PathBench metrics; code (primary): https://github.com/MayaCHEN-github/HierarchicalRL-robot-navigation; mirror (for reproducibility): https://github.com/ShowyHe/DRL-robot-navigation
☆ REALMS2 - Resilient Exploration And Lunar Mapping System 2 - A Comprehensive Approach IROS 2025
The European Space Agency (ESA) and the European Space Resources Innovation Centre (ESRIC) created the Space Resources Challenge to invite researchers and companies to propose innovative solutions for Multi-Robot Systems (MRS) space prospection. This paper proposes the Resilient Exploration And Lunar Mapping System 2 (REALMS2), a MRS framework for planetary prospection and mapping. Based on Robot Operating System version 2 (ROS 2) and enhanced with Visual Simultaneous Localisation And Mapping (vSLAM) for map generation, REALMS2 uses a mesh network for a robust ad hoc network. A single graphical user interface (GUI) controls all the rovers, providing a simple overview of the robotic mission. This system is designed for heterogeneous multi-robot exploratory missions, tackling the challenges presented by extraterrestrial environments. REALMS2 was used during the second field test of the ESA-ESRIC Challenge and allowed to map around 60% of the area, using three homogeneous rovers while handling communication delays and blackouts.
comment: 8 Pages, 8 Figures, Submitted and Accepted to IROS 2025
☆ A Sliding-Window Filter for Online Continuous-Time Continuum Robot State Estimation
Stochastic state estimation methods for continuum robots (CRs) often struggle to balance accuracy and computational efficiency. While several recent works have explored sliding-window formulations for CRs, these methods are limited to simplified, discrete-time approximations and do not provide stochastic representations. In contrast, current stochastic filter methods must run at the speed of measurements, limiting their full potential. Recent works in continuous-time estimation techniques for CRs show a principled approach to addressing this runtime constraint, but are currently restricted to offline operation. In this work, we present a sliding-window filter (SWF) for continuous-time state estimation of CRs that improves upon the accuracy of a filter approach while enabling continuous-time methods to operate online, all while running at faster-than-real-time speeds. This represents the first stochastic SWF specifically designed for CRs, providing a promising direction for future research in this area.
comment: 8 pages, 6 figures. Submitted to IEEE-RAS International Conference on Soft Robotics 2026
☆ Spiking Patches: Asynchronous, Sparse, and Efficient Tokens for Event Cameras
We propose tokenization of events and present a tokenizer, Spiking Patches, specifically designed for event cameras. Given a stream of asynchronous and spatially sparse events, our goal is to discover an event representation that preserves these properties. Prior works have represented events as frames or as voxels. However, while these representations yield high accuracy, both frames and voxels are synchronous and decrease the spatial sparsity. Spiking Patches gives the means to preserve the unique properties of event cameras and we show in our experiments that this comes without sacrificing accuracy. We evaluate our tokenizer using a GNN, PCN, and a Transformer on gesture recognition and object detection. Tokens from Spiking Patches yield inference times that are up to 3.4x faster than voxel-based tokens and up to 10.4x faster than frames. We achieve this while matching their accuracy and even surpassing in some cases with absolute improvements up to 3.8 for gesture recognition and up to 1.4 for object detection. Thus, tokenization constitutes a novel direction in event-based vision and marks a step towards methods that preserve the properties of event cameras.
☆ FLYINGTRUST: A Benchmark for Quadrotor Navigation Across Scenarios and Vehicles
Visual navigation algorithms for quadrotors often exhibit a large variation in performance when transferred across different vehicle platforms and scene geometries, which increases the cost and risk of field deployment. To support systematic early-stage evaluation, we introduce FLYINGTRUST, a high-fidelity, configurable benchmarking framework that measures how platform kinodynamics and scenario structure jointly affect navigation robustness. FLYINGTRUST models vehicle capability with two compact, physically interpretable indicators: maximum thrust-to-weight ratio and axis-wise maximum angular acceleration. The benchmark pairs a diverse scenario library with a heterogeneous set of real and virtual platforms and prescribes a standardized evaluation protocol together with a composite scoring method that balances scenario importance, platform importance and performance stability. We use FLYINGTRUST to compare representative optimization-based and learning-based navigation approaches under identical conditions, performing repeated trials per platform-scenario combination and reporting uncertainty-aware metrics. The results reveal systematic patterns: navigation success depends predictably on platform capability and scene geometry, and different algorithms exhibit distinct preferences and failure modes across the evaluated conditions. These observations highlight the practical necessity of incorporating both platform capability and scenario structure into algorithm design, evaluation, and selection, and they motivate future work on methods that remain robust across diverse platforms and scenarios.
☆ Proxemics and Permeability of the Pedestrian Group
People tend to walk in groups, and interactions with those groups have a significant impact on crowd behavior and pedestrian traffic dynamics. Social norms can be seen as unwritten rules regulating people interactions in social settings. This article studies people interactions with groups and the emergence of group proxemics. Group zones, zone occupancy counts and people clearance from the group are studied using naturalistic data. Analysis indicate potential presence of three different zones in addition to the public zone. People tend to remain in the public zone and only progressively get closer to groups, and those closer approaches happen in a low frequency and for brief periods of time.
☆ Adaptive Inverse Kinematics Framework for Learning Variable-Length Tool Manipulation in Robotics
Conventional robots possess a limited understanding of their kinematics and are confined to preprogrammed tasks, hindering their ability to leverage tools efficiently. Driven by the essential components of tool usage - grasping the desired outcome, selecting the most suitable tool, determining optimal tool orientation, and executing precise manipulations - we introduce a pioneering framework. Our novel approach expands the capabilities of the robot's inverse kinematics solver, empowering it to acquire a sequential repertoire of actions using tools of varying lengths. By integrating a simulation-learned action trajectory with the tool, we showcase the practicality of transferring acquired skills from simulation to real-world scenarios through comprehensive experimentation. Remarkably, our extended inverse kinematics solver demonstrates an impressive error rate of less than 1 cm. Furthermore, our trained policy achieves a mean error of 8 cm in simulation. Noteworthy, our model achieves virtually indistinguishable performance when employing two distinct tools of different lengths. This research provides an indication of potential advances in the exploration of all four fundamental aspects of tool usage, enabling robots to master the intricate art of tool manipulation across diverse tasks.
comment: 10 pages, 5 figures. Demonstrates a reinforcement learning framework for adaptive tool manipulation with variable-length extensions
☆ RoboOS-NeXT: A Unified Memory-based Framework for Lifelong, Scalable, and Robust Multi-Robot Collaboration
The proliferation of collaborative robots across diverse tasks and embodiments presents a central challenge: achieving lifelong adaptability, scalable coordination, and robust scheduling in multi-agent systems. Existing approaches, from vision-language-action (VLA) models to hierarchical frameworks, fall short due to their reliance on limited or dividual-agent memory. This fundamentally constrains their ability to learn over long horizons, scale to heterogeneous teams, or recover from failures, highlighting the need for a unified memory representation. To address these limitations, we introduce RoboOS-NeXT, a unified memory-based framework for lifelong, scalable, and robust multi-robot collaboration. At the core of RoboOS-NeXT is the novel Spatio-Temporal-Embodiment Memory (STEM), which integrates spatial scene geometry, temporal event history, and embodiment profiles into a shared representation. This memory-centric design is integrated into a brain-cerebellum framework, where a high-level brain model performs global planning by retrieving and updating STEM, while low-level controllers execute actions locally. This closed loop between cognition, memory, and execution enables dynamic task allocation, fault-tolerant collaboration, and consistent state synchronization. We conduct extensive experiments spanning complex coordination tasks in restaurants, supermarkets, and households. Our results demonstrate that RoboOS-NeXT achieves superior performance across heterogeneous embodiments, validating its effectiveness in enabling lifelong, scalable, and robust multi-robot collaboration. Project website: https://flagopen.github.io/RoboOS/
☆ Efficient Collision-Avoidance Constraints for Ellipsoidal Obstacles in Optimal Control: Application to Path-Following MPC and UAVs
This article proposes a modular optimal control framework for local three-dimensional ellipsoidal obstacle avoidance, exemplarily applied to model predictive path-following control. Static as well as moving obstacles are considered. Central to the approach is a computationally efficient and continuously differentiable condition for detecting collisions with ellipsoidal obstacles. A novel two-stage optimization approach mitigates numerical issues arising from the structure of the resulting optimal control problem. The effectiveness of the approach is demonstrated through simulations and real-world experiments with the Crazyflie quadrotor. This represents the first hardware demonstration of an MPC controller of this kind for UAVs in a three-dimensional task.
☆ Human-in-the-loop Online Rejection Sampling for Robotic Manipulation
Reinforcement learning (RL) is widely used to produce robust robotic manipulation policies, but fine-tuning vision-language-action (VLA) models with RL can be unstable due to inaccurate value estimates and sparse supervision at intermediate steps. In contrast, imitation learning (IL) is easy to train but often underperforms due to its offline nature. In this paper, we propose Hi-ORS, a simple yet effective post-training method that utilizes rejection sampling to achieve both training stability and high robustness. Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning, and adopts a reward-weighted supervised training objective to provide dense intermediate-step supervision. For systematic study, we develop an asynchronous inference-training framework that supports flexible online human-in-the-loop corrections, which serve as explicit guidance for learning error-recovery behaviors. Across three real-world tasks and two embodiments, Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training, outperforming RL and IL baselines by a substantial margin in both effectiveness and efficiency. Notably, the fine-tuned policy exhibits strong test-time scalability by reliably executing complex error-recovery behaviors to achieve better performance.
comment: 8 pages
☆ CorVS: Person Identification via Video Trajectory-Sensor Correspondence in a Real-World Warehouse
Worker location data is key to higher productivity in industrial sites. Cameras are a promising tool for localization in logistics warehouses since they also offer valuable environmental contexts such as package status. However, identifying individuals with only visual data is often impractical. Accordingly, several prior studies identified people in videos by comparing their trajectories and wearable sensor measurements. While this approach has advantages such as independence from appearance, the existing methods may break down under real-world conditions. To overcome this challenge, we propose CorVS, a novel data-driven person identification method based on correspondence between visual tracking trajectories and sensor measurements. Firstly, our deep learning model predicts correspondence probabilities and reliabilities for every pair of a trajectory and sensor measurements. Secondly, our algorithm matches the trajectories and sensor measurements over time using the predicted probabilities and reliabilities. We developed a dataset with actual warehouse operations and demonstrated the method's effectiveness for real-world applications.
comment: 7 pages, 3 figures, accepted to IPIN 2025
☆ Towards Reinforcement Learning Based Log Loading Automation
Forestry forwarders play a central role in mechanized timber harvesting by picking up and moving logs from the felling site to a processing area or a secondary transport vehicle. Forwarder operation is challenging and physically and mentally exhausting for the operator who must control the machine in remote areas for prolonged periods of time. Therefore, even partial automation of the process may reduce stress on the operator. This study focuses on continuing previous research efforts in application of reinforcement learning agents in automating log handling process, extending the task from grasping which was studied in previous research to full log loading operation. The resulting agent will be capable to automate a full loading procedure from locating and grappling to transporting and delivering the log to a forestry forwarder bed. To train the agent, a trailer type forestry forwarder simulation model in NVIDIA's Isaac Gym and a virtual environment for a typical log loading scenario were developed. With reinforcement learning agents and a curriculum learning approach, the trained agent may be a stepping stone towards application of reinforcement learning agents in automation of the forestry forwarder. The agent learnt grasping a log in a random position from grapple's random position and transport it to the bed with 94% success rate of the best performing agent.
☆ Cooperative Task Spaces for Multi-Arm Manipulation Control based on Similarity Transformations
Many tasks in human environments require collaborative behavior between multiple kinematic chains, either to provide additional support for carrying big and bulky objects or to enable the dexterity that is required for in-hand manipulation. Since these complex systems often have a very high number of degrees of freedom coordinating their movements is notoriously difficult to model. In this article, we present the derivation of the theoretical foundations for cooperative task spaces of multi-arm robotic systems based on geometric primitives defined using conformal geometric algebra. Based on the similarity transformations of these cooperative geometric primitives, we derive an abstraction of complex robotic systems that enables representing these systems in a way that directly corresponds to single-arm systems. By deriving the associated analytic and geometric Jacobian matrices, we then show the straightforward integration of our approach into classical control techniques rooted in operational space control. We demonstrate this using bimanual manipulators, humanoids and multi-fingered hands in optimal control experiments for reaching desired geometric primitives and in teleoperation experiments using differential kinematics control. We then discuss how the geometric primitives naturally embed nullspace structures into the controllers that can be exploited for introducing secondary control objectives. This work, represents the theoretical foundations of this cooperative manipulation control framework, and thus the experiments are presented in an abstract way, while giving pointers towards potential future applications.
☆ AgriGS-SLAM: Orchard Mapping Across Seasons via Multi-View Gaussian Splatting SLAM
Autonomous robots in orchards require real-time 3D scene understanding despite repetitive row geometry, seasonal appearance changes, and wind-driven foliage motion. We present AgriGS-SLAM, a Visual--LiDAR SLAM framework that couples direct LiDAR odometry and loop closures with multi-camera 3D Gaussian Splatting (3DGS) rendering. Batch rasterization across complementary viewpoints recovers orchard structure under occlusions, while a unified gradient-driven map lifecycle executed between keyframes preserves fine details and bounds memory. Pose refinement is guided by a probabilistic LiDAR-based depth consistency term, back-propagated through the camera projection to tighten geometry-appearance coupling. We deploy the system on a field platform in apple and pear orchards across dormancy, flowering, and harvesting, using a standardized trajectory protocol that evaluates both training-view and novel-view synthesis to reduce 3DGS overfitting in evaluation. Across seasons and sites, AgriGS-SLAM delivers sharper, more stable reconstructions and steadier trajectories than recent state-of-the-art 3DGS-SLAM baselines while maintaining real-time performance on-tractor. While demonstrated in orchard monitoring, the approach can be applied to other outdoor domains requiring robust multimodal perception.
☆ Thor: Towards Human-Level Whole-Body Reactions for Intense Contact-Rich Environments
Humanoids hold great potential for service, industrial, and rescue applications, in which robots must sustain whole-body stability while performing intense, contact-rich interactions with the environment. However, enabling humanoids to generate human-like, adaptive responses under such conditions remains a major challenge. To address this, we propose Thor, a humanoid framework for human-level whole-body reactions in contact-rich environments. Based on the robot's force analysis, we design a force-adaptive torso-tilt (FAT2) reward function to encourage humanoids to exhibit human-like responses during force-interaction tasks. To mitigate the high-dimensional challenges of humanoid control, Thor introduces a reinforcement learning architecture that decouples the upper body, waist, and lower body. Each component shares global observations of the whole body and jointly updates its parameters. Finally, we deploy Thor on the Unitree G1, and it substantially outperforms baselines in force-interaction tasks. Specifically, the robot achieves a peak pulling force of 167.7 N (approximately 48% of the G1's body weight) when moving backward and 145.5 N when moving forward, representing improvements of 68.9% and 74.7%, respectively, compared with the best-performing baseline. Moreover, Thor is capable of pulling a loaded rack (130 N) and opening a fire door with one hand (60 N). These results highlight Thor's effectiveness in enhancing humanoid force-interaction capabilities.
☆ PHUMA: Physically-Grounded Humanoid Locomotion Dataset
Motion imitation is a promising approach for humanoid locomotion, enabling agents to acquire humanlike behaviors. Existing methods typically rely on high-quality motion capture datasets such as AMASS, but these are scarce and expensive, limiting scalability and diversity. Recent studies attempt to scale data collection by converting large-scale internet videos, exemplified by Humanoid-X. However, they often introduce physical artifacts such as floating, penetration, and foot skating, which hinder stable imitation. In response, we introduce PHUMA, a Physically-grounded HUMAnoid locomotion dataset that leverages human video at scale, while addressing physical artifacts through careful data curation and physics-constrained retargeting. PHUMA enforces joint limits, ensures ground contact, and eliminates foot skating, producing motions that are both large-scale and physically reliable. We evaluated PHUMA in two sets of conditions: (i) imitation of unseen motion from self-recorded test videos and (ii) path following with pelvis-only guidance. In both cases, PHUMA-trained policies outperform Humanoid-X and AMASS, achieving significant gains in imitating diverse motions. The code is available at https://davian-robotics.github.io/PHUMA.
☆ Self-localization on a 3D map by fusing global and local features from a monocular camera
Self-localization on a 3D map by using an inexpensive monocular camera is required to realize autonomous driving. Self-localization based on a camera often uses a convolutional neural network (CNN) that can extract local features that are calculated by nearby pixels. However, when dynamic obstacles, such as people, are present, CNN does not work well. This study proposes a new method combining CNN with Vision Transformer, which excels at extracting global features that show the relationship of patches on whole image. Experimental results showed that, compared to the state-of-the-art method (SOTA), the accuracy improvement rate in a CG dataset with dynamic obstacles is 1.5 times higher than that without dynamic obstacles. Moreover, the self-localization error of our method is 20.1% smaller than that of SOTA on public datasets. Additionally, our robot using our method can localize itself with 7.51cm error on average, which is more accurate than SOTA.
☆ Adaptive Trajectory Refinement for Optimization-based Local Planning in Narrow Passages
Trajectory planning for mobile robots in cluttered environments remains a major challenge due to narrow passages, where conventional methods often fail or generate suboptimal paths. To address this issue, we propose the adaptive trajectory refinement algorithm, which consists of two main stages. First, to ensure safety at the path-segment level, a segment-wise conservative collision test is applied, where risk-prone trajectory path segments are recursively subdivided until collision risks are eliminated. Second, to guarantee pose-level safety, pose correction based on penetration direction and line search is applied, ensuring that each pose in the trajectory is collision-free and maximally clear from obstacles. Simulation results demonstrate that the proposed method achieves up to 1.69x higher success rates and up to 3.79x faster planning times than state-of-the-art approaches. Furthermore, real-world experiments confirm that the robot can safely pass through narrow passages while maintaining rapid planning performance.
☆ Kinodynamic Task and Motion Planning using VLM-guided and Interleaved Sampling
Task and Motion Planning (TAMP) integrates high-level task planning with low-level motion feasibility, but existing methods are costly in long-horizon problems due to excessive motion sampling. While LLMs provide commonsense priors, they lack 3D spatial reasoning and cannot ensure geometric or dynamic feasibility. We propose a kinodynamic TAMP framework based on a hybrid state tree that uniformly represents symbolic and numeric states during planning, enabling task and motion decisions to be jointly decided. Kinodynamic constraints embedded in the TAMP problem are verified by an off-the-shelf motion planner and physics simulator, and a VLM guides exploring a TAMP solution and backtracks the search based on visual rendering of the states. Experiments on the simulated domains and in the real world show 32.14% - 1166.67% increased average success rates compared to traditional and LLM-based TAMP planners and reduced planning time on complex problems, with ablations further highlighting the benefits of VLM guidance.
☆ Embodied Intelligence for Advanced Bioinspired Microrobotics: Examples and Insights
The term embodied intelligence (EI) conveys the notion that body morphology, material properties, interaction with the environment, and control strategies can be purposefully integrated into the process of robotic design to generate intelligent behavior; in particular, locomotion and navigation. In this paper, we discuss EI as a design principle for advanced microrobotics, with a particular focus on co-design -- the simultaneous and interdependent development of physical structure and behavioral function. To illustrate the contrast between EI-inspired systems and traditional architectures that decouple sensing, computation, and actuation, we present and discuss a collection of robots developed by the author and his team at the Autonomous Microrobotic Systems Laboratory (AMSL). These robots exhibit intelligent behavior that emerges from their structural dynamics and the physical interaction between their components and with the environment. Platforms such as the Bee++, RoBeetle, SMALLBug, SMARTI, WaterStrider, VLEIBot+, and FRISSHBot exemplify how feedback loops, decision logics, sensing mechanisms, and smart actuation strategies can be embedded into the physical properties of the robotic system itself. Along these lines, we contend that co-design is not only a method for empirical optimization under constraints, but also an enabler of EI, offering a scalable and robust alternative to classical control for robotics at the mm-to-cm-scale.
comment: 8 pages, 7 figures, accepted to ICAR 2025
☆ Exploring Object-Aware Attention Guided Frame Association for RGB-D SLAM
Attention models have recently emerged as a powerful approach, demonstrating significant progress in various fields. Visualization techniques, such as class activation mapping, provide visual insights into the reasoning of convolutional neural networks (CNNs). Using network gradients, it is possible to identify regions where the network pays attention during image recognition tasks. Furthermore, these gradients can be combined with CNN features to localize more generalizable, task-specific attentive (salient) regions within scenes. However, explicit use of this gradient-based attention information integrated directly into CNN representations for semantic object understanding remains limited. Such integration is particularly beneficial for visual tasks like simultaneous localization and mapping (SLAM), where CNN representations enriched with spatially attentive object locations can enhance performance. In this work, we propose utilizing task-specific network attention for RGB-D indoor SLAM. Specifically, we integrate layer-wise attention information derived from network gradients with CNN feature representations to improve frame association performance. Experimental results indicate improved performance compared to baseline methods, particularly for large environments.
comment: double-column 5 pages, 3 figures
☆ Beyond the Uncanny Valley: A Mixed-Method Investigation of Anthropomorphism in Protective Responses to Robot Abuse
Robots with anthropomorphic features are increasingly shaping how humans perceive and morally engage with them. Our research investigates how different levels of anthropomorphism influence protective responses to robot abuse, extending the Computers as Social Actors (CASA) and uncanny valley theories into a moral domain. In an experiment, we invite 201 participants to view videos depicting abuse toward a robot with low (Spider), moderate (Two-Foot), or high (Humanoid) anthropomorphism. To provide a comprehensive analysis, we triangulate three modalities: self-report surveys measuring emotions and uncanniness, physiological data from automated facial expression analysis, and qualitative reflections. Findings indicate that protective responses are not linear. The moderately anthropomorphic Two-Foot robot, rated highest in eeriness and "spine-tingling" sensations consistent with the uncanny valley, elicited the strongest physiological anger expressions. Self-reported anger and guilt are significantly higher for both the Two-Foot and Humanoid robots compared to the Spider. Qualitative findings further reveal that as anthropomorphism increases, moral reasoning shifts from technical assessments of property damage to condemnation of the abuser's character, while governance proposals expand from property law to calls for quasi-animal rights and broader societal responsibility. These results suggest that the uncanny valley does not dampen moral concern but paradoxically heightens protective impulses, offering critical implications for robot design, policy, and future legal frameworks.
☆ I don't Want You to Die: A Shared Responsibility Framework for Safeguarding Child-Robot Companionship
Social robots like Moxie are designed to form strong emotional bonds with children, but their abrupt discontinuation can cause significant struggles and distress to children. When these services end, the resulting harm raises complex questions of who bears responsibility when children's emotional bonds are broken. Using the Moxie shutdown as a case study through a qualitative survey of 72 U.S. participants, our findings show that the responsibility is viewed as a shared duty across the robot company, parents, developers, and government. However, these attributions varied by political ideology and parental status of whether they have children. Participants' perceptions of whether the robot service should continue are highly polarized; supporters propose technical, financial, and governmental pathways for continuity, while opponents cite business realities and risks of unhealthy emotional dependency. Ultimately, this research contributes an empirically grounded shared responsibility framework for safeguarding child-robot companionship by detailing how accountability is distributed and contested, informing concrete design and policy implications to mitigate the emotional harm of robot discontinuation.
☆ Morphology-Aware Graph Reinforcement Learning for Tensegrity Robot Locomotion
Tensegrity robots combine rigid rods and elastic cables, offering high resilience and deployability but posing major challenges for locomotion control due to their underactuated and highly coupled dynamics. This paper introduces a morphology-aware reinforcement learning framework that integrates a graph neural network (GNN) into the Soft Actor-Critic (SAC) algorithm. By representing the robot's physical topology as a graph, the proposed GNN-based policy captures coupling among components, enabling faster and more stable learning than conventional multilayer perceptron (MLP) policies. The method is validated on a physical 3-bar tensegrity robot across three locomotion primitives, including straight-line tracking and bidirectional turning. It shows superior sample efficiency, robustness to noise and stiffness variations, and improved trajectory accuracy. Notably, the learned policies transfer directly from simulation to hardware without fine-tuning, achieving stable real-world locomotion. These results demonstrate the advantages of incorporating structural priors into reinforcement learning for tensegrity robot control.
☆ Accelerating Real-World Overtaking in F1TENTH Racing Employing Reinforcement Learning Methods
While autonomous racing performance in Time-Trial scenarios has seen significant progress and development, autonomous wheel-to-wheel racing and overtaking are still severely limited. These limitations are particularly apparent in real-life driving scenarios where state-of-the-art algorithms struggle to safely or reliably complete overtaking manoeuvres. This is important, as reliable navigation around other vehicles is vital for safe autonomous wheel-to-wheel racing. The F1Tenth Competition provides a useful opportunity for developing wheel-to-wheel racing algorithms on a standardised physical platform. The competition format makes it possible to evaluate overtaking and wheel-to-wheel racing algorithms against the state-of-the-art. This research presents a novel racing and overtaking agent capable of learning to reliably navigate a track and overtake opponents in both simulation and reality. The agent was deployed on an F1Tenth vehicle and competed against opponents running varying competitive algorithms in the real world. The results demonstrate that the agent's training against opponents enables deliberate overtaking behaviours with an overtaking rate of 87% compared 56% for an agent trained just to race.
☆ REALMS2 -- Resilient Exploration And Lunar Mapping System 2 -- A Comprehensive Approach IROS 2025
The European Space Agency (ESA) and the European Space Resources Innovation Centre (ESRIC) created the Space Resources Challenge to invite researchers and companies to propose innovative solutions for Multi-Robot Systems (MRS) space prospection. This paper proposes the Resilient Exploration And Lunar Mapping System 2 (REALMS2), a MRS framework for planetary prospection and mapping. Based on Robot Operating System version 2 (ROS 2) and enhanced with Visual Simultaneous Localisation And Mapping (vSLAM) for map generation, REALMS2 uses a mesh network for a robust ad hoc network. A single graphical user interface (GUI) controls all the rovers, providing a simple overview of the robotic mission. This system is designed for heterogeneous multi-robot exploratory missions, tackling the challenges presented by extraterrestrial environments. REALMS2 was used during the second field test of the ESA-ESRIC Challenge and allowed to map around 60% of the area, using three homogeneous rovers while handling communication delays and blackouts.
comment: 8 Pages, 8 Figures, Submitted and Accepted to IROS 2025
☆ SpikeATac: A Multimodal Tactile Finger with Taxelized Dynamic Sensing for Dexterous Manipulation
In this work, we introduce SpikeATac, a multimodal tactile finger combining a taxelized and highly sensitive dynamic response (PVDF) with a static transduction method (capacitive) for multimodal touch sensing. Named for its `spiky' response, SpikeATac's 16-taxel PVDF film sampled at 4 kHz provides fast, sensitive dynamic signals to the very onset and breaking of contact. We characterize the sensitivity of the different modalities, and show that SpikeATac provides the ability to stop quickly and delicately when grasping fragile, deformable objects. Beyond parallel grasping, we show that SpikeATac can be used in a learning-based framework to achieve new capabilities on a dexterous multifingered robot hand. We use a learning recipe that combines reinforcement learning from human feedback with tactile-based rewards to fine-tune the behavior of a policy to modulate force. Our hardware platform and learning pipeline together enable a difficult dexterous and contact-rich task that has not previously been achieved: in-hand manipulation of fragile objects. Videos are available at \href{https://roamlab.github.io/spikeatac/}{roamlab.github.io/spikeatac}.
comment: 9 pages, 8 figures, under review
☆ A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics
Visual reasoning, particularly spatial reasoning, is a challenging cognitive task that requires understanding object relationships and their interactions within complex environments, especially in robotics domain. Existing vision_language models (VLMs) excel at perception tasks but struggle with fine-grained spatial reasoning due to their implicit, correlation-driven reasoning and reliance solely on images. We propose a novel neuro_symbolic framework that integrates both panoramic-image and 3D point cloud information, combining neural perception with symbolic reasoning to explicitly model spatial and logical relationships. Our framework consists of a perception module for detecting entities and extracting attributes, and a reasoning module that constructs a structured scene graph to support precise, interpretable queries. Evaluated on the JRDB-Reasoning dataset, our approach demonstrates superior performance and reliability in crowded, human_built environments while maintaining a lightweight design suitable for robotics and embodied AI applications.
☆ A Hermetic, Transparent Soft Growing Vine Robot System for Pipe Inspection
Rehabilitation of aging pipes requires accurate condition assessment and mapping far into the pipe interiors. Soft growing vine robot systems are particularly promising for navigating confined, sinuous paths such as in pipes, but are currently limited by complex subsystems and a lack of validation in real-world industrial settings. In this paper, we introduce the concept and implementation of a hermetic and transparent vine robot system for visual condition assessment and mapping within non-branching pipes. This design encloses all mechanical and electrical components within the vine robot's soft, airtight, and transparent body, protecting them from environmental interference while enabling visual sensing. Because this approach requires an enclosed mechanism for transporting sensors, we developed, modeled, and tested a passively adapting enclosed tip mount. Finally, we validated the hermetic and transparent vine robot system concept through a real-world condition assessment and mapping task in a wastewater pipe. This work advances the use of soft-growing vine robots in pipe inspection by developing and demonstrating a robust, streamlined, field-validated system suitable for continued development and deployment.
comment: 8 pages, 7 figures
☆ Cooperative Integrated Estimation-Guidance for Simultaneous Interception of Moving Targets
This paper proposes a cooperative integrated estimation-guidance framework for simultaneous interception of a non-maneuvering target using a team of unmanned autonomous vehicles, assuming only a subset of vehicles are equipped with dedicated sensors to measure the target's states. Unlike earlier approaches that focus solely on either estimation or guidance design, the proposed framework unifies both within a cooperative architecture. To circumvent the limitation posed by heterogeneity in target observability, sensorless vehicles estimate the target's state by leveraging information exchanged with neighboring agents over a directed communication topology through a prescribed-time observer. The proposed approach employs true proportional navigation guidance (TPNG), which uses an exact time-to-go formulation and is applicable across a wide spectrum of target motions. Furthermore, prescribed-time observer and controller are employed to achieve convergence to true target's state and consensus in time-to-go within set predefined times, respectively. Simulations demonstrate the effectiveness of the proposed framework under various engagement scenarios.
☆ RepV: Safety-Separable Latent Spaces for Scalable Neurosymbolic Plan Verification
As AI systems migrate to safety-critical domains, verifying that their actions comply with well-defined rules remains a challenge. Formal methods provide provable guarantees but demand hand-crafted temporal-logic specifications, offering limited expressiveness and accessibility. Deep learning approaches enable evaluation of plans against natural-language constraints, yet their opaque decision process invites misclassifications with potentially severe consequences. We introduce RepV, a neurosymbolic verifier that unifies both views by learning a latent space where safe and unsafe plans are linearly separable. Starting from a modest seed set of plans labeled by an off-the-shelf model checker, RepV trains a lightweight projector that embeds each plan, together with a language model-generated rationale, into a low-dimensional space; a frozen linear boundary then verifies compliance for unseen natural-language rules in a single forward pass. Beyond binary classification, RepV provides a probabilistic guarantee on the likelihood of correct verification based on its position in the latent space. This guarantee enables a guarantee-driven refinement of the planner, improving rule compliance without human annotations. Empirical evaluations show that RepV improves compliance prediction accuracy by up to 15% compared to baseline methods while adding fewer than 0.2M parameters. Furthermore, our refinement framework outperforms ordinary fine-tuning baselines across various planning domains. These results show that safety-separable latent spaces offer a scalable, plug-and-play primitive for reliable neurosymbolic plan verification. Code and data are available at: https://repv-project.github.io/.
comment: Code and data are available at: https://repv-project.github.io/
☆ Heterogeneous Robot Collaboration in Unstructured Environments with Grounded Generative Intelligence
Heterogeneous robot teams operating in realistic settings often must accomplish complex missions requiring collaboration and adaptation to information acquired online. Because robot teams frequently operate in unstructured environments -- uncertain, open-world settings without prior maps -- subtasks must be grounded in robot capabilities and the physical world. While heterogeneous teams have typically been designed for fixed specifications, generative intelligence opens the possibility of teams that can accomplish a wide range of missions described in natural language. However, current large language model (LLM)-enabled teaming methods typically assume well-structured and known environments, limiting deployment in unstructured environments. We present SPINE-HT, a framework that addresses these limitations by grounding the reasoning abilities of LLMs in the context of a heterogeneous robot team through a three-stage process. Given language specifications describing mission goals and team capabilities, an LLM generates grounded subtasks which are validated for feasibility. Subtasks are then assigned to robots based on capabilities such as traversability or perception and refined given feedback collected during online operation. In simulation experiments with closed-loop perception and control, our framework achieves nearly twice the success rate compared to prior LLM-enabled heterogeneous teaming approaches. In real-world experiments with a Clearpath Jackal, a Clearpath Husky, a Boston Dynamics Spot, and a high-altitude UAV, our method achieves an 87\% success rate in missions requiring reasoning about robot capabilities and refining subtasks with online feedback. More information is provided at https://zacravichandran.github.io/SPINE-HT.
☆ NaviTrace: Evaluating Embodied Navigation of Vision-Language Models
Vision-language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models' navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics and correlates with human preferences. Our evaluation reveals consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation. The benchmark and leaderboard can be found at https://leggedrobotics.github.io/navitrace_webpage/.
comment: 9 pages, 6 figures, under review at IEEE conference
☆ Design for One, Deploy for Many: Navigating Tree Mazes with Multiple Agents
Maze-like environments, such as cave and pipe networks, pose unique challenges for multiple robots to coordinate, including communication constraints and congestion. To address these challenges, we propose a distributed multi-agent maze traversal algorithm for environments that can be represented by acyclic graphs. It uses a leader-switching mechanism where one agent, assuming a head role, employs any single-agent maze solver while the other agents each choose an agent to follow. The head role gets transferred to neighboring agents where necessary, ensuring it follows the same path as a single agent would. The multi-agent maze traversal algorithm is evaluated in simulations with groups of up to 300 agents, various maze sizes, and multiple single-agent maze solvers. It is compared against strategies that are na\"ive, or assume either global communication or full knowledge of the environment. The algorithm outperforms the na\"ive strategy in terms of makespan and sum-of-fuel. It is superior to the global-communication strategy in terms of makespan but is inferior to it in terms of sum-of-fuel. The findings suggest it is asymptotically equivalent to the full-knowledge strategy with respect to either metric. Moreover, real-world experiments with up to 20 Pi-puck robots confirm the feasibility of the approach.
comment: 7 pages, 7 figures, to be published in MRS 2025
☆ Leveraging Foundation Models for Enhancing Robot Perception and Action
This thesis investigates how foundation models can be systematically leveraged to enhance robotic capabilities, enabling more effective localization, interaction, and manipulation in unstructured environments. The work is structured around four core lines of inquiry, each addressing a fundamental challenge in robotics while collectively contributing to a cohesive framework for semantics-aware robotic intelligence.
comment: Doctoral thesis
♻ ☆ CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance and observational robustness. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
comment: 39 pages, 24 figures
♻ ☆ Agile and Cooperative Aerial Manipulation of a Cable-Suspended Load
Quadrotors can carry slung loads to hard-to-reach locations at high speed. Since a single quadrotor has limited payload capacities, using a team of quadrotors to collaboratively manipulate a heavy object is a scalable and promising solution. However, existing control algorithms for multi-lifting systems only enable low-speed and low-acceleration operations due to the complex dynamic coupling between quadrotors and the load, limiting their use in time-critical missions such as search and rescue. In this work, we present a solution to significantly enhance the agility of cable-suspended multi-lifting systems. Unlike traditional cascaded solutions, we introduce a trajectory-based framework that solves the whole-body kinodynamic motion planning problem online, accounting for the dynamic coupling effects and constraints between the quadrotors and the load. The planned trajectory is provided to the quadrotors as a reference in a receding-horizon fashion and is tracked by an onboard controller that observes and compensates for the cable tension. Real-world experiments demonstrate that our framework can achieve at least eight times greater acceleration than state-of-the-art methods to follow agile trajectories. Our method can even perform complex maneuvers such as flying through narrow passages at high speed. Additionally, it exhibits high robustness against load uncertainties and does not require adding any sensors to the load, demonstrating strong practicality.
comment: 38 pages, 11 figures
♻ ☆ LiGen: GAN-Augmented Spectral Fingerprinting for Indoor Positioning
Accurate and robust indoor localization is critical for smart building applications, yet existing Wi-Fi-based systems are often vulnerable to environmental conditions. This work presents a novel indoor localization system, called LiGen, that leverages the spectral intensity patterns of ambient light as fingerprints, offering a more stable and infrastructure-free alternative to radio signals. To address the limited spectral data, we design a data augmentation framework based on generative adversarial networks (GANs), featuring two variants: PointGAN, which generates fingerprints conditioned on coordinates, and FreeGAN, which uses a weak localization model to label unconditioned samples. Our positioning model, leveraging a Multi-Layer Perceptron (MLP) architecture to train on synthesized data, achieves submeter-level accuracy, outperforming Wi-Fi-based baselines by over 50\%. LiGen also demonstrates strong robustness in cluttered environments. To the best of our knowledge, this is the first system to combine spectral fingerprints with GAN-based data augmentation for indoor localization.
comment: 6 pages, 10 figures
♻ ☆ C-NAV: Towards Self-Evolving Continual Object Navigation in Open World NeurIPS 2025
Embodied agents are expected to perform object navigation in dynamic, open-world environments. However, existing approaches typically rely on static trajectories and a fixed set of object categories during training, overlooking the real-world requirement for continual adaptation to evolving scenarios. To facilitate related studies, we introduce the continual object navigation benchmark, which requires agents to acquire navigation skills for new object categories while avoiding catastrophic forgetting of previously learned knowledge. To tackle this challenge, we propose C-Nav, a continual visual navigation framework that integrates two key innovations: (1) A dual-path anti-forgetting mechanism, which comprises feature distillation that aligns multi-modal inputs into a consistent representation space to ensure representation consistency, and feature replay that retains temporal features within the action decoder to ensure policy consistency. (2) An adaptive sampling strategy that selects diverse and informative experiences, thereby reducing redundancy and minimizing memory overhead. Extensive experiments across multiple model architectures demonstrate that C-Nav consistently outperforms existing approaches, achieving superior performance even compared to baselines with full trajectory retention, while significantly lowering memory requirements. The code will be publicly available at https://bigtree765.github.io/C-Nav-project.
comment: Accepted at NeurIPS 2025
♻ ☆ Loop Closure from Two Views: Revisiting PGO for Scalable Trajectory Estimation through Monocular Priors
(Visual) Simultaneous Localization and Mapping (SLAM) remains a fundamental challenge in enabling autonomous systems to navigate and understand large-scale environments. Traditional SLAM approaches struggle to balance efficiency and accuracy, particularly in large-scale settings where extensive computational resources are required for scene reconstruction and Bundle Adjustment (BA). However, this scene reconstruction, in the form of sparse pointclouds of visual landmarks, is often only used within the SLAM system because navigation and planning methods require different map representations. In this work, we therefore investigate a more scalable Visual SLAM (VSLAM) approach without reconstruction, mainly based on approaches for two-view loop closures. By restricting the map to a sparse keyframed pose graph without dense geometry representations, our `2GO' system achieves efficient optimization with competitive absolute trajectory accuracy. In particular, we find that recent advancements in image matching and monocular depth priors enable very accurate trajectory optimization without BA. We conduct extensive experiments on diverse datasets, including large-scale scenarios, and provide a detailed analysis of the trade-offs between runtime, accuracy, and map size. Our results demonstrate that this streamlined approach supports real-time performance, scales well in map size and trajectory duration, and effectively broadens the capabilities of VSLAM for long-duration deployments to large environments.
♻ ☆ Learning to Insert for Constructive Neural Vehicle Routing Solver NeurIPS 2025
Neural Combinatorial Optimisation (NCO) is a promising learning-based approach for solving Vehicle Routing Problems (VRPs) without extensive manual design. While existing constructive NCO methods typically follow an appending-based paradigm that sequentially adds unvisited nodes to partial solutions, this rigid approach often leads to suboptimal results. To overcome this limitation, we explore the idea of insertion-based paradigm and propose Learning to Construct with Insertion-based Paradigm (L2C-Insert), a novel learning-based method for constructive NCO. Unlike traditional approaches, L2C-Insert builds solutions by strategically inserting unvisited nodes at any valid position in the current partial solution, which can significantly enhance the flexibility and solution quality. The proposed framework introduces three key components: a novel model architecture for precise insertion position prediction, an efficient training scheme for model optimization, and an advanced inference technique that fully exploits the insertion paradigm's flexibility. Extensive experiments on both synthetic and real-world instances of the Travelling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) demonstrate that L2C-Insert consistently achieves superior performance across various problem sizes.
comment: Accepted at NeurIPS 2025
♻ ☆ SAFE: Multitask Failure Detection for Vision-Language-Action Models NeurIPS 2025
While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out of the box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while generalist VLAs require the detector to generalize and detect failures also in unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We test it on OpenVLA, $\pi_0$, and $\pi_0$-FAST in both simulated and real-world environments extensively. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction. More qualitative results and code can be found at the project webpage: https://vla-safe.github.io/
comment: NeurIPS 2025 camera ready. Project Page: https://vla-safe.github.io/
♻ ☆ Towards Predicting Any Human Trajectory In Context NeurIPS 2025
Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, the need to fine-tune for each new scenario is often impractical for deployment on edge devices. To address this challenge, we introduce \paper, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables adaptation without fine-tuning on the scenario-specific data at inference time without requiring weight updates. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. Project Page: https://fujiry0.github.io/TrajICL-project-page/.
comment: NeurIPS 2025
♻ ☆ FSR-VLN: Fast and Slow Reasoning for Vision-Language Navigation with Hierarchical Multi-modal Scene Graph
Visual-Language Navigation (VLN) is a fundamental challenge in robotic systems, with broad applications for the deployment of embodied agents in real-world environments. Despite recent advances, existing approaches are limited in long-range spatial reasoning, often exhibiting low success rates and high inference latency, particularly in long-range navigation tasks. To address these limitations, we propose FSR-VLN, a vision-language navigation system that combines a Hierarchical Multi-modal Scene Graph (HMSG) with Fast-to-Slow Navigation Reasoning (FSR). The HMSG provides a multi-modal map representation supporting progressive retrieval, from coarse room-level localization to fine-grained goal view and object identification. Building on HMSG, FSR first performs fast matching to efficiently select candidate rooms, views, and objects, then applies VLM-driven refinement for final goal selection. We evaluated FSR-VLN across four comprehensive indoor datasets collected by humanoid robots, utilizing 87 instructions that encompass a diverse range of object categories. FSR-VLN achieves state-of-the-art (SOTA) performance in all datasets, measured by the retrieval success rate (RSR), while reducing the response time by 82% compared to VLM-based methods on tour videos by activating slow reasoning only when fast intuition fails. Furthermore, we integrate FSR-VLN with speech interaction, planning, and control modules on a Unitree-G1 humanoid robot, enabling natural language interaction and real-time navigation.
comment: 8 pages
♻ ☆ Human-assisted Robotic Policy Refinement via Action Preference Optimization NeurIPS 2025
Establishing a reliable and iteratively refined robotic system is essential for deploying real-world applications. While Vision-Language-Action (VLA) models are widely recognized as the foundation model for such robotic deployment, their reliance on offline expert demonstrations critically limits their capacity for post-deployment refinement. To mitigate this limitation, we introduce Action Preference Optimization (APO), a method designed to refine VLA models by human-assisted preference alignment gathered through interaction with environments. This method begins with a human-robot collaboration framework for reliable failure correction and interaction trajectory collection through human intervention. However, directly leveraging these interaction trajectories for preference optimization is non-trivial due to the challenges of irreversible robotic actions and token distribution mismatch. To solve this, APO proposes an adaptive reweighting algorithm with binary desirability signals derived from interaction, empowering VLA models effectively suppress failure-prone actions while enhancing corrective action adaptation. Ultimately, APO equips VLA models with the crucial capability to learn from failure, paving the way for their iterative refinement and reliable deployment in dynamic environments. The experiments conducted in simulation and real-world scenarios prove superior generalization and robustness of our human-assisted framework across a variety of manipulation tasks. We believe this work could bring insights for efficient and stable optimization of VLA models through human-robot collaboration. The code and dataset are released at https://github.com/GeWu-Lab/Action-Preference-Optimization
comment: Accepted By NeurIPS 2025
♻ ☆ 3D Equivariant Visuomotor Policy Learning via Spherical Projection
Equivariant models have recently been shown to improve the data efficiency of diffusion policy by a significant margin. However, prior work that explored this direction focused primarily on point cloud inputs generated by multiple cameras fixed in the workspace. This type of point cloud input is not compatible with the now-common setting where the primary input modality is an eye-in-hand RGB camera like a GoPro. This paper closes this gap by incorporating into the diffusion policy model a process that projects features from the 2D RGB camera image onto a sphere. This enables us to reason about symmetries in $\mathrm{SO}(3)$ without explicitly reconstructing a point cloud. We perform extensive experiments in both simulation and the real world that demonstrate that our method consistently outperforms strong baselines in terms of both performance and sample efficiency. Our work, Image-to-Sphere Policy ($\textbf{ISP}$), is the first $\mathrm{SO}(3)$-equivariant policy learning framework for robotic manipulation that works using only monocular RGB inputs.
♻ ☆ Falconry-like palm landing by a flapping-wing drone based on the human gesture interaction and distance-aware flight planning
Flapping-wing drones have attracted significant attention due to their biomimetic flight. They are considered more human-friendly due to their characteristics such as low noise and flexible wings, making them suitable for human-drone interactions. However, few studies have explored the practical interaction between humans and flapping-wing drones. On establishing a physical interaction system with flapping-wing drones, we can acquire inspirations from falconers who guide birds of prey to land on their arms. This interaction interprets the human body as a dynamic landing platform, which can be utilized in various scenarios such as crowded or spatially constrained environments. Thus, in this study, we propose a falconry-like interaction system in which a flapping-wing drone performs a palm landing motion on a human hand. To achieve a safe approach toward humans, we design a trajectory planning method that considers both physical and psychological factors of the human safety such as the drone's velocity and distance from the user. We use a commercial flapping platform with our implemented motion planning and conduct experiments to evaluate the palm landing performance and safety. The results demonstrate that our approach enables safe and smooth hand landing interactions. To the best of our knowledge, it is the first time to achieve a contact-based interaction between flapping-wing drones and humans.
comment: 8 pages, 14 figures
♻ ☆ DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but often fail to generalize to long-tail scenarios due to the lack of essential world knowledge to understand and reason about surrounding environments. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module directly generating semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. The experiment on the ICCV 2025 Autonomous Grand Challenge leaderboard shows that DiffVLA++ achieves EPDMS of 49.12.
♻ ☆ Mechanical Intelligence-Aware Curriculum Reinforcement Learning for Humanoids with Parallel Actuation
Reinforcement learning (RL) has enabled advances in humanoid robot locomotion, yet most learning frameworks do not account for mechanical intelligence embedded in parallel actuation mechanisms due to limitations in simulator support for closed kinematic chains. This omission can lead to inaccurate motion modeling and suboptimal policies, particularly for robots with high actuation complexity. This paper presents general formulations and simulation methods for three types of parallel mechanisms: a differential pulley, a five-bar linkage, and a four-bar linkage, and trains a parallel-mechanism aware policy through an end-to-end curriculum RL framework for BRUCE, a kid-sized humanoid robot. Unlike prior approaches that rely on simplified serial approximations, we simulate all closed-chain constraints natively using GPU-accelerated MuJoCo (MJX), preserving the hardware's mechanical nonlinear properties during training. We benchmark our RL approach against a model predictive controller (MPC), demonstrating better surface generalization and performance in real-world zero-shot deployment. This work highlights the computational approaches and performance benefits of fully simulating parallel mechanisms in end-to-end learning pipelines for legged humanoids. Project codes with parallel mechanisms: https://github.com/alvister88/og_bruce
comment: Proceeding to the IEEE Humanoid Conference 2025
♻ ☆ Robust Offline Reinforcement Learning with Linearly Structured f-Divergence Regularization ICML 2025
The Robust Regularized Markov Decision Process (RRMDP) is proposed to learn policies robust to dynamics shifts by adding regularization to the transition dynamics in the value function. Existing methods mostly use unstructured regularization, potentially leading to conservative policies under unrealistic transitions. To address this limitation, we propose a novel framework, the $d$-rectangular linear RRMDP ($d$-RRMDP), which introduces latent structures into both transition kernels and regularization. We focus on offline reinforcement learning, where an agent learns policies from a precollected dataset in the nominal environment. We develop the Robust Regularized Pessimistic Value Iteration (R2PVI) algorithm that employs linear function approximation for robust policy learning in $d$-RRMDPs with $f$-divergence based regularization terms on transition kernels. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, demonstrating that these bounds are influenced by how well the dataset covers state-action spaces visited by the optimal robust policy under robustly admissible transitions. We establish information-theoretic lower bounds to verify that our algorithm is near-optimal. Finally, numerical experiments validate that R2PVI learns robust policies and exhibits superior computational efficiency compared to baseline methods.
comment: 41 pages, 3 figures, 2 tables. Published in Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)
♻ ☆ Online Adaptation for Flying Quadrotors in Tight Formations
The task of flying in tight formations is challenging for teams of quadrotors because the complex aerodynamic wake interactions can destabilize individual team members as well as the team. Furthermore, these aerodynamic effects are highly nonlinear and fast-paced, making them difficult to model and predict. To overcome these challenges, we present L1 KNODE-DW MPC, an adaptive, mixed expert learning based control framework that allows individual quadrotors to accurately track trajectories while adapting to time-varying aerodynamic interactions during formation flights. We evaluate L1 KNODE-DW MPC in two different three-quadrotor formations and show that it outperforms several MPC baselines. Our results show that the proposed framework is capable of enabling the three-quadrotor team to remain vertically aligned in close proximity throughout the flight. These findings show that the L1 adaptive module compensates for unmodeled disturbances most effectively when paired with an accurate dynamics model. A video showcasing our framework and the physical experiments is available here: https://youtu.be/9QX1Q5Ut9Rs
comment: 10 pages, 4 figures
♻ ☆ Understanding the Application of Utility Theory in Robotics and Artificial Intelligence: A Survey
As a unifying concept in economics, game theory, and operations research, even in the Robotics and AI field, the utility is used to evaluate the level of individual needs, preferences, and interests. Especially for decision-making and learning in multi-agent/robot systems (MAS/MRS), a suitable utility model can guide agents in choosing reasonable strategies to achieve their current needs and learning to cooperate and organize their behaviors, optimizing the system's utility, building stable and reliable relationships, and guaranteeing each group member's sustainable development, similar to the human society. Although these systems' complex, large-scale, and long-term behaviors are strongly determined by the fundamental characteristics of the underlying relationships, there has been less discussion on the theoretical aspects of mechanisms and the fields of applications in Robotics and AI. This paper introduces a utility-orient needs paradigm to describe and evaluate inter and outer relationships among agents' interactions. Then, we survey existing literature in relevant fields to support it and propose several promising research directions along with some open problems deemed necessary for further investigations.
comment: I am not sure whether withdrawing this paper is suitable. However, right now this paper has significant changes in its topic and author. So, I do not want to lead to any confusion about this paper. In the future, it will have a new version. I hope people will not have issues and confusion about the older one
♻ ☆ Object-Centric Kinodynamic Planning for Nonprehensile Robot Rearrangement Manipulation
Nonprehensile actions such as pushing are crucial for addressing multi-object rearrangement problems. Many traditional methods generate robot-centric actions, which differ from intuitive human strategies and are typically inefficient. To this end, we adopt an object-centric planning paradigm and propose a unified framework for addressing a range of large-scale, physics-intensive nonprehensile rearrangement problems challenged by modeling inaccuracies and real-world uncertainties. By assuming each object can actively move without being driven by robot interactions, our planner first computes desired object motions, which are then realized through robot actions generated online via a closed-loop pushing strategy. Through extensive experiments and in comparison with state-of-the-art baselines in both simulation and on a physical robot, we show that our object-centric planning framework can generate more intuitive and task-effective robot actions with significantly improved efficiency. In addition, we propose a benchmarking protocol to standardize and facilitate future research in nonprehensile rearrangement.
♻ ☆ PoseDiff: A Unified Diffusion Model Bridging Robot Pose Estimation and Video-to-Action Control
We present PoseDiff, a conditional diffusion model that unifies robot state estimation and control within a single framework. At its core, PoseDiff maps raw visual observations into structured robot states-such as 3D keypoints or joint angles-from a single RGB image, eliminating the need for multi-stage pipelines or auxiliary modalities. Building upon this foundation, PoseDiff extends naturally to video-to-action inverse dynamics: by conditioning on sparse video keyframes generated by world models, it produces smooth and continuous long-horizon action sequences through an overlap-averaging strategy. This unified design enables scalable and efficient integration of perception and control. On the DREAM dataset, PoseDiff achieves state-of-the-art accuracy and real-time performance for pose estimation. On Libero-Object manipulation tasks, it substantially improves success rates over existing inverse dynamics modules, even under strict offline settings. Together, these results show that PoseDiff provides a scalable, accurate, and efficient bridge between perception, planning, and control in embodied AI. The video visualization results can be found on the project page: https://haozhuo-zhang.github.io/PoseDiff-project-page/.
comment: The experimental setup and metrics lacks rigor, affecting the fairness of the comparisons
Computer Vision and Pattern Recognition 153
☆ Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io
comment: Project Page: https://video-cof.github.io
☆ OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes
There are two prevalent ways to constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.
comment: Project page: https://yukun-huang.github.io/OmniX/
☆ Masked Diffusion Captioning for Visual Feature Learning EMNLP 2025
We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
comment: EMNLP 2025 (Findings). Project page: https://cfeng16.github.io/mdlm4vfl/
☆ SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting
Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate See4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.
comment: 26 pages; 21 figures; 3 tables; project page: https://see-4d.github.io/
☆ Scaling Image Geo-Localization to Continent Level NeurIPS 2025
Determining the precise geographic location of an image at a global scale remains an unsolved challenge. Standard image retrieval techniques are inefficient due to the sheer volume of images (>100M) and fail when coverage is insufficient. Scalable solutions, however, involve a trade-off: global classification typically yields coarse results (10+ kilometers), while cross-view retrieval between ground and aerial imagery suffers from a domain gap and has been primarily studied on smaller regions. This paper introduces a hybrid approach that achieves fine-grained geo-localization across a large geographic expanse the size of a continent. We leverage a proxy classification task during training to learn rich feature representations that implicitly encode precise location information. We combine these learned prototypes with embeddings of aerial imagery to increase robustness to the sparsity of ground-level data. This enables direct, fine-grained retrieval over areas spanning multiple countries. Our extensive evaluation demonstrates that our approach can localize within 200m more than 68\% of queries of a dataset covering a large part of Europe. The code is publicly available at https://scaling-geoloc.github.io.
comment: NeurIPS 2025
☆ The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.
☆ HEIR: Learning Graph-Based Motion Hierarchies
Hierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks. Project Page: https://light.princeton.edu/HEIR/
comment: Code link: https://github.com/princeton-computational-imaging/HEIR
☆ Clone Deterministic 3D Worlds with Geometrically-Regularized World Models
A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. Despite rapid progress, current world models remain brittle and degrade over long horizons. We argue that a central cause is representation quality: exteroceptive inputs (e.g., images) are high-dimensional, and lossy or entangled latents make dynamics learning unnecessarily hard. We therefore ask whether improving representation learning alone can substantially improve world-model performance. In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone and overfit to a deterministic 3D world. We propose Geometrically-Regularized World Models (GRWM), which enforces that consecutive points along a natural sensory trajectory remain close in latent representation space. This approach yields significantly improved latent representations that align closely with the true topology of the environment. GRWM is plug-and-play, requires only minimal architectural modification, scales with trajectory length, and is compatible with diverse latent generative backbones. Across deterministic 3D settings and long-horizon prediction tasks, GRWM significantly increases rollout fidelity and stability. Analyses show that its benefits stem from learning a latent manifold with superior geometric structure. These findings support a clear takeaway: improving representation learning is a direct and useful path to robust world models, delivering reliable long-horizon predictions without enlarging the dynamics module.
☆ ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs' capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.
☆ Surpassing state of the art on AMD area estimation from RGB fundus images through careful selection of U-Net architectures and loss functions for class imbalance
Age-related macular degeneration (AMD) is one of the leading causes of irreversible vision impairment in people over the age of 60. This research focuses on semantic segmentation for AMD lesion detection in RGB fundus images, a non-invasive and cost-effective imaging technique. The results of the ADAM challenge - the most comprehensive AMD detection from RGB fundus images research competition and open dataset to date - serve as a benchmark for our evaluation. Taking the U-Net connectivity as a base of our framework, we evaluate and compare several approaches to improve the segmentation model's architecture and training pipeline, including pre-processing techniques, encoder (backbone) deep network types of varying complexity, and specialized loss functions to mitigate class imbalances on image and pixel levels. The main outcome of this research is the final configuration of the AMD detection framework, which outperforms all the prior ADAM challenge submissions on the multi-class segmentation of different AMD lesion types in non-invasive RGB fundus images. The source code used to conduct the experiments presented in this paper is made freely available.
☆ SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models
This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. Our steering module requires learning parameters equal to 0.14% of the original VLM's size. Our steering module gains model control through dimension-wise activation modulation and adaptive steering across layers without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and proposes a robust solution for multimodal model control through activation engineering.
☆ MORE: Multi-Organ Medical Image REconstruction Dataset
CT reconstruction provides radiologists with images for diagnosis and treatment, yet current deep learning methods are typically limited to specific anatomies and datasets, hindering generalization ability to unseen anatomies and lesions. To address this, we introduce the Multi-Organ medical image REconstruction (MORE) dataset, comprising CT scans across 9 diverse anatomies with 15 lesion types. This dataset serves two key purposes: (1) enabling robust training of deep learning models on extensive, heterogeneous data, and (2) facilitating rigorous evaluation of model generalization for CT reconstruction. We further establish a strong baseline solution that outperforms prior approaches under these challenging conditions. Our results demonstrate that: (1) a comprehensive dataset helps improve the generalization capability of models, and (2) optimization-based methods offer enhanced robustness for unseen anatomies. The MORE dataset is freely accessible under CC-BY-NC 4.0 at our project page https://more-med.github.io/
comment: Accepted to ACMMM 2025
☆ ProstNFound+: A Prospective Study using Medical Foundation Models for Prostate Cancer Detection
Purpose: Medical foundation models (FMs) offer a path to build high-performance diagnostic systems. However, their application to prostate cancer (PCa) detection from micro-ultrasound ({\mu}US) remains untested in clinical settings. We present ProstNFound+, an adaptation of FMs for PCa detection from {\mu}US, along with its first prospective validation. Methods: ProstNFound+ incorporates a medical FM, adapter tuning, and a custom prompt encoder that embeds PCa-specific clinical biomarkers. The model generates a cancer heatmap and a risk score for clinically significant PCa. Following training on multi-center retrospective data, the model is prospectively evaluated on data acquired five years later from a new clinical site. Model predictions are benchmarked against standard clinical scoring protocols (PRI-MUS and PI-RADS). Results: ProstNFound+ shows strong generalization to the prospective data, with no performance degradation compared to retrospective evaluation. It aligns closely with clinical scores and produces interpretable heatmaps consistent with biopsy-confirmed lesions. Conclusion: The results highlight its potential for clinical deployment, offering a scalable and interpretable alternative to expert-driven protocols.
☆ The Impact and Outlook of 3D Gaussian Splatting
Since its introduction, 3D Gaussian Splatting (3DGS) has rapidly transformed the landscape of 3D scene representations, inspiring an extensive body of associated research. Follow-up work includes analyses and contributions that enhance the efficiency, scalability, and real-world applicability of 3DGS. In this summary, we present an overview of several key directions that have emerged in the wake of 3DGS. We highlight advances enabling resource-efficient training and rendering, the evolution toward dynamic (or four-dimensional, 4DGS) representations, and deeper exploration of the mathematical foundations underlying its appearance modeling and rendering process. Furthermore, we examine efforts to bring 3DGS to mobile and virtual reality platforms, its extension to massive-scale environments, and recent progress toward near-instant radiance field reconstruction via feed-forward or distributed computation. Collectively, these developments illustrate how 3DGS has evolved from a breakthrough representation into a versatile and foundational tool for 3D vision and graphics.
comment: Article written for Frontiers of Science Award, International Congress on Basic Science, 2025
☆ Process Integrated Computer Vision for Real-Time Failure Prediction in Steel Rolling Mill
We present a long-term deployment study of a machine vision-based anomaly detection system for failure prediction in a steel rolling mill. The system integrates industrial cameras to monitor equipment operation, alignment, and hot bar motion in real time along the process line. Live video streams are processed on a centralized video server using deep learning models, enabling early prediction of equipment failures and process interruptions, thereby reducing unplanned breakdown costs. Server-based inference minimizes the computational load on industrial process control systems (PLCs), supporting scalable deployment across production lines with minimal additional resources. By jointly analyzing sensor data from data acquisition systems and visual inputs, the system identifies the location and probable root causes of failures, providing actionable insights for proactive maintenance. This integrated approach enhances operational reliability, productivity, and profitability in industrial manufacturing environments.
☆ Improving Classification of Occluded Objects through Scene Context
The presence of occlusions has provided substantial challenges to typically-powerful object recognition algorithms. Additional sources of information can be extremely valuable to reduce errors caused by occlusions. Scene context is known to aid in object recognition in biological vision. In this work, we attempt to add robustness into existing Region Proposal Network-Deep Convolutional Neural Network (RPN-DCNN) object detection networks through two distinct scene-based information fusion techniques. We present one algorithm under each methodology: the first operates prior to prediction, selecting a custom object network to use based on the identified background scene, and the second operates after detection, fusing scene knowledge into initial object scores output by the RPN. We demonstrate our algorithms on challenging datasets featuring partial occlusions, which show overall improvement in both recall and precision against baseline methods. In addition, our experiments contrast multiple training methodologies for occlusion handling, finding that training on a combination of both occluded and unoccluded images demonstrates an improvement over the others. Our method is interpretable and can easily be adapted to other datasets, offering many future directions for research and practical applications.
☆ BRIQA: Balanced Reweighting in Image Quality Assessment of Pediatric Brain MRI
Assessing the severity of artifacts in pediatric brain Magnetic Resonance Imaging (MRI) is critical for diagnostic accuracy, especially in low-field systems where the signal-to-noise ratio is reduced. Manual quality assessment is time-consuming and subjective, motivating the need for robust automated solutions. In this work, we propose BRIQA (Balanced Reweighting in Image Quality Assessment), which addresses class imbalance in artifact severity levels. BRIQA uses gradient-based loss reweighting to dynamically adjust per-class contributions and employs a rotating batching scheme to ensure consistent exposure to underrepresented classes. Through experiments, no single architecture performs best across all artifact types, emphasizing the importance of architectural diversity. The rotating batching configuration improves performance across metrics by promoting balanced learning when combined with cross-entropy loss. BRIQA improves average macro F1 score from 0.659 to 0.706, with notable gains in Noise (0.430), Zipper (0.098), Positioning (0.097), Contrast (0.217), Motion (0.022), and Banding (0.012) artifact severity classification. The code is available at https://github.com/BioMedIA-MBZUAI/BRIQA.
☆ Towards Reliable Sea Ice Drift Estimation in the Arctic Deep Learning Optical Flow on RADARSAT-2
Accurate estimation of sea ice drift is critical for Arctic navigation, climate research, and operational forecasting. While optical flow, a computer vision technique for estimating pixel wise motion between consecutive images, has advanced rapidly in computer vision, its applicability to geophysical problems and to satellite SAR imagery remains underexplored. Classical optical flow methods rely on mathematical models and strong assumptions about motion, which limit their accuracy in complex scenarios. Recent deep learning based approaches have substantially improved performance and are now the standard in computer vision, motivating their application to sea ice drift estimation. We present the first large scale benchmark of 48 deep learning optical flow models on RADARSAT 2 ScanSAR sea ice imagery, evaluated with endpoint error (EPE) and Fl all metrics against GNSS tracked buoys. Several models achieve sub kilometer accuracy (EPE 6 to 8 pixels, 300 to 400 m), a small error relative to the spatial scales of sea ice motion and typical navigation requirements in the Arctic. Our results demonstrate that the models are capable of capturing consistent regional drift patterns and that recent deep learning based optical flow methods, which have substantially improved motion estimation accuracy compared to classical methods, can be effectively transferred to polar remote sensing. Optical flow produces spatially continuous drift fields, providing motion estimates for every image pixel rather than at sparse buoy locations, offering new opportunities for navigation and climate modeling.
☆ All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles
Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.
☆ SAMRI: Segment Anything Model for MRI
Accurate magnetic resonance imaging (MRI) segmentation is crucial for clinical decision-making, but remains labor-intensive when performed manually. Convolutional neural network (CNN)-based methods can be accurate and efficient, but often generalize poorly to MRI's variable contrast, intensity inhomogeneity, and protocols. Although the transformer-based Segment Anything Model (SAM) has demonstrated remarkable generalizability in natural images, existing adaptations often treat MRI as another imaging modality, overlooking these modality-specific challenges. We present SAMRI, an MRI-specialized SAM trained and validated on 1.1 million labeled MR slices spanning whole-body organs and pathologies. We demonstrate that SAM can be effectively adapted to MRI by simply fine-tuning its mask decoder using a two-stage strategy, reducing training time by 94% and trainable parameters by 96% versus full-model retraining. Across diverse MRI segmentation tasks, SAMRI achieves a mean Dice of 0.87, delivering state-of-the-art accuracy across anatomical regions and robust generalization on unseen structures, particularly small and clinically important structures.
☆ PT-DETR: Small Target Detection Based on Partially-Aware Detail Focus
To address the challenges in UAV object detection, such as complex backgrounds, severe occlusion, dense small objects, and varying lighting conditions,this paper proposes PT-DETR based on RT-DETR, a novel detection algorithm specifically designed for small objects in UAV imagery. In the backbone network, we introduce the Partially-Aware Detail Focus (PADF) Module to enhance feature extraction for small objects. Additionally,we design the Median-Frequency Feature Fusion (MFFF) module,which effectively improves the model's ability to capture small-object details and contextual information. Furthermore,we incorporate Focaler-SIoU to strengthen the model's bounding box matching capability and increase its sensitivity to small-object features, thereby further enhancing detection accuracy and robustness. Compared with RT-DETR, our PT-DETR achieves mAP improvements of 1.6% and 1.7% on the VisDrone2019 dataset with lower computational complexity and fewer parameters, demonstrating its robustness and feasibility for small-object detection tasks.
☆ Spiking Patches: Asynchronous, Sparse, and Efficient Tokens for Event Cameras
We propose tokenization of events and present a tokenizer, Spiking Patches, specifically designed for event cameras. Given a stream of asynchronous and spatially sparse events, our goal is to discover an event representation that preserves these properties. Prior works have represented events as frames or as voxels. However, while these representations yield high accuracy, both frames and voxels are synchronous and decrease the spatial sparsity. Spiking Patches gives the means to preserve the unique properties of event cameras and we show in our experiments that this comes without sacrificing accuracy. We evaluate our tokenizer using a GNN, PCN, and a Transformer on gesture recognition and object detection. Tokens from Spiking Patches yield inference times that are up to 3.4x faster than voxel-based tokens and up to 10.4x faster than frames. We achieve this while matching their accuracy and even surpassing in some cases with absolute improvements up to 3.8 for gesture recognition and up to 1.4 for object detection. Thus, tokenization constitutes a novel direction in event-based vision and marks a step towards methods that preserve the properties of event cameras.
☆ CYPRESS: Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing
Accurate and timely crop yield prediction is crucial for global food security and modern agricultural management. Traditional methods often lack the scalability and granularity required for precision farming. This paper introduces CYPRESS (Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing), a deep learning model designed for high-resolution, intra-field canola yield prediction. CYPRESS leverages a pre-trained, large-scale geospatial foundation model (Prithvi-EO-2.0-600M) and adapts it for a continuous regression task, transforming multi-temporal satellite imagery into dense, pixel-level yield maps. Evaluated on a comprehensive dataset from the Canadian Prairies, CYPRESS demonstrates superior performance over existing deep learning-based yield prediction models, highlighting the effectiveness of fine-tuning foundation models for specialized agricultural applications. By providing a continuous, high-resolution output, CYPRESS offers a more actionable tool for precision agriculture than conventional classification or county-level aggregation methods. This work validates a novel approach that bridges the gap between large-scale Earth observation and on-farm decision-making, offering a scalable solution for detailed agricultural monitoring.
☆ ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching
Computational Super-Resolution (CSR) in fluorescence microscopy has, despite being an ill-posed problem, a long history. At its very core, CSR is about finding a prior that can be used to extrapolate frequencies in a micrograph that have never been imaged by the image-generating microscope. It stands to reason that, with the advent of better data-driven machine learning techniques, stronger prior can be learned and hence CSR can lead to better results. Here, we present ResMatching, a novel CSR method that uses guided conditional flow matching to learn such improved data-priors. We evaluate ResMatching on 4 diverse biological structures from the BioSR dataset and compare its results against 7 baselines. ResMatching consistently achieves competitive results, demonstrating in all cases the best trade-off between data fidelity and perceptual realism. We observe that CSR using ResMatching is particularly effective in cases where a strong prior is hard to learn, e.g. when the given low-resolution images contain a lot of noise. Additionally, we show that ResMatching can be used to sample from an implicitly learned posterior distribution and that this distribution is calibrated for all tested use-cases, enabling our method to deliver a pixel-wise data-uncertainty term that can guide future users to reject uncertain predictions.
comment: 5 pages, 4 figures
☆ Emu3.5: Native Multimodal Models are World Learners
We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.
comment: project page: https://emu.world
☆ CATCH: A Modular Cross-domain Adaptive Template with Hook
Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and not scalable across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier to identify the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, requiring no retraining of the backbone model. Experimental results across four domain-specific VQA benchmarks demonstrate that our framework achieves consistent performance gains without retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results highlight that CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.
☆ Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios
In real-world environments, AI systems often face unfamiliar scenarios without labeled data, creating a major challenge for conventional scene understanding models. The inability to generalize across unseen contexts limits the deployment of vision-based applications in dynamic, unstructured settings. This work introduces a Dynamic Context-Aware Scene Reasoning framework that leverages Vision-Language Alignment to address zero-shot real-world scenarios. The goal is to enable intelligent systems to infer and adapt to new environments without prior task-specific training. The proposed approach integrates pre-trained vision transformers and large language models to align visual semantics with natural language descriptions, enhancing contextual comprehension. A dynamic reasoning module refines predictions by combining global scene cues and object-level interactions guided by linguistic priors. Extensive experiments on zero-shot benchmarks such as COCO, Visual Genome, and Open Images demonstrate up to 18% improvement in scene understanding accuracy over baseline models in complex and unseen environments. Results also show robust performance in ambiguous or cluttered scenes due to the synergistic fusion of vision and language. This framework offers a scalable and interpretable approach for context-aware reasoning, advancing zero-shot generalization in dynamic real-world settings.
comment: Preprint under review at IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
☆ Comparative Analysis of Deep Learning Models for Olive Tree Crown and Shadow Segmentation Towards Biovolume Estimation
Olive tree biovolume estimation is a key task in precision agriculture, supporting yield prediction and resource management, especially in Mediterranean regions severely impacted by climate-induced stress. This study presents a comparative analysis of three deep learning models U-Net, YOLOv11m-seg, and Mask RCNN for segmenting olive tree crowns and their shadows in ultra-high resolution UAV imagery. The UAV dataset, acquired over Vicopisano, Italy, includes manually annotated crown and shadow masks. Building on these annotations, the methodology emphasizes spatial feature extraction and robust segmentation; per-tree biovolume is then estimated by combining crown projected area with shadow-derived height using solar geometry. In testing, Mask R-CNN achieved the best overall accuracy (F1 = 0.86; mIoU = 0.72), while YOLOv11m-seg provided the fastest throughput (0.12 second per image). The estimated biovolumes spanned from approximately 4 to 24 cubic meters, reflecting clear structural differences among trees. These results indicate Mask R-CNN is preferable when biovolume accuracy is paramount, whereas YOLOv11m-seg suits large-area deployments where speed is critical; U-Net remains a lightweight, high-sensitivity option. The framework enables accurate, scalable orchard monitoring and can be further strengthened with DEM or DSM integration and field calibration for operational decision support.
comment: 6 pages, 2025 IEEE International Workshop on Metrology for Agriculture and Forestry (MetroAgriFor)
☆ AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping
Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under Curve, Spearman, and Kendall.
comment: Accepted at 32nd International Conference on MultiMedia Modeling
☆ SA$^{2}$Net: Scale-Adaptive Structure-Affinity Transformation for Spine Segmentation from Ultrasound Volume Projection Imaging
Spine segmentation, based on ultrasound volume projection imaging (VPI), plays a vital role for intelligent scoliosis diagnosis in clinical applications. However, this task faces several significant challenges. Firstly, the global contextual knowledge of spines may not be well-learned if we neglect the high spatial correlation of different bone features. Secondly, the spine bones contain rich structural knowledge regarding their shapes and positions, which deserves to be encoded into the segmentation process. To address these challenges, we propose a novel scale-adaptive structure-aware network (SA$^{2}$Net) for effective spine segmentation. First, we propose a scale-adaptive complementary strategy to learn the cross-dimensional long-distance correlation features for spinal images. Second, motivated by the consistency between multi-head self-attention in Transformers and semantic level affinity, we propose structure-affinity transformation to transform semantic features with class-specific affinity and combine it with a Transformer decoder for structure-aware reasoning. In addition, we adopt a feature mixing loss aggregation method to enhance model training. This method improves the robustness and accuracy of the segmentation process. The experimental results demonstrate that our SA$^{2}$Net achieves superior segmentation performance compared to other state-of-the-art methods. Moreover, the adaptability of SA$^{2}$Net to various backbones enhances its potential as a promising tool for advanced scoliosis diagnosis using intelligent spinal image analysis. The code and experimental demo are available at https://github.com/taetiseo09/SA2Net.
comment: Accepted by Computerized Medical Imaging and Graphics (CMIG)
☆ Analysis of the Robustness of an Edge Detector Based on Cellular Automata Optimized by Particle Swarm
The edge detection task is essential in image processing aiming to extract relevant information from an image. One recurring problem in this task is the weaknesses found in some detectors, such as the difficulty in detecting loose edges and the lack of context to extract relevant information from specific problems. To address these weaknesses and adapt the detector to the properties of an image, an adaptable detector described by two-dimensional cellular automaton and optimized by meta-heuristic combined with transfer learning techniques was developed. This study aims to analyze the impact of expanding the search space of the optimization phase and the robustness of the adaptability of the detector in identifying edges of a set of natural images and specialized subsets extracted from the same image set. The results obtained prove that expanding the search space of the optimization phase was not effective for the chosen image set. The study also analyzed the adaptability of the model through a series of experiments and validation techniques and found that, regardless of the validation, the model was able to adapt to the input and the transfer learning techniques applied to the model showed no significant improvements.
☆ Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (i.e., tail data). This leads to an imbalanced optimization that drives the model to prioritize simple reasoning skills, while hindering its ability to tackle more complex reasoning tasks. Over iterations, this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew effect"--which ultimately hinders further model improvement and leads to performance bottlenecks. To counteract this challenge, we introduce four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks demonstrate that our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
comment: Preprint
☆ Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
☆ Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection
Few-shot anomaly detection (FSAD) methods identify anomalous regions with few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, due to the lack of detailed textual descriptions, these methods can only pre-define image-level descriptions to match each visual patch token to identify potential anomalous regions, which leads to the semantic misalignment between image descriptions and patch-level visual anomalies, achieving sub-optimal localization performance. To address the above issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC) to provide multi-level and fine-grained textual descriptions for existing anomaly detection datasets with automatic construction pipeline. Based on the MFSC, we propose a novel framework named FineGrainedAD to improve anomaly localization performance, which consists of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through automatic replacement and concatenation mechanism, while MLSA designs region aggregation strategy and multi-level alignment training to facilitate learnable prompts better align with corresponding visual regions. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings on MVTec-AD and VisA datasets.
comment: 12 pages, 7 figures
☆ PointSt3R: Point Tracking through 3D Grounded Correspondence
Recent advances in foundational 3D reconstruction models, such as DUSt3R and MASt3R, have shown great potential in 2D and 3D correspondence in static scenes. In this paper, we propose to adapt them for the task of point tracking through 3D grounded correspondence. We first demonstrate that these models are competitive point trackers when focusing on static points, present in current point tracking benchmarks ($+33.5\%$ on EgoPoints vs. CoTracker2). We propose to combine the reconstruction loss with training for dynamic correspondence along with a visibility head, and fine-tuning MASt3R for point tracking using a relatively small amount of synthetic data. Importantly, we only train and evaluate on pairs of frames where one contains the query point, effectively removing any temporal context. Using a mix of dynamic and static point correspondences, we achieve competitive or superior point tracking results on four datasets (e.g. competitive on TAP-Vid-DAVIS 73.8 $\delta_{avg}$ / 85.8\% occlusion acc. for PointSt3R compared to 75.7 / 88.3\% for CoTracker2; and significantly outperform CoTracker3 on EgoPoints 61.3 vs 54.2 and RGB-S 87.0 vs 82.8). We also present results on 3D point tracking along with several ablations on training datasets and percentage of dynamic correspondences.
comment: http://rhodriguerrier.github.io/PointSt3R
☆ A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models
Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs' reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always have optimal angular separation between class-wise textual features, which implies overlooking the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. We show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy through extensive experiments with various backbones on different datasets. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. Our code will be made publicly available.
comment: 23 pages, 14 figures
☆ LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation
Recently text-to-video generation has made impressive progress in producing short, high-quality clips, but evaluating long-form outputs remains a major challenge especially when processing complex prompts. Existing benchmarks mostly rely on simplified prompts and focus on low-level metrics, overlooking fine-grained alignment with prompts and abstract dimensions such as narrative coherence and thematic expression. To address these gaps, we propose LoCoT2V-Bench, a benchmark specifically designed for long video generation (LVG) under complex input conditions. Based on various real-world videos, LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating elements like scene transitions and event dynamics. Moreover, it constructs a multi-dimensional evaluation framework that includes our newly proposed metrics such as event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD) that focuses on more abstract attributes like narrative flow, emotional response, and character development. Using this framework, we conduct a comprehensive evaluation of nine representative LVG models, finding that while current methods perform well on basic visual and temporal aspects, they struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence, etc. Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for evaluating long-form complex text-to-video generation and highlights critical directions for future method improvement.
☆ EEG-Driven Image Reconstruction with Saliency-Guided Diffusion Models
Existing EEG-driven image reconstruction methods often overlook spatial attention mechanisms, limiting fidelity and semantic coherence. To address this, we propose a dual-conditioning framework that combines EEG embeddings with spatial saliency maps to enhance image generation. Our approach leverages the Adaptive Thinking Mapper (ATM) for EEG feature extraction and fine-tunes Stable Diffusion 2.1 via Low-Rank Adaptation (LoRA) to align neural signals with visual semantics, while a ControlNet branch conditions generation on saliency maps for spatial control. Evaluated on THINGS-EEG, our method achieves a significant improvement in the quality of low- and high-level image features over existing approaches. Simultaneously, strongly aligning with human visual attention. The results demonstrate that attentional priors resolve EEG ambiguities, enabling high-fidelity reconstructions with applications in medical diagnostics and neuroadaptive interfaces, advancing neural decoding through efficient adaptation of pre-trained diffusion models.
comment: Demo paper
☆ SPG-CDENet: Spatial Prior-Guided Cross Dual Encoder Network for Multi-Organ Segmentation
Multi-organ segmentation is a critical task in computer-aided diagnosis. While recent deep learning methods have achieved remarkable success in image segmentation, huge variations in organ size and shape challenge their effectiveness in multi-organ segmentation. To address these challenges, we propose a Spatial Prior-Guided Cross Dual Encoder Network (SPG-CDENet), a novel two-stage segmentation paradigm designed to improve multi-organ segmentation accuracy. Our SPG-CDENet consists of two key components: a spatial prior network and a cross dual encoder network. The prior network generates coarse localization maps that delineate the approximate ROI, serving as spatial guidance for the dual encoder network. The cross dual encoder network comprises four essential components: a global encoder, a local encoder, a symmetric cross-attention module, and a flow-based decoder. The global encoder captures global semantic features from the entire image, while the local encoder focuses on features from the prior network. To enhance the interaction between the global and local encoders, a symmetric cross-attention module is proposed across all layers of the encoders to fuse and refine features. Furthermore, the flow-based decoder directly propagates high-level semantic features from the final encoder layer to all decoder layers, maximizing feature preservation and utilization. Extensive qualitative and quantitative experiments on two public datasets demonstrate the superior performance of SPG-CDENet compared to existing segmentation methods. Furthermore, ablation studies further validate the effectiveness of the proposed modules in improving segmentation accuracy.
☆ CorVS: Person Identification via Video Trajectory-Sensor Correspondence in a Real-World Warehouse
Worker location data is key to higher productivity in industrial sites. Cameras are a promising tool for localization in logistics warehouses since they also offer valuable environmental contexts such as package status. However, identifying individuals with only visual data is often impractical. Accordingly, several prior studies identified people in videos by comparing their trajectories and wearable sensor measurements. While this approach has advantages such as independence from appearance, the existing methods may break down under real-world conditions. To overcome this challenge, we propose CorVS, a novel data-driven person identification method based on correspondence between visual tracking trajectories and sensor measurements. Firstly, our deep learning model predicts correspondence probabilities and reliabilities for every pair of a trajectory and sensor measurements. Secondly, our algorithm matches the trajectories and sensor measurements over time using the predicted probabilities and reliabilities. We developed a dataset with actual warehouse operations and demonstrated the method's effectiveness for real-world applications.
comment: 7 pages, 3 figures, accepted to IPIN 2025
☆ AgriGS-SLAM: Orchard Mapping Across Seasons via Multi-View Gaussian Splatting SLAM
Autonomous robots in orchards require real-time 3D scene understanding despite repetitive row geometry, seasonal appearance changes, and wind-driven foliage motion. We present AgriGS-SLAM, a Visual--LiDAR SLAM framework that couples direct LiDAR odometry and loop closures with multi-camera 3D Gaussian Splatting (3DGS) rendering. Batch rasterization across complementary viewpoints recovers orchard structure under occlusions, while a unified gradient-driven map lifecycle executed between keyframes preserves fine details and bounds memory. Pose refinement is guided by a probabilistic LiDAR-based depth consistency term, back-propagated through the camera projection to tighten geometry-appearance coupling. We deploy the system on a field platform in apple and pear orchards across dormancy, flowering, and harvesting, using a standardized trajectory protocol that evaluates both training-view and novel-view synthesis to reduce 3DGS overfitting in evaluation. Across seasons and sites, AgriGS-SLAM delivers sharper, more stable reconstructions and steadier trajectories than recent state-of-the-art 3DGS-SLAM baselines while maintaining real-time performance on-tractor. While demonstrated in orchard monitoring, the approach can be applied to other outdoor domains requiring robust multimodal perception.
☆ GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model? ICLR 2026
Image super-resolution(SR) is fundamental to many vision system-from surveillance and autonomy to document analysis and retail analytics-because recovering high-frequency details, especially scene-text, enables reliable downstream perception. Scene-text, i.e., text embedded in natural images such as signs, product labels, and storefronts, often carries the most actionable information; when characters are blurred or hallucinated, optical character recognition(OCR) and subsequent decisions fail even if the rest of the image appears sharp. Yet previous SR research has often been tuned to distortion (PSNR/SSIM) or learned perceptual metrics (LIPIS, MANIQA, CLIP-IQA, MUSIQ) that are largely insensitive to character-level errors. Furthermore, studies that do address text SR often focus on simplified benchmarks with isolated characters, overlooking the challenges of text within complex natural scenes. As a result, scene-text is effectively treated as generic texture. For SR to be effective in practical deployments, it is therefore essential to explicitly optimize for both text legibility and perceptual quality. We present GLYPH-SR, a vision-language-guided diffusion framework that aims to achieve both objectives jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet(TS-ControlNet) guided by OCR data, and a ping-pong scheduler that alternates between text- and scene-centric guidance. To enable targeted text restoration, we train these components on a synthetic corpus while keeping the main SR branch frozen. Across SVT, SCUT-CTW1500, and CUTE80 at x4, and x8, GLYPH-SR improves OCR F1 by up to +15.18 percentage points over diffusion/GAN baseline (SVT x8, OpenOCR) while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ. GLYPH-SR is designed to satisfy both objectives simultaneously-high readability and high visual realism-delivering SR that looks right and reds right.
comment: 11 pages, 6 figures. Includes supplementary material. Under review as a conference paper at ICLR 2026
☆ A Hybrid Framework Bridging CNN and ViT based on Theory of Evidence for Diabetic Retinopathy Grading
Diabetic retinopathy (DR) is a leading cause of vision loss among middle-aged and elderly people, which significantly impacts their daily lives and mental health. To improve the efficiency of clinical screening and enable the early detection of DR, a variety of automated DR diagnosis systems have been recently established based on convolutional neural network (CNN) or vision Transformer (ViT). However, due to the own shortages of CNN / ViT, the performance of existing methods using single-type backbone has reached a bottleneck. One potential way for the further improvements is integrating different kinds of backbones, which can fully leverage the respective strengths of them (\emph{i.e.,} the local feature extraction capability of CNN and the global feature capturing ability of ViT). To this end, we propose a novel paradigm to effectively fuse the features extracted by different backbones based on the theory of evidence. Specifically, the proposed evidential fusion paradigm transforms the features from different backbones into supporting evidences via a set of deep evidential networks. With the supporting evidences, the aggregated opinion can be accordingly formed, which can be used to adaptively tune the fusion pattern between different backbones and accordingly boost the performance of our hybrid model. We evaluated our method on two publicly available DR grading datasets. The experimental results demonstrate that our hybrid model not only improves the accuracy of DR grading, compared to the state-of-the-art frameworks, but also provides the excellent interpretability for feature fusion and decision-making.
☆ Exploring the correlation between the type of music and the emotions evoked: A study using subjective questionnaires and EEG
The subject of this work is to check how different types of music affect human emotions. While listening to music, a subjective survey and brain activity measurements were carried out using an EEG helmet. The aim is to demonstrate the impact of different music genres on emotions. The research involved a diverse group of participants of different gender and musical preferences. This had the effect of capturing a wide range of emotional responses to music. After the experiment, a relationship analysis of the respondents' questionnaires with EEG signals was performed. The analysis revealed connections between emotions and observed brain activity.
comment: Published at IWAIPR 2025 conference
☆ Towards Realistic Earth-Observation Constellation Scheduling: Benchmark and Methodology
Agile Earth Observation Satellites (AEOSs) constellations offer unprecedented flexibility for monitoring the Earth's surface, but their scheduling remains challenging under large-scale scenarios, dynamic environments, and stringent constraints. Existing methods often simplify these complexities, limiting their real-world performance. We address this gap with a unified framework integrating a standardized benchmark suite and a novel scheduling model. Our benchmark suite, AEOS-Bench, contains $3,907$ finely tuned satellite assets and $16,410$ scenarios. Each scenario features $1$ to $50$ satellites and $50$ to $300$ imaging tasks. These scenarios are generated via a high-fidelity simulation platform, ensuring realistic satellite behavior such as orbital dynamics and resource constraints. Ground truth scheduling annotations are provided for each scenario. To our knowledge, AEOS-Bench is the first large-scale benchmark suite tailored for realistic constellation scheduling. Building upon this benchmark, we introduce AEOS-Former, a Transformer-based scheduling model that incorporates a constraint-aware attention mechanism. A dedicated internal constraint module explicitly models the physical and operational limits of each satellite. Through simulation-based iterative learning, AEOS-Former adapts to diverse scenarios, offering a robust solution for AEOS constellation scheduling. Experimental results demonstrate that AEOS-Former outperforms baseline models in task completion and energy efficiency, with ablation studies highlighting the contribution of each component. Code and data are provided in https://github.com/buaa-colalab/AEOSBench.
☆ Leveraging Large-Scale Face Datasets for Deep Periocular Recognition via Ocular Cropping
We focus on ocular biometrics, specifically the periocular region (the area around the eye), which offers high discrimination and minimal acquisition constraints. We evaluate three Convolutional Neural Network architectures of varying depth and complexity to assess their effectiveness for periocular recognition. The networks are trained on 1,907,572 ocular crops extracted from the large-scale VGGFace2 database. This significantly contrasts with existing works, which typically rely on small-scale periocular datasets for training having only a few thousand images. Experiments are conducted with ocular images from VGGFace2-Pose, a subset of VGGFace2 containing in-the-wild face images, and the UFPR-Periocular database, which consists of selfies captured via mobile devices with user guidance on the screen. Due to the uncontrolled conditions of VGGFace2, the Equal Error Rates (EERs) obtained with ocular crops range from 9-15%, noticeably higher than the 3-6% EERs achieved using full-face images. In contrast, UFPR-Periocular yields significantly better performance (EERs of 1-2%), thanks to higher image quality and more consistent acquisition protocols. To the best of our knowledge, these are the lowest reported EERs on the UFPR dataset to date.
comment: Published at IWAIPR 2025 conference
☆ Beyond Imitation: Constraint-Aware Trajectory Generation with Flow Matching For End-to-End Autonomous Driving
Planning is a critical component of end-to-end autonomous driving. However, prevailing imitation learning methods often suffer from mode collapse, failing to produce diverse trajectory hypotheses. Meanwhile, existing generative approaches struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. To address these limitations, we propose CATG, a novel planning framework that leverages Constrained Flow Matching. Concretely, CATG explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our primary contribution is the novel imposition of explicit constraints directly within the flow matching process, ensuring that the generated trajectories adhere to vital safety and kinematic rules. Secondly, CATG parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Notably, on the NavSim v2 challenge, CATG achieved 2nd place with an EPDMS score of 51.31 and was honored with the Innovation Award.
☆ Exploring Complementarity and Explainability in CNNs for Periocular Verification Across Acquisition Distances
We study the complementarity of different CNNs for periocular verification at different distances on the UBIPr database. We train three architectures of increasing complexity (SqueezeNet, MobileNetv2, and ResNet50) on a large set of eye crops from VGGFace2. We analyse performance with cosine and chi2 metrics, compare different network initialisations, and apply score-level fusion via logistic regression. In addition, we use LIME heatmaps and Jensen-Shannon divergence to compare attention patterns of the CNNs. While ResNet50 consistently performs best individually, the fusion provides substantial gains, especially when combining all three networks. Heatmaps show that networks usually focus on distinct regions of a given image, which explains their complementarity. Our method significantly outperforms previous works on UBIPr, achieving a new state-of-the-art.
comment: Accepted at BIOSIG 2025 conference
☆ Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws NeurIPS 2025
Existing infrared and visible image fusion methods often face the dilemma of balancing modal information. Generative fusion methods reconstruct fused images by learning from data distributions, but their generative capabilities remain limited. Moreover, the lack of interpretability in modal information selection further affects the reliability and consistency of fusion results in complex scenarios. This manuscript revisits the essence of generative image fusion under the inspiration of human cognitive laws and proposes a novel infrared and visible image fusion method, termed HCLFuse. First, HCLFuse investigates the quantification theory of information mapping in unsupervised fusion networks, which leads to the design of a multi-scale mask-regulated variational bottleneck encoder. This encoder applies posterior probability modeling and information decomposition to extract accurate and concise low-level modal information, thereby supporting the generation of high-fidelity structural details. Furthermore, the probabilistic generative capability of the diffusion model is integrated with physical laws, forming a time-varying physical guidance mechanism that adaptively regulates the generation process at different stages, thereby enhancing the ability of the model to perceive the intrinsic structure of data and reducing dependence on data quality. Experimental results show that the proposed method achieves state-of-the-art fusion performance in qualitative and quantitative evaluations across multiple datasets and significantly improves semantic segmentation metrics. This fully demonstrates the advantages of this generative image fusion method, drawing inspiration from human cognition, in enhancing structural consistency and detail quality.
comment: NeurIPS 2025 spotlight
☆ Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT)-whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
comment: 10 pages
☆ OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation
Document AI has advanced rapidly and is attracting increasing attention. Yet, while most efforts have focused on document layout analysis (DLA), its generative counterpart, document layout generation, remains underexplored. A major obstacle lies in the scarcity of diverse layouts: academic papers with Manhattan-style structures dominate existing studies, while open-world genres such as newspapers and magazines remain severely underrepresented. To address this gap, we curate OmniLayout-1M, the first million-scale dataset of diverse document layouts, covering six common document types and comprising contemporary layouts collected from multiple sources. Moreover, since existing methods struggle in complex domains and often fail to arrange long sequences coherently, we introduce OmniLayout-LLM, a 0.5B model with designed two-stage Coarse-to-Fine learning paradigm: 1) learning universal layout principles from OmniLayout-1M with coarse category definitions, and 2) transferring the knowledge to a specific domain with fine-grained annotations. Extensive experiments demonstrate that our approach achieves strong performance on multiple domains in M$^{6}$Doc dataset, substantially surpassing both existing layout generation experts and several latest general-purpose LLMs. Our code, models, and dataset will be publicly released.
comment: TL;DR: With OmniLayout-1M dataset and LLM-based coarse-to-fine learning, we enable universal and diverse document layout generation
☆ Developing a Multi-task Ensemble Geometric Deep Network for Supply Chain Sustainability and Risk Management
The sustainability of supply chain plays a key role in achieving optimal performance in controlling the supply chain. The management of risks that occur in a supply chain is a fundamental problem for the purpose of developing the sustainability of the network and elevating the performance efficiency of the supply chain. The correct classification of products is another essential element in a sustainable supply chain. Acknowledging recent breakthroughs in the context of deep networks, several architectural options have been deployed to analyze supply chain datasets. A novel geometric deep network is used to propose an ensemble deep network. The proposed Chebyshev ensemble geometric network (Ch-EGN) is a hybrid convolutional and geometric deep learning. This network is proposed to leverage the information dependencies in supply chain to derive invisible states of samples in the database. The functionality of the proposed deep network is assessed on the two different databases. The SupplyGraph Dataset and DataCo are considered in this research. The prediction of delivery status of DataCo supply chain is done for risk administration. The product classification and edge classification are performed using the SupplyGraph database to enhance the sustainability of the supply network. An average accuracy of 98.95% is obtained for the ensemble network for risk management. The average accuracy of 100% and 98.07% are obtained for sustainable supply chain in terms of 5 product group classification and 4 product relation classification, respectively. The average accuracy of 92.37% is attained for 25 company relation classification. The results confirm an average improvement and efficiency of the proposed method compared to the state-of-the-art approaches.
☆ Sketch2PoseNet: Efficient and Generalized Sketch to 3D Human Pose Prediction SIGGRAPH
3D human pose estimation from sketches has broad applications in computer animation and film production. Unlike traditional human pose estimation, this task presents unique challenges due to the abstract and disproportionate nature of sketches. Previous sketch-to-pose methods, constrained by the lack of large-scale sketch-3D pose annotations, primarily relied on optimization with heuristic rules-an approach that is both time-consuming and limited in generalizability. To address these challenges, we propose a novel approach leveraging a "learn from synthesis" strategy. First, a diffusion model is trained to synthesize sketch images from 2D poses projected from 3D human poses, mimicking disproportionate human structures in sketches. This process enables the creation of a synthetic dataset, SKEP-120K, consisting of 120k accurate sketch-3D pose annotation pairs across various sketch styles. Building on this synthetic dataset, we introduce an end-to-end data-driven framework for estimating human poses and shapes from diverse sketch styles. Our framework combines existing 2D pose detectors and generative diffusion priors for sketch feature extraction with a feed-forward neural network for efficient 2D pose estimation. Multiple heuristic loss functions are incorporated to guarantee geometric coherence between the derived 3D poses and the detected 2D poses while preserving accurate self-contacts. Qualitative, quantitative, and subjective evaluations collectively show that our model substantially surpasses previous ones in both estimation accuracy and speed for sketch-to-pose tasks.
comment: SIGGRAPH Asia 2025
☆ ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts NeurIPS 2025
Dataset bias, where data points are skewed to certain concepts, is ubiquitous in machine learning datasets. Yet, systematically identifying these biases is challenging without costly, fine-grained attribute annotations. We present ConceptScope, a scalable and automated framework for analyzing visual datasets by discovering and quantifying human-interpretable concepts using Sparse Autoencoders trained on representations from vision foundation models. ConceptScope categorizes concepts into target, context, and bias types based on their semantic relevance and statistical correlation to class labels, enabling class-level dataset characterization, bias identification, and robustness evaluation through concept-based subgrouping. We validate that ConceptScope captures a wide range of visual concepts, including objects, textures, backgrounds, facial attributes, emotions, and actions, through comparisons with annotated datasets. Furthermore, we show that concept activations produce spatial attributions that align with semantically meaningful image regions. ConceptScope reliably detects known biases (e.g., background bias in Waterbirds) and uncovers previously unannotated ones (e.g, co-occurring objects in ImageNet), offering a practical tool for dataset auditing and model diagnostics.
comment: Published in the Thirty-Ninth Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ MoTDiff: High-resolution Motion Trajectory estimation from a single blurred image using Diffusion models
Accurate estimation of motion information is crucial in diverse computational imaging and computer vision applications. Researchers have investigated various methods to extract motion information from a single blurred image, including blur kernels and optical flow. However, existing motion representations are often of low quality, i.e., coarse-grained and inaccurate. In this paper, we propose the first high-resolution (HR) Motion Trajectory estimation framework using Diffusion models (MoTDiff). Different from existing motion representations, we aim to estimate an HR motion trajectory with high-quality from a single motion-blurred image. The proposed MoTDiff consists of two key components: 1) a new conditional diffusion framework that uses multi-scale feature maps extracted from a single blurred image as a condition, and 2) a new training method that can promote precise identification of a fine-grained motion trajectory, consistent estimation of overall shape and position of a motion path, and pixel connectivity along a motion trajectory. Our experiments demonstrate that the proposed MoTDiff can outperform state-of-the-art methods in both blind image deblurring and coded exposure photography applications.
comment: 10 pages, 6 figures
☆ Self-localization on a 3D map by fusing global and local features from a monocular camera
Self-localization on a 3D map by using an inexpensive monocular camera is required to realize autonomous driving. Self-localization based on a camera often uses a convolutional neural network (CNN) that can extract local features that are calculated by nearby pixels. However, when dynamic obstacles, such as people, are present, CNN does not work well. This study proposes a new method combining CNN with Vision Transformer, which excels at extracting global features that show the relationship of patches on whole image. Experimental results showed that, compared to the state-of-the-art method (SOTA), the accuracy improvement rate in a CG dataset with dynamic obstacles is 1.5 times higher than that without dynamic obstacles. Moreover, the self-localization error of our method is 20.1% smaller than that of SOTA on public datasets. Additionally, our robot using our method can localize itself with 7.51cm error on average, which is more accurate than SOTA.
☆ CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.
☆ Detecting Unauthorized Vehicles using Deep Learning for Smart Cities: A Case Study on Bangladesh
Modes of transportation vary across countries depending on geographical location and cultural context. In South Asian countries rickshaws are among the most common means of local transport. Based on their mode of operation, rickshaws in cities across Bangladesh can be broadly classified into non-auto (pedal-powered) and auto-rickshaws (motorized). Monitoring the movement of auto-rickshaws is necessary as traffic rules often restrict auto-rickshaws from accessing certain routes. However, existing surveillance systems make it quite difficult to monitor them due to their similarity to other vehicles, especially non-auto rickshaws whereas manual video analysis is too time-consuming. This paper presents a machine learning-based approach to automatically detect auto-rickshaws in traffic images. In this system, we used real-time object detection using the YOLOv8 model. For training purposes, we prepared a set of 1,730 annotated images that were captured under various traffic conditions. The results show that our proposed model performs well in real-time auto-rickshaw detection and offers an mAP50 of 83.447% and binary precision and recall values above 78%, demonstrating its effectiveness in handling both dense and sparse traffic scenarios. The dataset has been publicly released for further research.
comment: 16 pages
☆ MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction ICCV 2025
Large annotated datasets are essential for training robust Computer-Aided Diagnosis (CAD) models for breast cancer detection or risk prediction. However, acquiring such datasets with fine-detailed annotation is both costly and time-consuming. Vision-Language Models (VLMs), such as CLIP, which are pre-trained on large image-text pairs, offer a promising solution by enhancing robustness and data efficiency in medical imaging tasks. This paper introduces a novel Multi-View Mammography and Language Model for breast cancer classification and risk prediction, trained on a dataset of paired mammogram images and synthetic radiology reports. Our MV-MLM leverages multi-view supervision to learn rich representations from extensive radiology data by employing cross-modal self-supervision across image-text pairs. This includes multiple views and the corresponding pseudo-radiology reports. We propose a novel joint visual-textual learning strategy to enhance generalization and accuracy performance over different data types and tasks to distinguish breast tissues or cancer characteristics(calcification, mass) and utilize these patterns to understand mammography images and predict cancer risk. We evaluated our method on both private and publicly available datasets, demonstrating that the proposed model achieves state-of-the-art performance in three classification tasks: (1) malignancy classification, (2) subtype classification, and (3) image-based cancer risk prediction. Furthermore, the model exhibits strong data efficiency, outperforming existing fully supervised or VLM baselines while trained on synthetic text reports and without the need for actual radiology reports.
comment: Accepted to Computer Vision for Automated Medical Diagnosis (CVAMD) Workshop at ICCV 2025
☆ BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation
Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we propose a strong baseline BasicAVSR for AVSR by integrating four key components: 1) adaptive multi-scale frequency priors generated from image Laplacian pyramids, 2) a flow-guided propagation unit to aggregate spatiotemporal information from adjacent frames, 3) a second-order motion compensation unit for more accurate spatial alignment of adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and content-independent upsampling kernels. To meet diverse application demands, we instantiate three propagation variants: (i) a unidirectional RNN unit for strictly online inference, (ii) a unidirectional RNN unit empowered with a limited lookahead that tolerates a small output delay, and (iii) a bidirectional RNN unit designed for offline tasks where computational resources are less constrained. Experimental results demonstrate the effectiveness and adaptability of our model across these different scenarios. Through extensive experiments, we show that BasicAVSR significantly outperforms existing methods in terms of super-resolution quality, generalization ability, and inference speed. Our work not only advances the state-of-the-art in AVSR but also extends its core components to multiple frameworks for diverse scenarios. The code is available at https://github.com/shangwei5/BasicAVSR.
comment: 13 pages, 10 figures, 5 tables
☆ StructLayoutFormer:Conditional Structured Layout Generation via Structure Serialization and Disentanglement
Structured layouts are preferable in many 2D visual contents (\eg, GUIs, webpages) since the structural information allows convenient layout editing. Computational frameworks can help create structured layouts but require heavy labor input. Existing data-driven approaches are effective in automatically generating fixed layouts but fail to produce layout structures. We present StructLayoutFormer, a novel Transformer-based approach for conditional structured layout generation. We use a structure serialization scheme to represent structured layouts as sequences. To better control the structures of generated layouts, we disentangle the structural information from the element placements. Our approach is the first data-driven approach that achieves conditional structured layout generation and produces realistic layout structures explicitly. We compare our approach with existing data-driven layout generation approaches by including post-processing for structure extraction. Extensive experiments have shown that our approach exceeds these baselines in conditional structured layout generation. We also demonstrate that our approach is effective in extracting and transferring layout structures. The code is publicly available at %\href{https://github.com/Teagrus/StructLayoutFormer} {https://github.com/Teagrus/StructLayoutFormer}.
☆ FullPart: Generating each 3D Part at Full Resolution
Part-based 3D generation holds great potential for various applications. Previous part generators that represent parts using implicit vector-set tokens often suffer from insufficient geometric details. Another line of work adopts an explicit voxel representation but shares a global voxel grid among all parts; this often causes small parts to occupy too few voxels, leading to degraded quality. In this paper, we propose FullPart, a novel framework that combines both implicit and explicit paradigms. It first derives the bounding box layout through an implicit box vector-set diffusion process, a task that implicit diffusion handles effectively since box tokens contain little geometric detail. Then, it generates detailed parts, each within its own fixed full-resolution voxel grid. Instead of sharing a global low-resolution space, each part in our method - even small ones - is generated at full resolution, enabling the synthesis of intricate details. We further introduce a center-point encoding strategy to address the misalignment issue when exchanging information between parts of different actual sizes, thereby maintaining global coherence. Moreover, to tackle the scarcity of reliable part data, we present PartVerse-XL, the largest human-annotated 3D part dataset to date with 40K objects and 320K parts. Extensive experiments demonstrate that FullPart achieves state-of-the-art results in 3D part generation. We will release all code, data, and model to benefit future research in 3D part generation.
comment: Project page: https://fullpart3d.github.io
☆ Exploring Object-Aware Attention Guided Frame Association for RGB-D SLAM
Attention models have recently emerged as a powerful approach, demonstrating significant progress in various fields. Visualization techniques, such as class activation mapping, provide visual insights into the reasoning of convolutional neural networks (CNNs). Using network gradients, it is possible to identify regions where the network pays attention during image recognition tasks. Furthermore, these gradients can be combined with CNN features to localize more generalizable, task-specific attentive (salient) regions within scenes. However, explicit use of this gradient-based attention information integrated directly into CNN representations for semantic object understanding remains limited. Such integration is particularly beneficial for visual tasks like simultaneous localization and mapping (SLAM), where CNN representations enriched with spatially attentive object locations can enhance performance. In this work, we propose utilizing task-specific network attention for RGB-D indoor SLAM. Specifically, we integrate layer-wise attention information derived from network gradients with CNN feature representations to improve frame association performance. Experimental results indicate improved performance compared to baseline methods, particularly for large environments.
comment: double-column 5 pages, 3 figures
☆ WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
Vision-based end-to-end (E2E) driving has garnered significant interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios, failing to adequately test the true potential of these systems. Furthermore, existing open-loop evaluation metrics often fall short in capturing the multi-modal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that that are rare in daily life with an occurring frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes the high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate the E2E driving performance on these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional metrics that measure the distance between predicted way points and the logs, RFS measures how closely the predicted trajectory matches rater-annotated trajectory preference labels. We have released rater preference labels for all WOD-E2E validation set segments, while the held out test set labels have been used for the 2025 WOD-E2E Challenge. Through our work, we aim to foster state of the art research into generalizable, robust, and safe end-to-end autonomous driving agents capable of handling complex real-world situations.
☆ JOGS: Joint Optimization of Pose Estimation and 3D Gaussian Splatting
Traditional novel view synthesis methods heavily rely on external camera pose estimation tools such as COLMAP, which often introduce computational bottlenecks and propagate errors. To address these challenges, we propose a unified framework that jointly optimizes 3D Gaussian points and camera poses without requiring pre-calibrated inputs. Our approach iteratively refines 3D Gaussian parameters and updates camera poses through a novel co-optimization strategy, ensuring simultaneous improvements in scene reconstruction fidelity and pose accuracy. The key innovation lies in decoupling the joint optimization into two interleaved phases: first, updating 3D Gaussian parameters via differentiable rendering with fixed poses, and second, refining camera poses using a customized 3D optical flow algorithm that incorporates geometric and photometric constraints. This formulation progressively reduces projection errors, particularly in challenging scenarios with large viewpoint variations and sparse feature distributions, where traditional methods struggle. Extensive evaluations on multiple datasets demonstrate that our approach significantly outperforms existing COLMAP-free techniques in reconstruction quality, and also surpasses the standard COLMAP-based baseline in general.
☆ OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research
As one of the earliest writing systems, Oracle Bone Script (OBS) preserves the cultural and intellectual heritage of ancient civilizations. However, current OBS research faces two major challenges: (1) the interpretation of OBS involves a complex workflow comprising multiple serial and parallel sub-tasks, and (2) the efficiency of OBS information organization and retrieval remains a critical bottleneck, as scholars often spend substantial effort searching for, compiling, and managing relevant resources. To address these challenges, we present OracleAgent, the first agent system designed for the structured management and retrieval of OBS-related information. OracleAgent seamlessly integrates multiple OBS analysis tools, empowered by large language models (LLMs), and can flexibly orchestrate these components. Additionally, we construct a comprehensive domain-specific multimodal knowledge base for OBS, which is built through a rigorous multi-year process of data collection, cleaning, and expert annotation. The knowledge base comprises over 1.4M single-character rubbing images and 80K interpretation texts. OracleAgent leverages this resource through its multimodal tools to assist experts in retrieval tasks of character, document, interpretation text, and rubbing image. Extensive experiments demonstrate that OracleAgent achieves superior performance across a range of multimodal reasoning and generation tasks, surpassing leading mainstream multimodal large language models (MLLMs) (e.g., GPT-4o). Furthermore, our case study illustrates that OracleAgent can effectively assist domain experts, significantly reducing the time cost of OBS research. These results highlight OracleAgent as a significant step toward the practical deployment of OBS-assisted research and automated interpretation systems.
☆ EgoExo-Con: Exploring View-Invariant Video Temporal Understanding
Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performances. (2) When naively finetuned with synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. For improvements, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates its superiority over naive SFT and GRPO, especially for improving cross-view consistency. All resources will be made publicly available.
comment: project page: \url{https://minjoong507.github.io/projects/EgoExo-Con/}
☆ Security Risk of Misalignment between Text and Image in Multi-modal Model
Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigations reveal that the alignment between textual and Image modalities in existing diffusion models is inadequate. This misalignment presents significant risks, especially in the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To this end, we propose a novel attack called Prompt-Restricted Multi-modal Attack (PReMA) to manipulate the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs by solely creating adversarial images, distinguishing itself from prior methods that primarily generate adversarial prompts to produce NSFW content. Consequently, PReMA poses a novel threat to the integrity of multi-modal diffusion models, particularly in image-editing applications that operate with fixed prompts. Comprehensive evaluations conducted on image inpainting and style transfer tasks across various models confirm the potent efficacy of PReMA.
☆ Dynamic VLM-Guided Negative Prompting for Diffusion Models NeurIPS 2025
We propose a novel approach for dynamic negative prompting in diffusion models that leverages Vision-Language Models (VLMs) to adaptively generate negative prompts during the denoising process. Unlike traditional Negative Prompting methods that use fixed negative prompts, our method generates intermediate image predictions at specific denoising steps and queries a VLM to produce contextually appropriate negative prompts. We evaluate our approach on various benchmark datasets and demonstrate the trade-offs between negative guidance strength and text-image alignment.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: The First Workshop on Generative and Protective AI for Content Creation
☆ FlexICL: A Flexible Visual In-context Learning Framework for Elbow and Wrist Ultrasound Segmentation
Elbow and wrist fractures are the most common fractures in pediatric populations. Automatic segmentation of musculoskeletal structures in ultrasound (US) can improve diagnostic accuracy and treatment planning. Fractures appear as cortical defects but require expert interpretation. Deep learning (DL) can provide real-time feedback and highlight key structures, helping lightly trained users perform exams more confidently. However, pixel-wise expert annotations for training remain time-consuming and costly. To address this challenge, we propose FlexICL, a novel and flexible in-context learning (ICL) framework for segmenting bony regions in US images. We apply it to an intra-video segmentation setting, where experts annotate only a small subset of frames, and the model segments unseen frames. We systematically investigate various image concatenation techniques and training strategies for visual ICL and introduce novel concatenation methods that significantly enhance model performance with limited labeled data. By integrating multiple augmentation strategies, FlexICL achieves robust segmentation performance across four wrist and elbow US datasets while requiring only 5% of the training images. It outperforms state-of-the-art visual ICL models like Painter, MAE-VQGAN, and conventional segmentation models like U-Net and TransUNet by 1-27% Dice coefficient on 1,252 US sweeps. These initial results highlight the potential of FlexICL as an efficient and scalable solution for US image segmentation well suited for medical imaging use cases where labeled data is scarce.
☆ Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
Knowledge distillation (KD) is an effective method for model compression and transferring knowledge between models. However, its effect on model's robustness against spurious correlations that degrade performance on out-of-distribution data remains underexplored. This study investigates the effect of knowledge distillation on the transferability of ``debiasing'' capabilities from teacher models to student models on natural language inference (NLI) and image classification tasks. Through extensive experiments, we illustrate several key findings: (i) overall the debiasing capability of a model is undermined post-KD; (ii) training a debiased model does not benefit from injecting teacher knowledge; (iii) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases; and (iv) we pin-point the internal attention pattern and circuit that causes the distinct behavior post-KD. Given the above findings, we propose three effective solutions to improve the distillability of debiasing methods: developing high quality data for augmentation, implementing iterative knowledge distillation, and initializing student models with weights obtained from teacher models. To the best of our knowledge, this is the first study on the effect of KD on debiasing and its interenal mechanism at scale. Our findings provide understandings on how KD works and how to design better debiasing methods.
☆ AD-SAM: Fine-Tuning the Segment Anything Vision Foundation Model for Autonomous Driving Perception
This paper presents the Autonomous Driving Segment Anything Model (AD-SAM), a fine-tuned vision foundation model for semantic segmentation in autonomous driving (AD). AD-SAM extends the Segment Anything Model (SAM) with a dual-encoder and deformable decoder tailored to spatial and geometric complexity of road scenes. The dual-encoder produces multi-scale fused representations by combining global semantic context from SAM's pretrained Vision Transformer (ViT-H) with local spatial detail from a trainable convolutional deep learning backbone (i.e., ResNet-50). A deformable fusion module aligns heterogeneous features across scales and object geometries. The decoder performs progressive multi-stage refinement using deformable attention. Training is guided by a hybrid loss that integrates Focal, Dice, Lovasz-Softmax, and Surface losses, improving semantic class balance, boundary precision, and optimization stability. Experiments on the Cityscapes and Berkeley DeepDrive 100K (BDD100K) benchmarks show that AD-SAM surpasses SAM, Generalized SAM (G-SAM), and a deep learning baseline (DeepLabV3) in segmentation accuracy. It achieves 68.1 mean Intersection over Union (mIoU) on Cityscapes and 59.5 mIoU on BDD100K, outperforming SAM, G-SAM, and DeepLabV3 by margins of up to +22.9 and +19.2 mIoU in structured and diverse road scenes, respectively. AD-SAM demonstrates strong cross-domain generalization with a 0.87 retention score (vs. 0.76 for SAM), and faster, more stable learning dynamics, converging within 30-40 epochs, enjoying double the learning speed of benchmark models. It maintains 0.607 mIoU with only 1000 samples, suggesting data efficiency critical for reducing annotation costs. These results confirm that targeted architectural and optimization enhancements to foundation models enable reliable and scalable AD perception.
comment: Submitted to IEEE Transactions on Intelligent Transportation Systems (IEEE T-ITS)
☆ A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics
Visual reasoning, particularly spatial reasoning, is a challenging cognitive task that requires understanding object relationships and their interactions within complex environments, especially in robotics domain. Existing vision_language models (VLMs) excel at perception tasks but struggle with fine-grained spatial reasoning due to their implicit, correlation-driven reasoning and reliance solely on images. We propose a novel neuro_symbolic framework that integrates both panoramic-image and 3D point cloud information, combining neural perception with symbolic reasoning to explicitly model spatial and logical relationships. Our framework consists of a perception module for detecting entities and extracting attributes, and a reasoning module that constructs a structured scene graph to support precise, interpretable queries. Evaluated on the JRDB-Reasoning dataset, our approach demonstrates superior performance and reliability in crowded, human_built environments while maintaining a lightweight design suitable for robotics and embodied AI applications.
☆ VitalLens 2.0: High-Fidelity rPPG for Heart Rate Variability Estimation from Face Video
This report introduces VitalLens 2.0, a new deep learning model for estimating physiological signals from face video. This new model demonstrates a significant leap in accuracy for remote photoplethysmography (rPPG), enabling the robust estimation of not only heart rate (HR) and respiratory rate (RR) but also Heart Rate Variability (HRV) metrics. This advance is achieved through a combination of a new model architecture and a substantial increase in the size and diversity of our training data, now totaling 1,413 unique individuals. We evaluate VitalLens 2.0 on a new, combined test set of 422 unique individuals from four public and private datasets. When averaging results by individual, VitalLens 2.0 achieves a Mean Absolute Error (MAE) of 1.57 bpm for HR, 1.08 bpm for RR, 10.18 ms for HRV-SDNN, and 16.45 ms for HRV-RMSSD. These results represent a new state-of-the-art, significantly outperforming previous methods. This model is now available to developers via the VitalLens API at https://rouast.com/api.
comment: Technical Report. 8 pages, 5 figures. Introduces the VitalLens 2.0 model for rPPG and Heart Rate Variability (HRV) estimation. Project website: https://rouast.com/api
☆ Incremental Human-Object Interaction Detection with Invariant Relation Representation Learning
In open-world environments, human-object interactions (HOIs) evolve continuously, challenging conventional closed-world HOI detection models. Inspired by humans' ability to progressively acquire knowledge, we explore incremental HOI detection (IHOID) to develop agents capable of discerning human-object relations in such dynamic environments. This setup confronts not only the common issue of catastrophic forgetting in incremental learning but also distinct challenges posed by interaction drift and detecting zero-shot HOI combinations with sequentially arriving data. Therefore, we propose a novel exemplar-free incremental relation distillation (IRD) framework. IRD decouples the learning of objects and relations, and introduces two unique distillation losses for learning invariant relation features across different HOI combinations that share the same relation. Extensive experiments on HICO-DET and V-COCO datasets demonstrate the superiority of our method over state-of-the-art baselines in mitigating forgetting, strengthening robustness against interaction drift, and generalization on zero-shot HOIs. Code is available at \href{https://github.com/weiyana/ContinualHOI}{this HTTP URL}
☆ MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation
In this study, we propose MoME, a Mixture of Visual Language Medical Experts, for Medical Image Segmentation. MoME adapts the successful Mixture of Experts (MoE) paradigm, widely used in Large Language Models (LLMs), for medical vision-language tasks. The architecture enables dynamic expert selection by effectively utilizing multi-scale visual features tailored to the intricacies of medical imagery, enriched with textual embeddings. This work explores a novel integration of vision-language models for this domain. Utilizing an assembly of 10 datasets, encompassing 3,410 CT scans, MoME demonstrates strong performance on a comprehensive medical imaging segmentation benchmark. Our approach explores the integration of foundation models for medical imaging, benefiting from the established efficacy of MoE in boosting model performance by incorporating textual information. Demonstrating competitive precision across multiple datasets, MoME explores a novel architecture for achieving robust results in medical image analysis.
☆ Semantic Frame Aggregation-based Transformer for Live Video Comment Generation
Live commenting on video streams has surged in popularity on platforms like Twitch, enhancing viewer engagement through dynamic interactions. However, automatically generating contextually appropriate comments remains a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook an important aspect of prioritizing video frames that are most relevant to ongoing viewer interactions. This prioritization is crucial for producing contextually appropriate comments. To address this gap, we introduce a novel Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. This method not only leverages CLIP's visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to ongoing viewer conversation. It employs an efficient weighted sum of frames technique to emphasize informative frames while focusing less on irrelevant ones. Finally, our comment decoder with a cross-attention mechanism that attends to each modality ensures that the generated comment reflects contextual cues from both chats and video. Furthermore, to address the limitations of existing datasets, which predominantly focus on Chinese-language content with limited video categories, we have constructed a large scale, diverse, multimodal English video comments dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of our SFAT model by comparing it to existing methods for generating comments from live video and ongoing dialogue contexts.
☆ Using Salient Object Detection to Identify Manipulative Cookie Banners that Circumvent GDPR AAAI
The main goal of this paper is to study how often cookie banners that comply with the General Data Protection Regulation (GDPR) contain aesthetic manipulation, a design tactic to draw users' attention to the button that permits personal data sharing. As a byproduct of this goal, we also evaluate how frequently the banners comply with GDPR and the recommendations of national data protection authorities regarding banner designs. We visited 2,579 websites and identified the type of cookie banner implemented. Although 45% of the relevant websites have fully compliant banners, we found aesthetic manipulation on 38% of the compliant banners. Unlike prior studies of aesthetic manipulation, we use a computer vision model for salient object detection to measure how salient (i.e., attention-drawing) each banner element is. This enables the discovery of new types of aesthetic manipulation (e.g., button placement), and leads us to conclude that aesthetic manipulation is more common than previously reported (38% vs 27% of banners). To study the effects of user and/or website location on cookie banner design, we include websites within the European Union (EU), where privacy regulation enforcement is more stringent, and websites outside the EU. We visited websites from IP addresses in the EU and from IP addresses in the United States (US). We find that 13.9% of EU websites change their banner design when the user is from the US, and EU websites are roughly 48.3% more likely to use aesthetic manipulation than non-EU websites, highlighting their innovative responses to privacy regulation.
comment: Accepted to International AAAI Conference on Web and Social Media 2026 (ICWSM'26)
☆ SYNAPSE-Net: A Unified Framework with Lesion-Aware Hierarchical Gating for Robust Segmentation of Heterogeneous Brain Lesions
Automated segmentation of heterogeneous brain lesions from multi-modal MRI remains a critical challenge in clinical neuroimaging. Current deep learning models are typically specialized `point solutions' that lack generalization and high performance variance, limiting their clinical reliability. To address these gaps, we propose the Unified Multi-Stream SYNAPSE-Net, an adaptive framework designed for both generalization and robustness. The framework is built on a novel hybrid architecture integrating multi-stream CNN encoders, a Swin Transformer bottleneck for global context, a dynamic cross-modal attention fusion (CMAF) mechanism, and a hierarchical gated decoder for high-fidelity mask reconstruction. The architecture is trained with a variance reduction strategy that combines pathology specific data augmentation and difficulty-aware sampling method. The model was evaluated on three different challenging public datasets: the MICCAI 2017 WMH Challenge, the ISLES 2022 Challenge, and the BraTS 2020 Challenge. Our framework attained a state-of-the-art DSC value of 0.831 with the HD95 value of 3.03 in the WMH dataset. For ISLES 2022, it achieved the best boundary accuracy with a statistically significant difference (HD95 value of 9.69). For BraTS 2020, it reached the highest DSC value for the tumor core region (0.8651). These experimental findings suggest that our unified adaptive framework achieves state-of-the-art performance across multiple brain pathologies, providing a robust and clinically feasible solution for automated segmentation. The source code and the pre-trained models are available at https://github.com/mubid-01/SYNAPSE-Net-pre.
comment: 17 pages, 10 figures, 8 tables, submitted to "Medical Image Analysis" journal
☆ Scale-Aware Curriculum Learning for Ddata-Efficient Lung Nodule Detection with YOLOv11
Lung nodule detection in chest CT is crucial for early lung cancer diagnosis, yet existing deep learning approaches face challenges when deployed in clinical settings with limited annotated data. While curriculum learning has shown promise in improving model training, traditional static curriculum strategies fail in data-scarce scenarios. We propose Scale Adaptive Curriculum Learning (SACL), a novel training strategy that dynamically adjusts curriculum design based on available data scale. SACL introduces three key mechanisms:(1) adaptive epoch scheduling, (2) hard sample injection, and (3) scale-aware optimization. We evaluate SACL on the LUNA25 dataset using YOLOv11 as the base detector. Experimental results demonstrate that while SACL achieves comparable performance to static curriculum learning on the full dataset in mAP50, it shows significant advantages under data-limited conditions with 4.6%, 3.5%, and 2.0% improvements over baseline at 10%, 20%, and 50% of training data respectively. By enabling robust training across varying data scales without architectural modifications, SACL provides a practical solution for healthcare institutions to develop effective lung nodule detection systems despite limited annotation resources.
comment: 5 pages, 2 figures
☆ DC4GS: Directional Consistency-Driven Adaptive Density Control for 3D Gaussian Splatting NeurIPS 2025
We present a Directional Consistency (DC)-driven Adaptive Density Control (ADC) for 3D Gaussian Splatting (DC4GS). Whereas the conventional ADC bases its primitive splitting on the magnitudes of positional gradients, we further incorporate the DC of the gradients into ADC, and realize it through the angular coherence of the gradients. Our DC better captures local structural complexities in ADC, avoiding redundant splitting. When splitting is required, we again utilize the DC to define optimal split positions so that sub-primitives best align with the local structures than the conventional random placement. As a consequence, our DC4GS greatly reduces the number of primitives (up to 30% in our experiments) than the existing ADC, and also enhances reconstruction fidelity greatly.
comment: Accepted to NeurIPS 2025 / Project page: https://github.com/cgskku/dc4gs
☆ Generative diffusion modeling protocols for improving the Kikuchi pattern indexing in electron back-scatter diffraction
Electron back-scatter diffraction (EBSD) has traditionally relied upon methods such as the Hough transform and dictionary Indexing to interpret diffraction patterns and extract crystallographic orientation. However, these methods encounter significant limitations, particularly when operating at high scanning speeds, where the exposure time per pattern is decreased beyond the operating sensitivity of CCD camera. Hence the signal to noise ratio decreases for the observed pattern which makes the pattern noisy, leading to reduced indexing accuracy. This research work aims to develop generative machine learning models for the post-processing or on-the-fly processing of Kikuchi patterns which are capable of restoring noisy EBSD patterns obtained at high scan speeds. These restored patterns can be used for the determination of crystal orientations to provide reliable indexing results. We compare the performance of such generative models in enhancing the quality of patterns captured at short exposure times (high scan speeds). An interesting observation is that the methodology is not data-hungry as typical machine learning methods.
☆ PF-DAformer: Proximal Femur Segmentation via Domain Adaptive Transformer for Dual-Center QCT
Quantitative computed tomography (QCT) plays a crucial role in assessing bone strength and fracture risk by enabling volumetric analysis of bone density distribution in the proximal femur. However, deploying automated segmentation models in practice remains difficult because deep networks trained on one dataset often fail when applied to another. This failure stems from domain shift, where scanners, reconstruction settings, and patient demographics vary across institutions, leading to unstable predictions and unreliable quantitative metrics. Overcoming this barrier is essential for multi-center osteoporosis research and for ensuring that radiomics and structural finite element analysis results remain reproducible across sites. In this work, we developed a domain-adaptive transformer segmentation framework tailored for multi-institutional QCT. Our model is trained and validated on one of the largest hip fracture related research cohorts to date, comprising 1,024 QCT images scans from Tulane University and 384 scans from Rochester, Minnesota for proximal femur segmentation. To address domain shift, we integrate two complementary strategies within a 3D TransUNet backbone: adversarial alignment via Gradient Reversal Layer (GRL), which discourages the network from encoding site-specific cues, and statistical alignment via Maximum Mean Discrepancy (MMD), which explicitly reduces distributional mismatches between institutions. This dual mechanism balances invariance and fine-grained alignment, enabling scanner-agnostic feature learning while preserving anatomical detail.
comment: 22 Pages, 5 Tables, 10 Figures. The combination of GRL and MMD achieved the most balanced performance, reducing contour deviations and enhancing surface smoothness
☆ ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts.These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
comment: project page: https://thinkmorph.github.io/
☆ Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs) as we find in preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measurements, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation on popular proprietary and open-weight VLMs shows that even the strongest frontier VLMs struggle measurement reading in general. A consistent failure mode is indicator localization: models can read digits or labels but misidentify the key positions of pointers or alignments, leading to big numeric errors despite plausible textual reasoning. We have also conducted preliminary experiments with reinforcement learning over synthetic data, and find encouraging results on in-domain synthetic subset but less promising for real-world images. Our analysis highlights a fundamental limitation of current VLMs in fine-grained spatial grounding. We hope this resource can help future advances on visually grounded numeracy and precise spatial perception of VLMs, bridging the gap between recognizing numbers and measuring the world.
comment: Project page: https://flageval-baai.github.io/MeasureBenchPage/
♻ ☆ Smoothing Slot Attention Iterations and Recurrences
Slot Attention (SA) and its variants lie at the heart of mainstream Object-Centric Learning (OCL). Objects in an image can be aggregated into respective slot vectors, by \textit{iteratively} refining cold-start query vectors, typically three times, via SA on image features. For video, such aggregation is \textit{recurrently} shared across frames, with queries cold-started on the first frame while transitioned from the previous frame's slots on non-first frames. However, the cold-start queries lack sample-specific cues thus hinder precise aggregation on the image or video's first frame; Also, non-first frames' queries are already sample-specific thus require transforms different from the first frame's aggregation. We address these issues for the first time with our \textit{SmoothSA}: (1) To smooth SA iterations on the image or video's first frame, we \textit{preheat} the cold-start queries with rich information of input features, via a tiny module self-distilled inside OCL; (2) To smooth SA recurrences across all video frames, we \textit{differentiate} the homogeneous transforms on the first and non-first frames, by using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and downstream benchmarks validate our method's effectiveness. Further analyses intuitively illuminate how our method smooths SA iterations and recurrences. Our source code, model checkpoints and training logs are available on https://github.com/Genera1Z/SmoothSA.
♻ ☆ Predicting Video Slot Attention Queries from Random Slot-Feature Pairs
Unsupervised video Object-Centric Learning (OCL) is promising as it enables object-level scene representation and dynamics modeling as we humans do. Mainstream video OCL methods adopt a recurrent architecture: An aggregator aggregates current video frame into object features, termed slots, under some queries; A transitioner transits current slots to queries for the next frame. This is an effective architecture but all existing implementations both (\textit{i1}) neglect to incorporate next frame features, the most informative source for query prediction, and (\textit{i2}) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (\textit{t1}) We design a new transitioner to incorporate both slots and features, which provides more information for query prediction; (\textit{t2}) We train the transitioner to predict queries from slot-feature pairs randomly sampled from available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpass existing video OCL methods significantly, e.g., up to 10 points on object discovery, setting new state-of-the-art. Such superiority also benefits downstream tasks like dynamics modeling. Our core source code, model checkpoints and training logs are available on https://github.com/Genera1Z/RandSF.Q.
♻ ☆ Locality in Image Diffusion Models Emerges from Data Statistics
Recent work has shown that the generalization ability of image diffusion models arises from the locality properties of the trained neural network. In particular, when denoising a particular pixel, the model relies on a limited neighborhood of the input image around that pixel, which, according to the previous work, is tightly related to the ability of these models to produce novel images. Since locality is central to generalization, it is crucial to understand why diffusion models learn local behavior in the first place, as well as the factors that govern the properties of locality patterns. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset and is not due to the inductive bias of convolutional neural networks, as suggested in previous work. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to deep neural denoisers. We show, both theoretically and experimentally, that this locality arises directly from pixel correlations present in the image datasets. Moreover, locality patterns are drastically different on specialized datasets, approximating principal components of the data's covariance. We use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than prior expert-crafted alternatives. Our key takeaway is that while neural network architectures influence generation quality, their primary role is to capture locality patterns inherent in the data.
comment: 31 pages, 20 figures, 7 tables
♻ ☆ ScoreAdv: Score-based Targeted Generation of Natural Adversarial Examples via Diffusion Models
Despite the success of deep learning across various domains, it remains vulnerable to adversarial attacks. Although many existing adversarial attack methods achieve high success rates, they typically rely on $\ell_{p}$-norm perturbation constraints, which do not align with human perceptual capabilities. Consequently, researchers have shifted their focus toward generating natural, unrestricted adversarial examples (UAEs). GAN-based approaches suffer from inherent limitations, such as poor image quality due to instability and mode collapse. Meanwhile, diffusion models have been employed for UAE generation, but they still rely on iterative PGD perturbation injection, without fully leveraging their central denoising capabilities. In this paper, we introduce a novel approach for generating UAEs based on diffusion models, named ScoreAdv. This method incorporates an interpretable adversarial guidance mechanism to gradually shift the sampling distribution towards the adversarial distribution, while using an interpretable saliency map to inject the visual information of a reference image into the generated samples. Notably, our method is capable of generating an unlimited number of natural adversarial examples and can attack not only classification models but also retrieval models. We conduct extensive experiments on ImageNet and CelebA datasets, validating the performance of ScoreAdv across ten target models in both black-box and white-box settings. Our results demonstrate that ScoreAdv achieves state-of-the-art attack success rates and image quality, while maintaining inference efficiency. Furthermore, the dynamic balance between denoising and adversarial perturbation enables ScoreAdv to remain robust even under defensive measures.
♻ ☆ GSE: Group-wise Sparse and Explainable Adversarial Attacks
Sparse adversarial attacks fool deep neural networks (DNNs) through minimal pixel perturbations, often regularized by the $\ell_0$ norm. Recent efforts have replaced this norm with a structural sparsity regularizer, such as the nuclear group norm, to craft group-wise sparse adversarial attacks. The resulting perturbations are thus explainable and hold significant practical relevance, shedding light on an even greater vulnerability of DNNs. However, crafting such attacks poses an optimization challenge, as it involves computing norms for groups of pixels within a non-convex objective. We address this by presenting a two-phase algorithm that generates group-wise sparse attacks within semantically meaningful areas of an image. Initially, we optimize a quasinorm adversarial loss using the $1/2-$quasinorm proximal operator tailored for non-convex programming. Subsequently, the algorithm transitions to a projected Nesterov's accelerated gradient descent with $2-$norm regularization applied to perturbation magnitudes. Rigorous evaluations on CIFAR-10 and ImageNet datasets demonstrate a remarkable increase in group-wise sparsity, e.g., $50.9\%$ on CIFAR-10 and $38.4\%$ on ImageNet (average case, targeted attack). This performance improvement is accompanied by significantly faster computation times, improved explainability, and a $100\%$ attack success rate.
♻ ☆ Resource Efficient Multi-stain Kidney Glomeruli Segmentation via Self-supervision
Semantic segmentation under domain shift remains a fundamental challenge in computer vision, particularly when labelled training data is scarce. This challenge is particularly exemplified in histopathology image analysis, where the same tissue structures must be segmented across images captured under different imaging conditions (stains), each representing a distinct visual domain. Traditional deep learning methods like UNet require extensive labels, which is both costly and time-consuming, particularly when dealing with multiple domains (or stains). To mitigate this, various unsupervised domain adaptation based methods such as UDAGAN have been proposed, which reduce the need for labels by requiring only one (source) stain to be labelled. Nonetheless, obtaining source stain labels can still be challenging. This article shows that through self-supervised pre-training -- including SimCLR, BYOL, and a novel approach, HR-CS-CO -- the performance of these segmentation methods (UNet, and UDAGAN) can be retained even with 95% fewer labels. Notably, with self-supervised pre-training and using only 5% labels, the performance drops are minimal: 5.9% for UNet and 6.2% for UDAGAN, averaged over all stains, compared to their respective fully supervised counterparts (without pre-training, using 100% labels). Furthermore, these findings are shown to generalise beyond their training distribution to public benchmark datasets. Implementations and pre-trained models are publicly available \href{https://github.com/zeeshannisar/resource-effecient-multi-stain-kidney-glomeruli-segmentation.git}{online}.
comment: 39 pages, 10 figures, 4 Tables
♻ ☆ CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance and observational robustness. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
comment: 39 pages, 24 figures
♻ ☆ Fit for Purpose? Deepfake Detection in the Real World
The rapid proliferation of AI-generated content, driven by advances in generative adversarial networks, diffusion models, and multimodal large language models, has made the creation and dissemination of synthetic media effortless, heightening the risks of misinformation, particularly political deepfakes that distort truth and undermine trust in political institutions. In turn, governments, research institutions, and industry have strongly promoted deepfake detection initiatives as solutions. Yet, most existing models are trained and validated on synthetic, laboratory-controlled datasets, limiting their generalizability to the kinds of real-world political deepfakes circulating on social platforms that affect the public. In this work, we introduce the first systematic benchmark based on the Political Deepfakes Incident Database, a curated collection of real-world political deepfakes shared on social media since 2018. Our study includes a systematic evaluation of state-of-the-art deepfake detectors across academia, government, and industry. We find that the detectors from academia and government perform relatively poorly. While paid detection tools achieve relatively higher performance than free-access models, all evaluated detectors struggle to generalize effectively to authentic political deepfakes, and are vulnerable to simple manipulations, especially in the video domain. Results urge the need for politically contextualized deepfake detection frameworks to better safeguard the public in real-world settings.
♻ ☆ DDL: A Large-Scale Datasets for Deepfake Detection and Localization in Diversified Real-World Scenarios
Recent advances in AIGC have exacerbated the misuse of malicious deepfake content, making the development of reliable deepfake detection methods an essential means to address this challenge. Although existing deepfake detection models demonstrate outstanding performance in detection metrics, most methods only provide simple binary classification results, lacking interpretability. Recent studies have attempted to enhance the interpretability of classification results by providing spatial manipulation masks or temporal forgery segments. However, due to the limitations of forgery datasets, the practical effectiveness of these methods remains suboptimal. The primary reason lies in the fact that most existing deepfake datasets contain only binary labels, with limited variety in forgery scenarios, insufficient diversity in deepfake types, and relatively small data scales, making them inadequate for complex real-world scenarios.To address this predicament, we construct a novel large-scale deepfake detection and localization (\textbf{DDL}) dataset containing over $\textbf{1.4M+}$ forged samples and encompassing up to $\textbf{80}$ distinct deepfake methods. The DDL design incorporates four key innovations: (1) \textbf{Comprehensive Deepfake Methods} (covering 7 different generation architectures and a total of 80 methods), (2) \textbf{Varied Manipulation Modes} (incorporating 7 classic and 3 novel forgery modes), (3) \textbf{Diverse Forgery Scenarios and Modalities} (including 3 scenarios and 3 modalities), and (4) \textbf{Fine-grained Forgery Annotations} (providing 1.18M+ precise spatial masks and 0.23M+ precise temporal segments).Through these improvements, our DDL not only provides a more challenging benchmark for complex real-world forgeries but also offers crucial support for building next-generation deepfake detection, localization, and interpretability methods.
comment: This paper is a preliminary version, with an extended and comprehensive version currently under development
♻ ☆ HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis
Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily due to their reliance on implicit modeling of audio-facial motion correlations--an approach lacking explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation combining both implicit and explicit motion cues. Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker's superiority over state-of-the-art methods in visual quality and lip-sync accuracy.
♻ ☆ MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.
comment: 20 pages, 8 figures
♻ ☆ LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution
Generative models for Image Super-Resolution (SR) are increasingly powerful, yet their reliance on self-attention's quadratic complexity (O(N^2)) creates a major computational bottleneck. Linear Attention offers an O(N) solution, but its promise for photorealistic SR has remained largely untapped, historically hindered by a cascade of interrelated and previously unsolved challenges. This paper introduces LinearSR, a holistic framework that, for the first time, systematically overcomes these critical hurdles. Specifically, we resolve a fundamental, training instability that causes catastrophic model divergence using our novel "knee point"-based Early-Stopping Guided Fine-tuning (ESGF) strategy. Furthermore, we mitigate the classic perception-distortion trade-off with a dedicated SNR-based Mixture of Experts (MoE) architecture. Finally, we establish an effective and lightweight guidance paradigm, TAG, derived from our "precision-over-volume" principle. Our resulting LinearSR model simultaneously delivers state-of-the-art perceptual quality with exceptional efficiency. Its core diffusion forward pass (1-NFE) achieves SOTA-level speed, while its overall multi-step inference time remains highly competitive. This work provides the first robust methodology for applying Linear Attention in the photorealistic SR domain, establishing a foundational paradigm for future research in efficient generative super-resolution.
comment: 19 pages, 9 figures, 6 tables
♻ ☆ UV-Attack: Physical-World Adversarial Attacks for Person Detection via Dynamic-NeRF-based UV Mapping ICLR2025
In recent research, adversarial attacks on person detectors using patches or static 3D model-based texture modifications have struggled with low success rates due to the flexible nature of human movement. Modeling the 3D deformations caused by various actions has been a major challenge. Fortunately, advancements in Neural Radiance Fields (NeRF) for dynamic human modeling offer new possibilities. In this paper, we introduce UV-Attack, a groundbreaking approach that achieves high success rates even with extensive and unseen human actions. We address the challenge above by leveraging dynamic-NeRF-based UV mapping. UV-Attack can generate human images across diverse actions and viewpoints, and even create novel actions by sampling from the SMPL parameter space. While dynamic NeRF models are capable of modeling human bodies, modifying clothing textures is challenging because they are embedded in neural network parameters. To tackle this, UV-Attack generates UV maps instead of RGB images and modifies the texture stacks. This approach enables real-time texture edits and makes the attack more practical. We also propose a novel Expectation over Pose Transformation loss (EoPT) to improve the evasion success rate on unseen poses and views. Our experiments show that UV-Attack achieves a 92.7% attack success rate against the FastRCNN model across varied poses in dynamic video settings, significantly outperforming the state-of-the-art AdvCamou attack, which only had a 28.5% ASR. Moreover, we achieve 49.5% ASR on the latest YOLOv8 detector in black-box settings. This work highlights the potential of dynamic NeRF-based UV mapping for creating more effective adversarial attacks on person detectors, addressing key challenges in modeling human movement and texture modification. The code is available at https://github.com/PolyLiYJ/UV-Attack.
comment: 23 pages, 22 figures, accepted by ICLR2025
♻ ☆ A Continuous and Interpretable Morphometric for Robust Quantification of Dynamic Biological Shapes
We introduce the Push-Forward Signed Distance Morphometric (PF-SDM) for shape quantification in biomedical imaging. The PF-SDM compactly encodes geometric and topological properties of closed shapes, including their skeleton and symmetries. This provides robust and interpretable features for shape comparison and machine learning. The PF-SDM is mathematically smooth, providing access to gradients and differential-geometric quantities. It also extends to temporal dynamics and allows fusing spatial intensity distributions, such as genetic markers, with shape dynamics. We present the PF-SDM theory, benchmark it on synthetic data, and apply it to predicting body-axis formation in mouse gastruloids, outperforming a CNN baseline in both accuracy and speed.
♻ ☆ VerifIoU - Robustness of Object Detection to Perturbations
We introduce a novel Interval Bound Propagation (IBP) approach for the formal verification of object detection models, specifically targeting the Intersection over Union (IoU) metric. The approach has been implemented in an open source code, named IBP IoU, compatible with popular abstract interpretation based verification tools. The resulting verifier is evaluated on landing approach runway detection and handwritten digit recognition case studies. Comparisons against a baseline (Vanilla IBP IoU) highlight the superior performance of IBP IoU in ensuring accuracy and stability, contributing to more secure and robust machine learning applications.
♻ ☆ ReCon-GS: Continuum-Preserved Gaussian Streaming for Fast and Compact Reconstruction of Dynamic Scenes NeurIPS 2025
Online free-viewpoint video (FVV) reconstruction is challenged by slow per-frame optimization, inconsistent motion estimation, and unsustainable storage demands. To address these challenges, we propose the Reconfigurable Continuum Gaussian Stream, dubbed ReCon-GS, a novel storage-aware framework that enables high fidelity online dynamic scene reconstruction and real-time rendering. Specifically, we dynamically allocate multi-level Anchor Gaussians in a density-adaptive fashion to capture inter-frame geometric deformations, thereby decomposing scene motion into compact coarse-to-fine representations. Then, we design a dynamic hierarchy reconfiguration strategy that preserves localized motion expressiveness through on-demand anchor re-hierarchization, while ensuring temporal consistency through intra-hierarchical deformation inheritance that confines transformation priors to their respective hierarchy levels. Furthermore, we introduce a storage-aware optimization mechanism that flexibly adjusts the density of Anchor Gaussians at different hierarchy levels, enabling a controllable trade-off between reconstruction fidelity and memory usage. Extensive experiments on three widely used datasets demonstrate that, compared to state-of-the-art methods, ReCon-GS improves training efficiency by approximately 15% and achieves superior FVV synthesis quality with enhanced robustness and stability. Moreover, at equivalent rendering quality, ReCon-GS slashes memory requirements by over 50% compared to leading state-of-the-art methods.
comment: Published in NeurIPS 2025
♻ ☆ StyleGuard: Preventing Text-to-Image-Model-based Style Mimicry Attacks by Style Perturbations NIPS2025
Recently, text-to-image diffusion models have been widely used for style mimicry and personalized customization through methods such as DreamBooth and Textual Inversion. This has raised concerns about intellectual property protection and the generation of deceptive content. Recent studies, such as Glaze and Anti-DreamBooth, have proposed using adversarial noise to protect images from these attacks. However, recent purification-based methods, such as DiffPure and Noise Upscaling, have successfully attacked these latest defenses, showing the vulnerabilities of these methods. Moreover, present methods show limited transferability across models, making them less effective against unknown text-to-image models. To address these issues, we propose a novel anti-mimicry method, StyleGuard. We propose a novel style loss that optimizes the style-related features in the latent space to make it deviate from the original image, which improves model-agnostic transferability. Additionally, to enhance the perturbation's ability to bypass diffusion-based purification, we designed a novel upscale loss that involves ensemble purifiers and upscalers during training. Extensive experiments on the WikiArt and CelebA datasets demonstrate that StyleGuard outperforms existing methods in robustness against various transformations and purifications, effectively countering style mimicry in various models. Moreover, StyleGuard is effective on different style mimicry methods, including DreamBooth and Textual Inversion. The code is available at https://github.com/PolyLiYJ/StyleGuard.
comment: Accepted by NIPS2025
♻ ☆ LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification
As an important task in intelligent transportation systems, Aerial-Ground person Re-IDentification (AG-ReID) aims to retrieve specific persons across heterogeneous cameras in different viewpoints. Previous methods typically adopt deep learning-based models, focusing on extracting view-invariant features. However, they usually overlook the semantic information in person attributes. In addition, existing training strategies often rely on full fine-tuning large-scale models, which significantly increases training costs. To address these issues, we propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to leverage attribute-based text knowledge. Specifically, with the Contrastive Language-Image Pre-training (CLIP) model, we first propose an Attribute-aware Image Encoder (AIE) to extract both global semantic features and attribute-aware features from input images. Then, with these features, we propose a Prompted Attribute Classifier Group (PACG) to predict person attributes and obtain attribute representations. Finally, we design a Coupled Prompt Template (CPT) to transform attribute representations and view information into structured sentences. These sentences are processed by the text encoder of CLIP to generate more discriminative features. As a result, our framework can fully leverage attribute-based text knowledge to improve AG-ReID performance. Extensive experiments on three AG-ReID benchmarks demonstrate the effectiveness of our proposed methods. The source code is available at https://github.com/kevinhu314/LATex.
comment: More modifications may be performed
♻ ☆ EmoAttack: Emotion-to-Image Diffusion Models for Emotional Backdoor Generation
Text-to-image diffusion models can generate realistic images based on textual inputs, enabling users to convey their opinions visually through language. Meanwhile, within language, emotion plays a crucial role in expressing personal opinions in our daily lives and the inclusion of maliciously negative content can lead users astray, exacerbating negative emotions. Recognizing the success of diffusion models and the significance of emotion, we investigate a previously overlooked risk associated with text-to-image diffusion models, that is, utilizing emotion in the input texts to introduce negative content and provoke unfavorable emotions in users. Specifically, we identify a new backdoor attack, i.e., emotion-aware backdoor attack (EmoAttack), which introduces malicious negative content triggered by emotional texts during image generation. We formulate such an attack as a diffusion personalization problem to avoid extensive model retraining and propose the EmoBooth. Unlike existing personalization methods, our approach fine-tunes a pre-trained diffusion model by establishing a mapping between a cluster of emotional words and a given reference image containing malicious negative content. To validate the effectiveness of our method, we built a dataset and conducted extensive analysis and discussion about its effectiveness. Given consumers' widespread use of diffusion models, uncovering this threat is critical for society.
♻ ☆ Two Heads are Better than One: Robust Learning Meets Multi-branch Models
Deep neural networks (DNNs) are vulnerable to adversarial examples, in which DNNs are misled to false outputs due to inputs containing imperceptible perturbations. Adversarial training, a reliable and effective method of defense, may significantly reduce the vulnerability of neural networks and becomes the de facto standard for robust learning. While many recent works practice the data-centric philosophy, such as how to generate better adversarial examples or use generative models to produce additional training data, we look back to the models themselves and revisit the adversarial robustness from the perspective of deep feature distribution as an insightful complementarity. In this paper, we propose \textit{Branch Orthogonality adveRsarial Training} (BORT) to obtain state-of-the-art performance with solely the original dataset for adversarial training. To practice our design idea of integrating multiple orthogonal solution spaces, we leverage a simple and straightforward multi-branch neural network that eclipses adversarial attacks with no increase in inference time. We heuristically propose a corresponding loss function, branch-orthogonal loss, to make each solution space of the multi-branch model orthogonal. We evaluate our approach on CIFAR-10, CIFAR-100 and SVHN against $\ell_{\infty}$ norm-bounded perturbations of size $\epsilon = 8/255$, respectively. Exhaustive experiments are conducted to show that our method goes beyond all state-of-the-art methods without any tricks. Compared to all methods that do not use additional data for training, our models achieve 67.3\% and 41.5\% robust accuracy on CIFAR-10 and CIFAR-100 (improving upon the state-of-the-art by +7.23\% and +9.07\%). We also outperform methods using a training set with a far larger scale than ours.
comment: Camera-ready version for ICPADS 2025
♻ ☆ Static for Dynamic: Towards a Deeper Understanding of Dynamic Facial Expressions Using Static Expression Data
Dynamic facial expression recognition (DFER) infers emotions from the temporal evolution of expressions, unlike static facial expression recognition (SFER), which relies solely on a single snapshot. This temporal analysis provides richer information and promises greater recognition capability. However, current DFER methods often exhibit unsatisfied performance largely due to fewer training samples compared to SFER. Given the inherent correlation between static and dynamic expressions, we hypothesize that leveraging the abundant SFER data can enhance DFER. To this end, we propose Static-for-Dynamic (S4D), a unified dual-modal learning framework that integrates SFER data as a complementary resource for DFER. Specifically, S4D employs dual-modal self-supervised pre-training on facial images and videos using a shared Vision Transformer (ViT) encoder-decoder architecture, yielding improved spatiotemporal representations. The pre-trained encoder is then fine-tuned on static and dynamic expression datasets in a multi-task learning setup to facilitate emotional information interaction. Unfortunately, vanilla multi-task learning in our study results in negative transfer. To address this, we propose an innovative Mixture of Adapter Experts (MoAE) module that facilitates task-specific knowledge acquisition while effectively extracting shared knowledge from both static and dynamic expression data. Extensive experiments demonstrate that S4D achieves a deeper understanding of DFER, setting new state-of-the-art performance on FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65\%, 58.44\%, and 76.68\%, respectively. Additionally, a systematic correlation analysis between SFER and DFER tasks is presented, which further elucidates the potential benefits of leveraging SFER.
comment: The code and model are publicly available here https://github.com/MSA-LMC/S4D
♻ ☆ Tunable-Generalization Diffusion Powered by Self-Supervised Contextual Sub-Data for Low-Dose CT Reconstruction
Current models based on deep learning for low-dose CT denoising rely heavily on paired data and generalize poorly. Even the more concerned diffusion models need to learn the distribution of clean data for reconstruction, which is difficult to satisfy in medical clinical applications. At the same time, self-supervised-based methods face the challenge of significant degradation of generalizability of models pre-trained for the current dose to expand to other doses. To address these issues, this work proposes a novel method of TUnable-geneRalizatioN Diffusion (TurnDiff) powered by self-supervised contextual sub-data for low-dose CT reconstruction. Firstly, a contextual subdata self-enhancing similarity strategy is designed for denoising centered on the LDCT projection domain, which provides an initial prior for the subsequent progress. Subsequently, the initial prior is used to combine knowledge distillation with a deep combination of latent diffusion models for optimizing image details. The pre-trained model is used for inference reconstruction, and the pixel-level self-correcting fusion technique is proposed for fine-grained reconstruction of the image domain to enhance the image fidelity, using the initial prior and the LDCT image as a guide. In addition, the technique is flexibly applied to the generalization of upper and lower doses or even unseen doses. Dual-domain strategy cascade for self-supervised LDCT denoising, TurnDiff requires only LDCT projection domain data for training and testing. Comprehensive evaluation on both benchmark datasets and real-world data demonstrates that TurnDiff consistently outperforms state-of-the-art methods in both reconstruction and generalization.
♻ ☆ Seeing Structural Failure Before it Happens: An Image-Based Physics-Informed Neural Network (PINN) for Spaghetti Bridge Load Prediction
Physics Informed Neural Networks (PINNs) are gaining attention for their ability to embed physical laws into deep learning models, which is particularly useful in structural engineering tasks with limited data. This paper aims to explore the use of PINNs to predict the weight of small scale spaghetti bridges, a task relevant to understanding load limits and potential failure modes in simplified structural models. Our proposed framework incorporates physics-based constraints to the prediction model for improved performance. In addition to standard PINNs, we introduce a novel architecture named Physics Informed Kolmogorov Arnold Network (PIKAN), which blends universal function approximation theory with physical insights. The structural parameters provided as input to the model are collected either manually or through computer vision methods. Our dataset includes 15 real bridges, augmented to 100 samples, and our best model achieves an $R^2$ score of 0.9603 and a mean absolute error (MAE) of 10.50 units. From applied perspective, we also provide a web based interface for parameter entry and prediction. These results show that PINNs can offer reliable estimates of structural weight, even with limited data, and may help inform early stage failure analysis in lightweight bridge designs. The complete data and code are available at https://github.com/OmerJauhar/PINNS-For-Spaghetti-Bridges.
comment: 12 pages, 17 figures. Preprint
♻ ☆ SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification
Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative models to maintain the identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust model is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model's ability to represent persons. To address these issues, we propose a novel generative framework named SD-ReID for AG-ReID, which leverages generative models to mimic the feature distribution of different views while extracting robust identity representations. More specifically, we first train a ViT-based model to extract person representations along with controllable conditions, including identity and view conditions. We then fine-tune the Stable Diffusion (SD) model to enhance person representations guided by these controllable conditions. Furthermore, we introduce the View-Refined Decoder (VRD) to bridge the gap between instance-level and global-level features. Finally, both person representations and all-view features are employed to retrieve target persons. Extensive experiments on five AG-ReID benchmarks (i.e., CARGO, AG-ReIDv1, AG-ReIDv2, LAGPeR and G2APS-ReID) demonstrate the effectiveness of our proposed method. The source code will be available.
comment: More modifications may performed
♻ ☆ TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection EMNLP 2025
Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model's ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.
comment: EMNLP 2025 Oral; Project Homepage: https://yanzehong.github.io/trust-vl/
♻ ☆ Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i)Visual Quality-semantic alignment with human posters, (ii)Textual Coherence-language fluency, (iii)Holistic Assessment-six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv)PaperQuiz-the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a)Parser distills the paper into a structured asset library; the (b)Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c)Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs-though visually appealing at first glance-often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g. based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster - all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.
comment: Project Page: https://github.com/Paper2Poster/Paper2Poster
♻ ☆ MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning? NeurIPS'25
Large foundation models face challenges in acquiring transferable, structured thinking abilities, especially when supervised with rigid templates or crowd-annotated instruction datasets. Unlike prior approaches, we focus on a thinking-centric data synthesis paradigm that enables models to evolve through self-generated, cognitively guided data. We propose MindGYM, a structured and scalable framework for question synthesis, composed of: (1) Cognitive Thinking Process Injection, which infuses high-level reasoning objectives to shape the model's synthesis behavior; (2) Seed Single-Hop Question Synthesis, generating atomic questions from diverse semantic types to encourage broader thinking; and (3) Challenging Multi-Hop QA Synthesis, composing more complex multi-hop questions based on QA seeds for deeper reasoning. Detailed analysis shows that synthetic data generated by our method achieves 16.7% higher average quality and 67.91% lower quality variance compared to baseline sources, highlighting that both high-quality and self-contained data are essential for effective, thinking-oriented fine-tuning. MindGYM improves performance on six reasoning benchmarks, achieving gains of up to 16% on MathVision using only 400 data samples, and generalizable improvements across different model sizes and architectures. MindGYM underscores the viability of self-challenging mechanisms in refining large model capabilities while minimizing human intervention and resource demands. Code and data are released to promote data-centric research into self-evolving foundation models driven by their internal reasoning capabilities.
comment: Accepted by NeurIPS'25. 30 pages, 2 figures, 13 tables
♻ ☆ D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning - A Benchmark Dataset and Method ICDM
Dark humor in online memes poses unique challenges due to its reliance on implicit, sensitive, and culturally contextual cues. To address the lack of resources and methods for detecting dark humor in multimodal content, we introduce a novel dataset of 4,379 Reddit memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating (mild, moderate, severe). Building on this resource, we propose a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model (VLM). Through a Role-Reversal Self-Loop, VLM adopts the author's perspective to iteratively refine its explanations, ensuring completeness and alignment. We then extract textual features from both the OCR transcript and the self-refined reasoning via a text encoder, while visual features are obtained using a vision transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three streams, text, image, and reasoning, via pairwise attention mechanisms, producing a unified representation for classification. Experimental results demonstrate that our approach outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are released to facilitate further research in multimodal humor understanding and content moderation. Code and Dataset are available at: https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning
comment: Accepted at IEEE International Conference on Data Mining (ICDM) 2025
♻ ☆ Disentangled 4D Gaussian Splatting: Rendering High-Resolution Dynamic World at 343 FPS
While dynamic novel view synthesis from 2D videos has seen progress, achieving efficient reconstruction and rendering of dynamic scenes remains a challenging task. In this paper, we introduce Disentangled 4D Gaussian Splatting (Disentangled4DGS), a novel representation and rendering pipeline that achieves real-time performance without compromising visual fidelity. Disentangled4DGS decouples the temporal and spatial components of 4D Gaussians, avoiding the need for slicing first and four-dimensional matrix calculations in prior methods. By projecting temporal and spatial deformations into dynamic 2D Gaussians and deferring temporal processing, we minimize redundant computations of 4DGS. Our approach also features a gradient-guided flow loss and temporal splitting strategy to reduce artifacts. Experiments demonstrate a significant improvement in rendering speed and quality, achieving 343 FPS when render 1352*1014 resolution images on a single RTX3090 while reducing storage requirements by at least 4.5%. Our approach sets a new benchmark for dynamic novel view synthesis, outperforming existing methods on both multi-view and monocular dynamic scene datasets.
♻ ☆ GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution-its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage-while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.
comment: Project Page: https://jingw193.github.io/GRPO-Guard/
♻ ☆ RRCANet: Recurrent Reusable-Convolution Attention Network for Infrared Small Target Detection
Infrared small target detection is a challenging task due to its unique characteristics (e.g., small, dim, shapeless and changeable). Recently published CNN-based methods have achieved promising performance with heavy feature extraction and fusion modules. To achieve efficient and effective detection, we propose a recurrent reusable-convolution attention network (RRCA-Net) for infrared small target detection. Specifically, RRCA-Net incorporates reusable-convolution block (RuCB) in a recurrent manner without introducing extra parameters. With the help of the repetitive iteration in RuCB, the high-level information of small targets in the deep layers can be well maintained and further refined. Then, a dual interactive attention aggregation module (DIAAM) is proposed to promote the mutual enhancement and fusion of refined information. In this way, RRCA-Net can both achieve high-level feature refinement and enhance the correlation of contextual information between adjacent layers. Moreover, to achieve steady convergence, we design a target characteristic inspired loss function (DpT-k loss) by integrating physical and mathematical constraints. Experimental results on three benchmark datasets (e.g. NUAA-SIRST, IRSTD-1k, DenseSIRST) demonstrate that our RRCA-Net can achieve comparable performance to the state-of-the-art methods while maintaining a small number of parameters, and act as a plug and play module to introduce consistent performance improvement for several popular IRSTD methods.
comment: We have updated the journal reference and DOI
♻ ☆ Open3D-VQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space
Spatial reasoning is a fundamental capability of multimodal large language models (MLLMs), yet their performance in open aerial environments remains underexplored. In this work, we present Open3D-VQA, a novel benchmark for evaluating MLLMs' ability to reason about complex spatial relationships from an aerial perspective. The benchmark comprises 73k QA pairs spanning 7 general spatial reasoning tasks, including multiple-choice, true/false, and short-answer formats, and supports both visual and point cloud modalities. The questions are automatically generated from spatial relations extracted from both real-world and simulated aerial scenes. Evaluation on 13 popular MLLMs reveals that: 1) Models are generally better at answering questions about relative spatial relations than absolute distances, 2) 3D LLMs fail to demonstrate significant advantages over 2D LLMs, and 3) Fine-tuning solely on the simulated dataset can significantly improve the model's spatial reasoning performance in real-world scenarios. We release our benchmark, data generation pipeline, and evaluation toolkit to support further research: https://github.com/EmbodiedCity/Open3D-VQA.code.
♻ ☆ FASL-Seg: Anatomy and Tool Segmentation of Surgical Scenes ECAI
The growing popularity of robotic minimally invasive surgeries has made deep learning-based surgical training a key area of research. A thorough understanding of the surgical scene components is crucial, which semantic segmentation models can help achieve. However, most existing work focuses on surgical tools and overlooks anatomical objects. Additionally, current state-of-the-art (SOTA) models struggle to balance capturing high-level contextual features and low-level edge features. We propose a Feature-Adaptive Spatial Localization model (FASL-Seg), designed to capture features at multiple levels of detail through two distinct processing streams, namely a Low-Level Feature Projection (LLFP) and a High-Level Feature Projection (HLFP) stream, for varying feature resolutions - enabling precise segmentation of anatomy and surgical instruments. We evaluated FASL-Seg on surgical segmentation benchmark datasets EndoVis18 and EndoVis17 on three use cases. The FASL-Seg model achieves a mean Intersection over Union (mIoU) of 72.71% on parts and anatomy segmentation in EndoVis18, improving on SOTA by 5%. It further achieves a mIoU of 85.61% and 72.78% in EndoVis18 and EndoVis17 tool type segmentation, respectively, outperforming SOTA overall performance, with comparable per-class SOTA results in both datasets and consistent performance in various classes for anatomy and instruments, demonstrating the effectiveness of distinct processing streams for varying feature resolutions.
comment: 8 pages, 6 figures, In Proceedings of European Conference on Artificial Intelligence (ECAI) 2025
♻ ☆ Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, a first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference. (2) Spatial-Aware Prompt (SAP) incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset Omni-VFX via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.
♻ ☆ Defending Multimodal Backdoored Models by Repulsive Visual Prompt Tuning
Multimodal contrastive learning models (e.g., CLIP) can learn high-quality representations from large-scale image-text datasets, while they exhibit significant vulnerabilities to backdoor attacks, raising serious safety concerns. In this paper, we reveal that CLIP's vulnerabilities primarily stem from its tendency to encode features beyond in-dataset predictive patterns, compromising its visual feature resistivity to input perturbations. This makes its encoded features highly susceptible to being reshaped by backdoor triggers. To address this challenge, we propose Repulsive Visual Prompt Tuning (RVPT), a novel defense approach that employs deep visual prompt tuning with a specially designed feature-repelling loss. Specifically, RVPT adversarially repels the encoded features from deeper layers while optimizing the standard cross-entropy loss, ensuring that only predictive features in downstream tasks are encoded, thereby enhancing CLIP's visual feature resistivity against input perturbations and mitigating its susceptibility to backdoor attacks. Unlike existing multimodal backdoor defense methods that typically require the availability of poisoned data or involve fine-tuning the entire model, RVPT leverages few-shot downstream clean samples and only tunes a small number of parameters. Empirical results demonstrate that RVPT tunes only 0.27\% of the parameters in CLIP, yet it significantly outperforms state-of-the-art defense methods, reducing the attack success rate from 89.70\% to 2.76\% against the most advanced multimodal attacks on ImageNet and effectively generalizes its defensive capabilities across multiple datasets.
♻ ☆ FARMER: Flow AutoRegressive Transformer over Pixels
Directly modeling the explicit likelihood of the raw data distribution is key topic in the machine learning area, which achieves the scaling successes in Large Language Models by autoregressive modeling. However, continuous AR modeling over visual pixel data suffer from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference speed and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.
comment: Bytedance Seed Technical Report
♻ ☆ SPARKE: Scalable Prompt-Aware Diversity and Novelty Guidance in Diffusion Models via RKE Score
Diffusion models have demonstrated remarkable success in high-fidelity image synthesis and prompt-guided generative modeling. However, ensuring adequate diversity in generated samples of prompt-guided diffusion models remains a challenge, particularly when the prompts span a broad semantic spectrum and the diversity of generated data needs to be evaluated in a prompt-aware fashion across semantically similar prompts. Recent methods have introduced guidance via diversity measures to encourage more varied generations. In this work, we extend the diversity measure-based approaches by proposing the Scalable Prompt-Aware R\'eny Kernel Entropy Diversity Guidance (SPARKE) method for prompt-aware diversity guidance. SPARKE utilizes conditional entropy for diversity guidance, which dynamically conditions diversity measurement on similar prompts and enables prompt-aware diversity control. While the entropy-based guidance approach enhances prompt-aware diversity, its reliance on the matrix-based entropy scores poses computational challenges in large-scale generation settings. To address this, we focus on the special case of Conditional latent RKE Score Guidance, reducing entropy computation and gradient-based optimization complexity from the $O(n^3)$ of general entropy measures to $O(n)$. The reduced computational complexity allows for diversity-guided sampling over potentially thousands of generation rounds on different prompts. We numerically test the SPARKE method on several text-to-image diffusion models, demonstrating that the proposed method improves the prompt-aware diversity of the generated data without incurring significant computational costs. We release our code on the project page: https://mjalali.github.io/SPARKE
♻ ☆ Real-Time Neural Video Compression with Unified Intra and Inter Coding
Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 12.1% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.
comment: 10 pages
♻ ☆ TeleEgo: Benchmarking Egocentric AI Assistants in the Wild
Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce \textbf{TeleEgo}, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work \& study, lifestyle \& routines, social activities, and outings \& culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement.TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose two key metrics -- Real-Time Accuracy and Memory Persistence Time -- to jointly assess correctness, temporal responsiveness, and long-term retention. TeleEgo provides a realistic and comprehensive evaluation to advance the development of practical AI assistants.
♻ ☆ DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution NeurIPS 2025
Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require, make inference extremely slow. Sampling acceleration techniques, particularly single-step, provide a potential solution. Nonetheless, achieving one step in VSR remains challenging, due to the high training overhead on video data and stringent fidelity demands. To tackle the above issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (i.e., CogVideoX). To effectively train DOVE, we introduce the latent-pixel training strategy. The strategy employs a two-stage scheme to gradually adapt the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a 28$\times$ speed-up over existing methods such as MGLD-VSR. Code is available at: https://github.com/zhengchen1999/DOVE.
comment: Accepted to NeurIPS 2025. Code is available at: https://github.com/zhengchen1999/DOVE
♻ ☆ Language-guided Open-world Video Anomaly Detection under Weak Supervision
Video anomaly detection (VAD) aims to detect anomalies that deviate from what is expected. In open-world scenarios, the expected events may change as requirements change. For example, not wearing a mask may be considered abnormal during a flu outbreak but normal otherwise. However, existing methods assume that the definition of anomalies is invariable, and thus are not applicable to the open world. To address this, we propose a novel open-world VAD paradigm with variable definitions, allowing guided detection through user-provided natural language at inference time. This paradigm necessitates establishing a robust mapping from video and textual definition to anomaly scores. Therefore, we propose LaGoVAD (Language-guided Open-world Video Anomaly Detector), a model that dynamically adapts anomaly definitions under weak supervision with two regularization strategies: diversifying the relative durations of anomalies via dynamic video synthesis, and enhancing feature robustness through contrastive learning with negative mining. Training such adaptable models requires diverse anomaly definitions, but existing datasets typically provide labels without semantic descriptions. To bridge this gap, we collect PreVAD (Pre-training Video Anomaly Dataset), the largest and most diverse video anomaly dataset to date, featuring 35,279 annotated videos with multi-level category labels and descriptions that explicitly define anomalies. Zero-shot experiments on seven datasets demonstrate LaGoVAD's SOTA performance. Our dataset and code will be released at https://github.com/Kamino666/LaGoVAD-PreVAD.
♻ ☆ Buffer layers for Test-Time Adaptation NeurIPS 2025
In recent advancements in Test Time Adaptation (TTA), most existing methodologies focus on updating normalization layers to adapt to the test domain. However, the reliance on normalization-based adaptation presents key challenges. First, normalization layers such as Batch Normalization (BN) are highly sensitive to small batch sizes, leading to unstable and inaccurate statistics. Moreover, normalization-based adaptation is inherently constrained by the structure of the pre-trained model, as it relies on training-time statistics that may not generalize well to unseen domains. These issues limit the effectiveness of normalization-based TTA approaches, especially under significant domain shift. In this paper, we introduce a novel paradigm based on the concept of a Buffer layer, which addresses the fundamental limitations of normalization layer updates. Unlike existing methods that modify the core parameters of the model, our approach preserves the integrity of the pre-trained backbone, inherently mitigating the risk of catastrophic forgetting during online adaptation. Through comprehensive experimentation, we demonstrate that our approach not only outperforms traditional methods in mitigating domain shift and enhancing model robustness, but also exhibits strong resilience to forgetting. Furthermore, our Buffer layer is modular and can be seamlessly integrated into nearly all existing TTA frameworks, resulting in consistent performance improvements across various architectures. These findings validate the effectiveness and versatility of the proposed solution in real-world domain adaptation scenarios. The code is available at https://github.com/hyeongyu-kim/Buffer_TTA.
comment: Accepted at NeurIPS 2025
♻ ☆ Towards Predicting Any Human Trajectory In Context NeurIPS 2025
Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, the need to fine-tune for each new scenario is often impractical for deployment on edge devices. To address this challenge, we introduce \paper, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables adaptation without fine-tuning on the scenario-specific data at inference time without requiring weight updates. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. Project Page: https://fujiry0.github.io/TrajICL-project-page/.
comment: NeurIPS 2025
♻ ☆ SPLite Hand: Sparsity-Aware Lightweight 3D Hand Pose Estimation
With the increasing ubiquity of AR/VR devices, the deployment of deep learning models on edge devices has become a critical challenge. These devices require real-time inference, low power consumption, and minimal latency. Many framework designers face the conundrum of balancing efficiency and performance. We design a light framework that adopts an encoder-decoder architecture and introduces several key contributions aimed at improving both efficiency and accuracy. We apply sparse convolution on a ResNet-18 backbone to exploit the inherent sparsity in hand pose images, achieving a 42% end-to-end efficiency improvement. Moreover, we propose our SPLite decoder. This new architecture significantly boosts the decoding process's frame rate by 3.1x on the Raspberry Pi 5, while maintaining accuracy on par. To further optimize performance, we apply quantization-aware training, reducing memory usage while preserving accuracy (PA-MPJPE increases only marginally from 9.0 mm to 9.1 mm on FreiHAND). Overall, our system achieves a 2.98x speed-up on a Raspberry Pi 5 CPU (BCM2712 quad-core Arm A76 processor). Our method is also evaluated on compound benchmark datasets, demonstrating comparable accuracy to state-of-the-art approaches while significantly enhancing computational efficiency.
comment: Accepted to AICCC 2025
♻ ☆ From One to More: Contextual Part Latents for 3D Generation
Recent advances in 3D generation have transitioned from multi-view 2D rendering approaches to 3D-native latent diffusion frameworks that exploit geometric priors in ground truth data. Despite progress, three key limitations persist: (1) Single-latent representations fail to capture complex multi-part geometries, causing detail degradation; (2) Holistic latent coding neglects part independence and interrelationships critical for compositional design; (3) Global conditioning mechanisms lack fine-grained controllability. Inspired by human 3D design workflows, we propose CoPart - a part-aware diffusion framework that decomposes 3D objects into contextual part latents for coherent multi-part generation. This paradigm offers three advantages: i) Reduces encoding complexity through part decomposition; ii) Enables explicit part relationship modeling; iii) Supports part-level conditioning. We further develop a mutual guidance strategy to fine-tune pre-trained diffusion models for joint part latent denoising, ensuring both geometric coherence and foundation model priors. To enable large-scale training, we construct Partverse - a novel 3D part dataset derived from Objaverse through automated mesh segmentation and human-verified annotations. Extensive experiments demonstrate CoPart's superior capabilities in part-level editing, articulated object generation, and scene composition with unprecedented controllability.
comment: Project page: https://copart3d.github.io/
♻ ☆ Cycle Diffusion Model for Counterfactual Image Generation
Deep generative models have demonstrated remarkable success in medical image synthesis. However, ensuring conditioning faithfulness and high-quality synthetic images for direct or counterfactual generation remains a challenge. In this work, we introduce a cycle training framework to fine-tune diffusion models for improved conditioning adherence and enhanced synthetic image realism. Our approach, Cycle Diffusion Model (CDM), enforces consistency between generated and original images by incorporating cycle constraints, enabling more reliable direct and counterfactual generation. Experiments on a combined 3D brain MRI dataset (from ABCD, HCP aging & young adults, ADNI, and PPMI) show that our method improves conditioning accuracy and enhances image quality as measured by FID and SSIM. The results suggest that the cycle strategy used in CDM can be an effective method for refining diffusion-based medical image generation, with applications in data augmentation, counterfactual, and disease progression modeling.
♻ ☆ Empowering Agentic Video Analytics Systems with Video Language Models
AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively-significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%. The source code of AVA is available at https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at https://huggingface.co/datasets/iesc/Ava-100.
comment: Accepted to NDSI 2026, 19pages, 12 figures, complementary evaluations and appendix
♻ ☆ Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations NeurIPS 2025
Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as \textit{massive activations}, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We trace these dimension-concentrated massive activations and find that such concentration can be effectively localized by the zero-initialized Adaptive Layer Norm (AdaLN-zero). Building on these findings, we propose Diffusion Transformer Feature (DiTF), a training-free framework designed to extract semantic-discriminative features from DiTs. Specifically, DiTF employs AdaLN to adaptively localize and normalize massive activations with channel-wise modulation. In addition, we develop a channel discard strategy to further eliminate the negative impacts from massive activations. Experimental results demonstrate that our DiTF outperforms both DINO and SD-based models and establishes a new state-of-the-art performance for DiTs in different visual correspondence tasks (\eg, with +9.4\% on Spair-71k and +4.4\% on AP-10K-C.S.).
comment: NeurIPS 2025
♻ ☆ MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
comment: Code available at: https://github.com/HKUST-MINSys-Lab/MMEdge. Accepted by SenSys 2026
♻ ☆ Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features ICCV 2025
The ability of deep neural networks (DNNs) come from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbation that generalize more effectively, boosting black-box transferability. These features ubiquitously come from supervised learning in previous work. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA -- a generative dual self-supervised ViT features attack, that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, boast great adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we are rewarded with remarkable black-box transferability to models of various architectures that outperform state-of-the-arts. Code available at https://github.com/spencerwooo/dSVA.
comment: 14 pages, 9 figures, accepted at ICCV 2025
♻ ☆ Signal-SGN: A Spiking Graph Convolutional Network for Skeletal Action Recognition via Learning Temporal-Frequency Dynamics
For multimodal skeleton-based action recognition, Graph Convolutional Networks (GCNs) are effective models. Still, their reliance on floating-point computations leads to high energy consumption, limiting their applicability in battery-powered devices. While energy-efficient, Spiking Neural Networks (SNNs) struggle to model skeleton dynamics, leading to suboptimal solutions. We propose Signal-SGN (Spiking Graph Convolutional Network), which utilizes the temporal dimension of skeleton sequences as the spike time steps and represents features as multi-dimensional discrete stochastic signals for temporal-frequency domain feature extraction. It combines the 1D Spiking Graph Convolution (1D-SGC) module and the Frequency Spiking Convolution (FSC) module to extract features from the skeleton represented as spiking form. Additionally, the Multi-Scale Wavelet Transform Feature Fusion (MWTF) module is proposed to extract dynamic spiking features and capture frequency-specific characteristics, enhancing classification performance. Experiments across three large-scale datasets reveal Signal-SGN exceeding state-of-the-art SNN-based methods in accuracy and computational efficiency while attaining comparable performance with GCN methods and significantly reducing theoretical energy consumption.
♻ ☆ DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but often fail to generalize to long-tail scenarios due to the lack of essential world knowledge to understand and reason about surrounding environments. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module directly generating semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. The experiment on the ICCV 2025 Autonomous Grand Challenge leaderboard shows that DiffVLA++ achieves EPDMS of 49.12.
♻ ☆ ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models NeurIPS 2025
Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks -- where frontier models perform similarly and near saturation -- our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.
comment: NeurIPS 2025 Datasets & Benchmarks
♻ ☆ Neighborhood Feature Pooling for Remote Sensing Image Classification
In this work, we propose neighborhood feature pooling (NFP) as a novel texture feature extraction method for remote sensing image classification. The NFP layer captures relationships between neighboring inputs and efficiently aggregates local similarities across feature dimensions. Implemented using convolutional layers, NFP can be seamlessly integrated into any network. Results comparing the baseline models and the NFP method indicate that NFP consistently improves performance across diverse datasets and architectures while maintaining minimal parameter overhead.
comment: 9 pages, 5 figures
♻ ☆ GCVAMD: A Modified CausalVAE Model for Causal Age-related Macular Degeneration Risk Factor Detection and Prediction
Age Related Macular Degeneration(AMD) has been one of the most leading causes of permanent vision impairment in ophthalmology. Though treatments, such as anti VEGF drugs or photodynamic therapies, were developed to slow down the degenerative process of AMD, there is still no specific cure to reverse vision loss caused by AMD. Thus, for AMD, detecting existence of risk factors of AMD or AMD itself within the patient retina in early stages is a crucial task to reduce the possibility of vision impairment. Apart from traditional approaches, deep learning based methods, especially attention mechanism based CNNs and GradCAM based XAI analysis on OCT scans, exhibited successful performance in distinguishing AMD retina from normal retinas, making it possible to use AI driven models to aid medical diagnosis and analysis by ophthalmologists regarding AMD. However, though having significant success, previous works mostly focused on prediction performance itself, not pathologies or underlying causal mechanisms of AMD, which can prohibit intervention analysis on specific factors or even lead to less reliable decisions. Thus, this paper introduces a novel causal AMD analysis model: GCVAMD, which incorporates a modified CausalVAE approach that can extract latent causal factors from only raw OCT images. By considering causality in AMD detection, GCVAMD enables causal inference such as treatment simulation or intervention analysis regarding major risk factors: drusen and neovascularization, while returning informative latent causal features that can enhance downstream tasks. Results show that through GCVAMD, drusen status and neovascularization status can be identified with AMD causal mechanisms in GCVAMD latent spaces, which can in turn be used for various tasks from AMD detection(classification) to intervention analysis.
♻ ☆ GameFactory: Creating New Games with Generative Interactive Videos ICCV 2025
Generative videos have the potential to revolutionize game development by autonomously creating new content. In this paper, we present GameFactory, a framework for action-controlled scene-generalizable game video generation. We first address the fundamental challenge of action controllability by introducing GF-Minecraft, an action-annotated game video dataset without human bias, and developing an action control module that enables precise control over both keyboard and mouse inputs. We further extend to support autoregressive generation for unlimited-length interactive videos. More importantly, GameFactory tackles the critical challenge of scene-generalizable action control, which most existing methods fail to address. To enable the creation of entirely new and diverse games beyond fixed styles and scenes, we leverage the open-domain generative priors from pre-trained video diffusion models. To bridge the domain gap between open-domain priors and small-scale game datasets, we propose a multi-phase training strategy with a domain adapter that decouples game style learning from action control. This decoupling ensures that action control learning is no longer bound to specific game styles, thereby achieving scene-generalizable action control. Experimental results demonstrate that GameFactory effectively generates open-domain action-controllable game videos, representing a significant step forward in AI-driven game generation.
comment: ICCV 2025 Highlight, Project Page: https://yujiwen.github.io/gamefactory
♻ ☆ Reasoning Visual Language Model for Chest X-Ray Analysis
Vision-language models (VLMs) have shown strong promise for medical image analysis, but most remain opaque, offering predictions without the transparent, stepwise reasoning clinicians rely on. We present a framework that brings chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by reasoning-first training paradigms, our approach is designed to learn how experts reason, not just what they conclude, by aligning intermediate steps with observable image evidence and radiology workflow. Beyond accuracy, the explicit reasoning traces support clinical auditability: they reveal why a conclusion was reached, which alternatives were considered, and where uncertainty remains, enabling quality assurance, error analysis, and safer human-AI collaboration. Our model couples high-fidelity visual encoding with a two-stage training recipe: a reasoning-style supervised fine-tuning (SFT) followed by reinforcement learning (RL) that uses verifiable rewards over a list of X-ray abnormalities. The model outputs reasoning that mirrors radiologists systematic thought process, uncertainty, and differential diagnosis. In out-of-distribution evaluation, the approach achieves competitive multi-label classification while improving interpretability. In a reader study with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced time to finalize reports. We release code and the model NV-Reason-CXR-3B to support community progress toward trustworthy, explainable AI in chest radiography and other medical imaging tasks where reasoning quality is as critical as prediction quality.
comment: NV-Reason-CXR-3B
♻ ☆ VerifIoU -- Robustness of Object Detection to Perturbations SC
We introduce a novel Interval Bound Propagation (IBP) approach for the formal verification of object detection models, specifically targeting the Intersection over Union (IoU) metric. The approach has been implemented in an open source code, named IBP IoU, compatible with popular abstract interpretation based verification tools. The resulting verifier is evaluated on landing approach runway detection and handwritten digit recognition case studies. Comparisons against a baseline (Vanilla IBP IoU) highlight the superior performance of IBP IoU in ensuring accuracy and stability, contributing to more secure and robust machine learning applications.
comment: 44th Digital Avionics Systems Conference (DASC), Sep 2025, Montreal, Canada
♻ ☆ D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning -- A Benchmark Dataset and Method ICDM
Dark humor in online memes poses unique challenges due to its reliance on implicit, sensitive, and culturally contextual cues. To address the lack of resources and methods for detecting dark humor in multimodal content, we introduce a novel dataset of 4,379 Reddit memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating (mild, moderate, severe). Building on this resource, we propose a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model (VLM). Through a Role-Reversal Self-Loop, VLM adopts the author's perspective to iteratively refine its explanations, ensuring completeness and alignment. We then extract textual features from both the OCR transcript and the self-refined reasoning via a text encoder, while visual features are obtained using a vision transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three streams, text, image, and reasoning, via pairwise attention mechanisms, producing a unified representation for classification. Experimental results demonstrate that our approach outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are released to facilitate further research in multimodal humor understanding and content moderation. Code and Dataset are available at: https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning
comment: Accepted at IEEE International Conference on Data Mining (ICDM) 2025
♻ ☆ DPA: A one-stop metric to measure bias amplification in classification datasets NeurIPS 2025
Most ML datasets today contain biases. When we train models on these datasets, they often not only learn these biases but can worsen them -- a phenomenon known as bias amplification. Several co-occurrence-based metrics have been proposed to measure bias amplification in classification datasets. They measure bias amplification between a protected attribute (e.g., gender) and a task (e.g., cooking). These metrics also support fine-grained bias analysis by identifying the direction in which a model amplifies biases. However, co-occurrence-based metrics have limitations -- some fail to measure bias amplification in balanced datasets, while others fail to measure negative bias amplification. To solve these issues, recent work proposed a predictability-based metric called leakage amplification (LA). However, LA cannot identify the direction in which a model amplifies biases. We propose Directional Predictability Amplification (DPA), a predictability-based metric that is (1) directional, (2) works with balanced and unbalanced datasets, and (3) correctly identifies positive and negative bias amplification. DPA eliminates the need to evaluate models on multiple metrics to verify these three aspects. DPA also improves over prior predictability-based metrics like LA: it is less sensitive to the choice of attacker function (a hyperparameter in predictability-based metrics), reports scores within a bounded range, and accounts for dataset bias by measuring relative changes in predictability. Our experiments on well-known datasets like COMPAS (a tabular dataset), COCO, and ImSitu (image datasets) show that DPA is the most reliable metric to measure bias amplification in classification problems. To compare DPA with existing bias amplification metrics, we released a one-stop library of major bias amplification metrics at https://github.com/kerner-lab/Bias-Amplification.
comment: Accepted at NeurIPS 2025
♻ ☆ SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding
Multimodal large language models (MLLMs) have achieved remarkable progress across a range of vision-language tasks and demonstrate strong potential for traffic accident understanding. However, existing MLLMs in this domain primarily focus on coarse-grained image-level or video-level comprehension and often struggle to handle fine-grained visual details or localized scene components, limiting their applicability in complex accident scenarios. To address these limitations, we propose SafePLUG, a novel framework that empowers MLLMs with both Pixel-Level Understanding and temporal Grounding for comprehensive traffic accident analysis. SafePLUG supports both arbitrary-shaped visual prompts for region-aware question answering and pixel-level segmentation based on language instructions, while also enabling the recognition of temporally anchored events in traffic accident scenarios. To advance the development of MLLMs for traffic accident understanding, we curate a new dataset containing multimodal question-answer pairs centered on diverse accident scenarios, with detailed pixel-level annotations and temporal event boundaries. Experimental results show that SafePLUG achieves strong performance on multiple tasks, including region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding. These capabilities lay a foundation for fine-grained understanding of complex traffic scenes, with the potential to improve driving safety and enhance situational awareness in smart transportation systems. The code, dataset, and model checkpoints will be made publicly available at: https://zihaosheng.github.io/SafePLUG
comment: The code, dataset, and model checkpoints will be made publicly available at: https://zihaosheng.github.io/SafePLUG
♻ ☆ $\mathtt{M^3VIR}$: A Large-Scale Multi-Modality Multi-View Synthesized Benchmark Dataset for Image Restoration and Content Creation
The gaming and entertainment industry is rapidly evolving, driven by immersive experiences and the integration of generative AI (GAI) technologies. Training such models effectively requires large-scale datasets that capture the diversity and context of gaming environments. However, existing datasets are often limited to specific domains or rely on artificial degradations, which do not accurately capture the unique characteristics of gaming content. Moreover, benchmarks for controllable video generation remain absent. To address these limitations, we introduce $\mathtt{M^3VIR}$, a large-scale, multi-modal, multi-view dataset specifically designed to overcome the shortcomings of current resources. Unlike existing datasets, $\mathtt{M^3VIR}$ provides diverse, high-fidelity gaming content rendered with Unreal Engine 5, offering authentic ground-truth LR-HR paired and multi-view frames across 80 scenes in 8 categories. It includes $\mathtt{M^3VIR\_MR}$ for super-resolution (SR), novel view synthesis (NVS), and combined NVS+SR tasks, and $\mathtt{M^3VIR\_{MS}}$, the first multi-style, object-level ground-truth set enabling research on controlled video generation. Additionally, we benchmark several state-of-the-art SR and NVS methods to establish performance baselines. While no existing approaches directly handle controlled video generation, $\mathtt{M^3VIR}$ provides a benchmark for advancing this area. By releasing the dataset, we aim to facilitate research in AI-powered restoration, compression, and controllable content generation for next-generation cloud gaming and entertainment.
♻ ☆ StateSpaceDiffuser: Bringing Long Context to Diffusion World Models
World models have recently gained prominence for action-conditioned visual prediction in complex environments. However, relying on only a few recent observations causes them to lose long-term context. Consequently, within a few steps, the generated scenes drift from what was previously observed, undermining temporal coherence. This limitation, common in state-of-the-art world models, which are diffusion-based, stems from the lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, where a diffusion model is enabled to perform long-context tasks by integrating features from a state-space model, representing the entire interaction history. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is highly effective in demonstrating both visual details and long-term memory. Project page: https://insait-institute.github.io/StateSpaceDiffuser/.
♻ ☆ Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering
Large vision-language models (VLMs) achieve strong performance in Visual Question Answering but still rely heavily on supervised fine-tuning (SFT) with massive labeled datasets, which is costly due to human annotations. Crucially, real-world datasets often exhibit human uncertainty (HU) -- variation in human confidence across annotations -- but standard SFT simply optimizes toward the most frequent label, disregarding HU distributions. This leaves two open questions: How does HU affect SFT, and how can HU be effectively leveraged in training? In this work, we first conduct a systematic evaluation of VLMs across varying HU levels. We have two key findings: (i) surprisingly, high-HU samples contribute little or even degrade model performance, and (ii) naively training on the full dataset yields under-calibrated models that fail to capture HU distributions. Motivated by these findings, we introduce HaDola, a human uncertainty-aware data selection and automatic labeling framework. HaDola operates in four stages -- discriminate, self-annotate, error trigger, and training -- to iteratively identify harmful samples, prioritize informative ones, and bootstrap from a small seed set (5\% of data). Our approach substantially reduces reliance on costly HU annotations and makes VLMs more accurate and better calibrated. Extensive experiments on VQAv2 and VizWiz datasets demonstrate that HaDola consistently matches or outperforms state-of-the-art baselines with less training data. Our work highlights the importance of explicitly modeling HU in SFT, suggesting that better utilization of HU is more effective than merely scaling up dataset size.
♻ ☆ How Should One Evaluate Monocular Depth Estimation?
Monocular depth estimation is an important task with rapid progress, but how to evaluate it remains an open question, as evidenced by a lack of standardization in existing literature and a large selection of evaluation metrics whose trade-offs and behaviors are not well understood. This paper contributes a novel, quantitative analysis of existing metrics in terms of their sensitivity to various types of perturbations of ground truth, emphasizing comparison to human judgment. Our analysis reveals that existing metrics are severely under-sensitive to curvature perturbation such as making flat surfaces wavy. To remedy this, we introduce a new metric based on relative surface normals, along with new depth visualization tools and a principled method to create composite metrics with better human alignment. Code and data are available at: https://github.com/princeton-vl/evalmde.
♻ ☆ CARE: Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams
The recognition of Activities of Daily Living (ADLs) from event-triggered ambient sensors is an essential task in Ambient Assisted Living, yet existing methods remain constrained by representation-level limitations. Sequence-based approaches preserve temporal order of sensor activations but are sensitive to noise and lack spatial awareness, while image-based approaches capture global patterns and implicit spatial correlations but compress fine-grained temporal dynamics and distort sensor layouts. Naive fusion (e.g., feature concatenation) fail to enforce alignment between sequence- and image-based representation views, underutilizing their complementary strengths. We propose Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams (CARE), an end-to-end framework that jointly optimizes representation learning via Sequence-Image Contrastive Alignment (SICA) and classification via cross-entropy, ensuring both cross-representation alignment and task-specific discriminability. CARE integrates (i) time-aware, noise-resilient sequence encoding with (ii) spatially-informed and frequency-sensitive image representations, and employs (iii) a joint contrastive-classification objective for end-to-end learning of aligned and discriminative embeddings. Evaluated on three CASAS datasets, CARE achieves state-of-the-art performance (89.8% on Milan, 88.9% on Cairo, and 73.3% on Kyoto7) and demonstrates robustness to sensor malfunctions and layout variability, highlighting its potential for reliable ADL recognition in smart homes.
♻ ☆ PROFIT: A Specialized Optimizer for Deep Fine Tuning NeurIPS 2025
The fine-tuning of pre-trained models has become ubiquitous in generative AI, computer vision, and robotics. Although much attention has been paid to improving the efficiency of fine-tuning model, there has been less scholarship around fine-tuning specifically for improved model performance. To remedy this gap, we present PROFIT, one of the first optimizers designed to incrementally fine-tune converged models on new tasks and/or datasets. Unlike traditional optimizers such as SGD or Adam, which make minimal assumptions due to random initializations, PROFIT takes the properties of a converged model into account explicitly to regularize the optimization process. Employing a temporal gradient-orthogonalization process, PROFIT outperforms fine-tuning methods in various tasks, from image classification to multimodal language model training to large-scale motion prediction. Moreover, PROFIT is encapsulated as a modular optimizer, which makes it easy to integrate directly into any training pipeline with minimal engineering effort.
comment: technical report, 23 pages, NeurIPS 2025 poster
♻ ☆ ε-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data
Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce {\epsilon}-Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center-region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse (0.05% of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster wrt. the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose a MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of {\epsilon}-Seg and baseline methods on 2 dense EM datasets of biological tissues and demonstrate the applicability of our method also on fluorescence microscopy data. Our results show that {\epsilon}-Seg is capable of achieving competitive sparsely-supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available.
comment: 10 pages main text, 17 pages total
Artificial Intelligence 239
☆ Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io
comment: Project Page: https://video-cof.github.io
☆ Gistify! Codebase-Level Understanding via Runtime Execution
As coding agents are increasingly deployed in large codebases, the need to automatically design challenging, codebase-level evaluation is central. We propose Gistify, a task where a coding LLM must create a single, minimal, self-contained file that can reproduce a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a python command), and the generated file must replicate the output of the same command ran under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires both structural understanding of the codebase, accurate modeling of its execution flow as well as the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long executions traces.
☆ Defeating the Training-Inference Mismatch via FP16
Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that breaks the consistency between training and inference. In this work, we demonstrate that simply reverting to \textbf{FP16} effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks with only a few lines of code change, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
☆ Remote Labor Index: Measuring AI Automation of Remote Work
AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable projects designed to evaluate end-to-end agent performance in practical settings. AI agents perform near the floor on RLI, with the highest-performing agent achieving an automation rate of 2.5%. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking AI impacts and enabling stakeholders to proactively navigate AI-driven labor automation.
comment: Website: https://www.remotelabor.ai
☆ LLMs Process Lists With General Filter Heads
We investigate the mechanisms underlying a range of list-processing tasks in LLMs, and we find that LLMs have learned to encode a compact, causal representation of a general filtering operation that mirrors the generic "filter" function of functional programming. Using causal mediation analysis on a diverse set of list-processing tasks, we find that a small number of attention heads, which we dub filter heads, encode a compact representation of the filtering predicate in their query states at certain tokens. We demonstrate that this predicate representation is general and portable: it can be extracted and reapplied to execute the same filtering operation on different collections, presented in different formats, languages, or even in tasks. However, we also identify situations where transformer LMs can exploit a different strategy for filtering: eagerly evaluating if an item satisfies the predicate and storing this intermediate result as a flag directly in the item representations. Our results reveal that transformer LMs can develop human-interpretable implementations of abstract computational operations that generalize in ways that are surprisingly similar to strategies used in traditional functional programming patterns.
comment: Code and data at https://filter.baulab.info/
☆ Clone Deterministic 3D Worlds with Geometrically-Regularized World Models
A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. Despite rapid progress, current world models remain brittle and degrade over long horizons. We argue that a central cause is representation quality: exteroceptive inputs (e.g., images) are high-dimensional, and lossy or entangled latents make dynamics learning unnecessarily hard. We therefore ask whether improving representation learning alone can substantially improve world-model performance. In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone and overfit to a deterministic 3D world. We propose Geometrically-Regularized World Models (GRWM), which enforces that consecutive points along a natural sensory trajectory remain close in latent representation space. This approach yields significantly improved latent representations that align closely with the true topology of the environment. GRWM is plug-and-play, requires only minimal architectural modification, scales with trajectory length, and is compatible with diverse latent generative backbones. Across deterministic 3D settings and long-horizon prediction tasks, GRWM significantly increases rollout fidelity and stability. Analyses show that its benefits stem from learning a latent manifold with superior geometric structure. These findings support a clear takeaway: improving representation learning is a direct and useful path to robust world models, delivering reliable long-horizon predictions without enlarging the dynamics module.
☆ Faithful and Fast Influence Function via Advanced Sampling
How can we explain the influence of training data on black-box models? Influence functions (IFs) offer a post-hoc solution by utilizing gradients and Hessians. However, computing the Hessian for an entire dataset is resource-intensive, necessitating a feasible alternative. A common approach involves randomly sampling a small subset of the training data, but this method often results in highly inconsistent IF estimates due to the high variance in sample configurations. To address this, we propose two advanced sampling techniques based on features and logits. These samplers select a small yet representative subset of the entire dataset by considering the stochastic distribution of features or logits, thereby enhancing the accuracy of IF estimations. We validate our approach through class removal experiments, a typical application of IFs, using the F1-score to measure how effectively the model forgets the removed class while maintaining inference consistency on the remaining classes. Our method reduces computation time by 30.1% and memory usage by 42.2%, or improves the F1-score by 2.5% compared to the baseline.
☆ STaMP: Sequence Transformation and Mixed Precision for Low-Precision Activation Quantization
Quantization is the key method for reducing inference latency, power and memory footprint of generative AI models. However, accuracy often degrades sharply when activations are quantized below eight bits. Recent work suggests that invertible linear transformations (e.g. rotations) can aid quantization, by reparameterizing feature channels and weights. In this paper, we propose \textit{Sequence Transformation and Mixed Precision} (STaMP) quantization, a novel strategy that applies linear transformations along the \textit{sequence} dimension to exploit the strong local correlation in language and visual data. By keeping a small number of tokens in each intermediate activation at higher precision, we can maintain model accuracy at lower (average) activations bit-widths. We evaluate STaMP on recent LVM and LLM architectures, demonstrating that it significantly improves low bit width activation quantization and complements established activation and weight quantization methods including recent feature transformations.
comment: 10 pages main text, 8 pages supplementary material
☆ AMO-Bench: Large Language Models Still Struggle in High School Math Competitions
We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions for evaluating mathematical reasoning capabilities of large language models (LLMs). However, many existing math competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original problems to prevent potential performance leakages from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for evaluation. Experimental results across 26 LLMs on AMO-Bench show that even the best-performing model achieves only 52.4% accuracy on AMO-Bench, with most LLMs scoring below 40%. Beyond these poor performances, our further analysis reveals a promising scaling trend with increasing test-time compute on AMO-Bench. These results highlight the significant room for improving the mathematical reasoning in current LLMs. We release AMO-Bench to facilitate further research into advancing the reasoning abilities of language models. https://amo-bench.github.io/
comment: 14 pages, 9 figures
☆ The Oversight Game: Learning to Cooperatively Balance an AI Agent's Safety and Autonomy
As increasingly capable agents are deployed, a central safety question is how to retain meaningful human control without modifying the underlying system. We study a minimal control interface where an agent chooses whether to act autonomously (play) or defer (ask), while a human simultaneously chooses whether to be permissive (trust) or to engage in oversight (oversee). If the agent defers, the human's choice determines the outcome, potentially leading to a corrective action or a system shutdown. We model this interaction as a two-player Markov Game. Our analysis focuses on cases where this game qualifies as a Markov Potential Game (MPG), a class of games where we can provide an alignment guarantee: under a structural assumption on the human's value function, any decision by the agent to act more autonomously that benefits itself cannot harm the human's value. We also analyze extensions to this MPG framework. Theoretically, this perspective provides conditions for a specific form of intrinsic alignment. If the reward structures of the human-agent game meet these conditions, we have a formal guarantee that the agent improving its own outcome will not harm the human's. Practically, this model motivates a transparent control layer with predictable incentives where the agent learns to defer when risky and act when safe, while its pretrained policy and the environment's reward structure remain untouched. Our gridworld simulation shows that through independent learning, the agent and human discover their optimal oversight roles. The agent learns to ask when uncertain and the human learns when to oversee, leading to an emergent collaboration that avoids safety violations introduced post-training. This demonstrates a practical method for making misaligned models safer after deployment.
☆ Deep sequence models tend to memorize geometrically; it is unclear why
In sequence modeling, the parametric memory of atomic facts has been predominantly abstracted as a brute-force lookup of co-occurrences between entities. We contrast this associative view against a geometric view of how memory is stored. We begin by isolating a clean and analyzable instance of Transformer reasoning that is incompatible with memory as strictly a storage of the local co-occurrences specified during training. Instead, the model must have somehow synthesized its own geometry of atomic facts, encoding global relationships between all entities, including non-co-occurring ones. This in turn has simplified a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn 1-step geometric task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, despite optimizing over mere local associations, cannot be straightforwardly attributed to typical architectural or optimizational pressures. Counterintuitively, an elegant geometry is learned even when it is not more succinct than a brute-force lookup of associations. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that -- in contrast to prevailing theories -- indeed arises naturally despite the lack of various pressures. This analysis also points to practitioners a visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery and unlearning.
☆ A General Incentives-Based Framework for Fairness in Multi-agent Resource Allocation
We introduce the General Incentives-based Framework for Fairness (GIFF), a novel approach for fair multi-agent resource allocation that infers fair decision-making from standard value functions. In resource-constrained settings, agents optimizing for efficiency often create inequitable outcomes. Our approach leverages the action-value (Q-)function to balance efficiency and fairness without requiring additional training. Specifically, our method computes a local fairness gain for each action and introduces a counterfactual advantage correction term to discourage over-allocation to already well-off agents. This approach is formalized within a centralized control setting, where an arbitrator uses the GIFF-modified Q-values to solve an allocation problem. Empirical evaluations across diverse domains, including dynamic ridesharing, homelessness prevention, and a complex job allocation task-demonstrate that our framework consistently outperforms strong baselines and can discover far-sighted, equitable policies. The framework's effectiveness is supported by a theoretical foundation; we prove its fairness surrogate is a principled lower bound on the true fairness improvement and that its trade-off parameter offers monotonic tuning. Our findings establish GIFF as a robust and principled framework for leveraging standard reinforcement learning components to achieve more equitable outcomes in complex multi-agent systems.
☆ Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models
This paper presents a comprehensive cross-platform evaluation of reasoning capabilities in contemporary foundation models, establishing an infrastructure-agnostic benchmark across three computational paradigms: HPC supercomputing (MareNostrum 5), cloud platforms (Nebius AI Studio), and university clusters (a node with eight H200 GPUs). We evaluate 15 foundation models across 79 problems spanning eight academic domains (Physics, Mathematics, Chemistry, Economics, Biology, Statistics, Calculus, and Optimization) through three experimental phases: (1) Baseline establishment: Six models (Mixtral-8x7B, Phi-3, LLaMA 3.1-8B, Gemma-2-9b, Mistral-7B, OLMo-7B) evaluated on 19 problems using MareNostrum 5, establishing methodology and reference performance; (2) Infrastructure validation: The 19-problem benchmark repeated on university cluster (seven models including Falcon-Mamba state-space architecture) and Nebius AI Studio (nine state-of-the-art models: Hermes-4 70B/405B, LLaMA 3.1-405B/3.3-70B, Qwen3 30B/235B, DeepSeek-R1, GPT-OSS 20B/120B) to confirm infrastructure-agnostic reproducibility; (3) Extended evaluation: Full 79-problem assessment on both university cluster and Nebius platforms, probing generalization at scale across architectural diversity. The findings challenge conventional scaling assumptions, establish training data quality as more critical than model size, and provide actionable guidelines for model selection across educational, production, and research contexts. The tri-infrastructure methodology and 79-problem benchmark enable longitudinal tracking of reasoning capabilities as foundation models evolve.
☆ ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference
The expansion of large language models is increasingly limited by the constrained memory capacity of modern GPUs. To mitigate this, Mixture-of-Experts (MoE) architectures activate only a small portion of parameters during inference, significantly lowering both memory demand and computational overhead. However, conventional MoE inference approaches, which select active experts independently at each layer, often introduce considerable latency because of frequent parameter transfers between host and GPU memory. In addition, current cross-layer prediction strategies, which are typically based on fixed steps, lack adaptability across different hardware platforms and workloads, thereby reducing their robustness and effectiveness. To address these challenges, we present ExpertFlow, a runtime system for MoE inference that combines adaptive expert prefetching and cache-aware routing. ExpertFlow continuously adjusts its prediction horizon for expert activation by leveraging runtime statistics such as transfer bandwidth, parameter dimensionality, and model feedback signals. Furthermore, it incorporates a hybrid cross-layer prediction scheme that fuses pregating information with intermediate computational states to anticipate future expert needs. By adaptively refining prefetching decisions and aligning them with actual usage behavior, ExpertFlow effectively decreases cache misses and removes latency caused by expert swap-ins. Our evaluation demonstrates that ExpertFlow reduces model stall time to less than 0.1% of the baseline, highlighting its capability to optimize MoE inference under stringent memory constraints.
comment: 12 pages, 11 figures
☆ Non-Convex Over-the-Air Heterogeneous Federated Learning: A Bias-Variance Trade-off
Over-the-air (OTA) federated learning (FL) has been well recognized as a scalable paradigm that exploits the waveform superposition of the wireless multiple-access channel to aggregate model updates in a single use. Existing OTA-FL designs largely enforce zero-bias model updates by either assuming \emph{homogeneous} wireless conditions (equal path loss across devices) or forcing zero-bias updates to guarantee convergence. Under \emph{heterogeneous} wireless scenarios, however, such designs are constrained by the weakest device and inflate the update variance. Moreover, prior analyses of biased OTA-FL largely address convex objectives, while most modern AI models are highly non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient descent (SGD) for general smooth non-convex objectives under wireless heterogeneity. We develop novel OTA-FL SGD updates that allow a structured, time-invariant model bias while facilitating reduced variance updates. We derive a finite-time stationarity bound (expected time average squared gradient norm) that explicitly reveals a bias-variance trade-off. To optimize this trade-off, we pose a non-convex joint OTA power-control design and develop an efficient successive convex approximation (SCA) algorithm that requires only statistical CSI at the base station. Experiments on a non-convex image classification task validate the approach: the SCA-based design accelerates convergence via an optimized bias and improves generalization over prior OTA-FL baselines.
☆ Unveiling Intrinsic Text Bias in Multimodal Large Language Models through Attention Key-Space Analysis
Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data, limiting their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, we propose that the bias originates from the model's internal architecture. Specifically, we hypothesize that visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space learned during language-only pretraining. Consequently, these visual keys receive systematically lower similarity scores during attention computation, leading to their under-utilization in the context representation. To validate this hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their distributional structures using qualitative (t-SNE) and quantitative (Jensen-Shannon divergence) methods. The results provide direct evidence that visual and textual keys occupy markedly distinct subspaces within the attention space. The inter-modal divergence is statistically significant, exceeding intra-modal variation by several orders of magnitude. These findings reveal that text bias arises from an intrinsic misalignment within the attention key space rather than solely from external data factors.
☆ On the limitation of evaluating machine unlearning using only a single training seed
Machine unlearning (MU) aims to remove the influence of certain data points from a trained model without costly retraining. Most practical MU algorithms are only approximate and their performance can only be assessed empirically. Care must therefore be taken to make empirical comparisons as representative as possible. A common practice is to run the MU algorithm multiple times independently starting from the same trained model. In this work, we demonstrate that this practice can give highly non-representative results because -- even for the same architecture and same dataset -- some MU methods can be highly sensitive to the choice of random number seed used for model training. We therefore recommend that empirical comphttps://info.arxiv.org/help/prep#commentsarisons of MU algorithms should also reflect the variability across different model training seeds.
comment: mini paper, 2 figures
☆ Delegated Authorization for Agents Constrained to Semantic Task-to-Scope Matching
Authorizing Large Language Model driven agents to dynamically invoke tools and access protected resources introduces significant risks, since current methods for delegating authorization grant overly broad permissions and give access to tools allowing agents to operate beyond the intended task scope. We introduce and assess a delegated authorization model enabling authorization servers to semantically inspect access requests to protected resources, and issue access tokens constrained to the minimal set of scopes necessary for the agents' assigned tasks. Given the unavailability of datasets centered on delegated authorization flows, particularly including both semantically appropriate and inappropriate scope requests for a given task, we introduce ASTRA, a dataset and data generation pipeline for benchmarking semantic matching between tasks and scopes. Our experiments show both the potential and current limitations of model-based matching, particularly as the number of scopes needed for task completion increases. Our results highlight the need for further research into semantic matching techniques enabling intent-aware authorization for multi-agent and tool-augmented applications, including fine-grained control, such as Task-Based Access Control (TBAC).
comment: Paper page at https://outshift-open.github.io/ASTRA
☆ The End of Manual Decoding: Towards Truly End-to-End Language Models
The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious, hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set"-a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
☆ Process Integrated Computer Vision for Real-Time Failure Prediction in Steel Rolling Mill
We present a long-term deployment study of a machine vision-based anomaly detection system for failure prediction in a steel rolling mill. The system integrates industrial cameras to monitor equipment operation, alignment, and hot bar motion in real time along the process line. Live video streams are processed on a centralized video server using deep learning models, enabling early prediction of equipment failures and process interruptions, thereby reducing unplanned breakdown costs. Server-based inference minimizes the computational load on industrial process control systems (PLCs), supporting scalable deployment across production lines with minimal additional resources. By jointly analyzing sensor data from data acquisition systems and visual inputs, the system identifies the location and probable root causes of failures, providing actionable insights for proactive maintenance. This integrated approach enhances operational reliability, productivity, and profitability in industrial manufacturing environments.
☆ Evontree: Ontology Rule-Guided Self-Evolution of Large Language Models
Large language models (LLMs) have demonstrated exceptional capabilities across multiple domains by leveraging massive pre-training and curated fine-tuning data. However, in data-sensitive fields such as healthcare, the lack of high-quality, domain-specific training corpus hinders LLMs' adaptation for specialized applications. Meanwhile, domain experts have distilled domain wisdom into ontology rules, which formalize relationships among concepts and ensure the integrity of knowledge management repositories. Viewing LLMs as implicit repositories of human knowledge, we propose Evontree, a novel framework that leverages a small set of high-quality ontology rules to systematically extract, validate, and enhance domain knowledge within LLMs, without requiring extensive external datasets. Specifically, Evontree extracts domain ontology from raw models, detects inconsistencies using two core ontology rules, and reinforces the refined knowledge via self-distilled fine-tuning. Extensive experiments on medical QA benchmarks with Llama3-8B-Instruct and Med42-v2 demonstrate consistent outperformance over both unmodified models and leading supervised baselines, achieving up to a 3.7% improvement in accuracy. These results confirm the effectiveness, efficiency, and robustness of our approach for low-resource domain adaptation of LLMs.
☆ The Era of Agentic Organization: Learning to Organize with Language Models
We envision a new era of AI, termed agentic organization, where agents solve complex problems by working collaboratively and concurrently, enabling outcomes beyond individual intelligence. To realize this vision, we introduce asynchronous thinking (AsyncThink) as a new paradigm of reasoning with large language models, which organizes the internal thinking process into concurrently executable structures. Specifically, we propose a thinking protocol where an organizer dynamically assigns sub-queries to workers, merges intermediate knowledge, and produces coherent solutions. More importantly, the thinking structure in this protocol can be further optimized through reinforcement learning. Experiments demonstrate that AsyncThink achieves 28% lower inference latency compared to parallel thinking while improving accuracy on mathematical reasoning. Moreover, AsyncThink generalizes its learned asynchronous thinking capabilities, effectively tackling unseen tasks without additional training.
☆ Hybrid DQN-TD3 Reinforcement Learning for Autonomous Navigation in Dynamic Environments
This paper presents a hierarchical path-planning and control framework that combines a high-level Deep Q-Network (DQN) for discrete sub-goal selection with a low-level Twin Delayed Deep Deterministic Policy Gradient (TD3) controller for continuous actuation. The high-level module selects behaviors and sub-goals; the low-level module executes smooth velocity commands. We design a practical reward shaping scheme (direction, distance, obstacle avoidance, action smoothness, collision penalty, time penalty, and progress), together with a LiDAR-based safety gate that prevents unsafe motions. The system is implemented in ROS + Gazebo (TurtleBot3) and evaluated with PathBench metrics, including success rate, collision rate, path efficiency, and re-planning efficiency, in dynamic and partially observable environments. Experiments show improved success rate and sample efficiency over single-algorithm baselines (DQN or TD3 alone) and rule-based planners, with better generalization to unseen obstacle configurations and reduced abrupt control changes. Code and evaluation scripts are available at the project repository.
comment: 6 pages, 5 figures; ROS+Gazebo (TurtleBot3) implementation; evaluation with PathBench metrics; code (primary): https://github.com/MayaCHEN-github/HierarchicalRL-robot-navigation; mirror (for reproducibility): https://github.com/ShowyHe/DRL-robot-navigation
☆ Aeolus: A Multi-structural Flight Delay Dataset
We introduce Aeolus, a large-scale Multi-modal Flight Delay Dataset designed to advance research on flight delay prediction and support the development of foundation models for tabular data. Existing datasets in this domain are typically limited to flat tabular structures and fail to capture the spatiotemporal dynamics inherent in delay propagation. Aeolus addresses this limitation by providing three aligned modalities: (i) a tabular dataset with rich operational, meteorological, and airportlevel features for over 50 million flights; (ii) a flight chain module that models delay propagation along sequential flight legs, capturing upstream and downstream dependencies; and (iii) a flight network graph that encodes shared aircraft, crew, and airport resource connections, enabling cross-flight relational reasoning. The dataset is carefully constructed with temporal splits, comprehensive features, and strict leakage prevention to support realistic and reproducible machine learning evaluation. Aeolus supports a broad range of tasks, including regression, classification, temporal structure modeling, and graph learning, serving as a unified benchmark across tabular, sequential, and graph modalities. We release baseline experiments and preprocessing tools to facilitate adoption. Aeolus fills a key gap for both domain-specific modeling and general-purpose structured data research.Our source code and data can be accessed at https://github.com/Flnny/Delay-data
☆ Normative Reasoning in Large Language Models: A Comparative Benchmark from Logical and Modal Perspectives EMNLP 2025
Normative reasoning is a type of reasoning that involves normative or deontic modality, such as obligation and permission. While large language models (LLMs) have demonstrated remarkable performance across various reasoning tasks, their ability to handle normative reasoning remains underexplored. In this paper, we systematically evaluate LLMs' reasoning capabilities in the normative domain from both logical and modal perspectives. Specifically, to assess how well LLMs reason with normative modals, we make a comparison between their reasoning with normative modals and their reasoning with epistemic modals, which share a common formal structure. To this end, we introduce a new dataset covering a wide range of formal patterns of reasoning in both normative and epistemic domains, while also incorporating non-formal cognitive factors that influence human reasoning. Our results indicate that, although LLMs generally adhere to valid reasoning patterns, they exhibit notable inconsistencies in specific types of normative reasoning and display cognitive biases similar to those observed in psychological studies of human reasoning. These findings highlight challenges in achieving logical consistency in LLMs' normative reasoning and provide insights for enhancing their reliability. All data and code are released publicly at https://github.com/kmineshima/NeuBAROCO.
comment: Accepted to the 8th BlackboxNLP Workshop at EMNLP 2025
☆ Agentic AI Home Energy Management System: A Large Language Model Framework for Residential Load Scheduling
The electricity sector transition requires substantial increases in residential demand response capacity, yet Home Energy Management Systems (HEMS) adoption remains limited by user interaction barriers requiring translation of everyday preferences into technical parameters. While large language models have been applied to energy systems as code generators and parameter extractors, no existing implementation deploys LLMs as autonomous coordinators managing the complete workflow from natural language input to multi-appliance scheduling. This paper presents an agentic AI HEMS where LLMs autonomously coordinate multi-appliance scheduling from natural language requests to device control, achieving optimal scheduling without example demonstrations. A hierarchical architecture combining one orchestrator with three specialist agents uses the ReAct pattern for iterative reasoning, enabling dynamic coordination without hardcoded workflows while integrating Google Calendar for context-aware deadline extraction. Evaluation across three open-source models using real Austrian day-ahead electricity prices reveals substantial capability differences. Llama-3.3-70B successfully coordinates all appliances across all scenarios to match cost-optimal benchmarks computed via mixed-integer linear programming, while other models achieve perfect single-appliance performance but struggle to coordinate all appliances simultaneously. Progressive prompt engineering experiments demonstrate that analytical query handling without explicit guidance remains unreliable despite models' general reasoning capabilities. We open-source the complete system including orchestration logic, agent prompts, tools, and web interfaces to enable reproducibility, extension, and future research.
comment: 34 pages, 9 figures. Code available at https://github.com/RedaElMakroum/agentic-ai-hems
☆ ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching
Computational Super-Resolution (CSR) in fluorescence microscopy has, despite being an ill-posed problem, a long history. At its very core, CSR is about finding a prior that can be used to extrapolate frequencies in a micrograph that have never been imaged by the image-generating microscope. It stands to reason that, with the advent of better data-driven machine learning techniques, stronger prior can be learned and hence CSR can lead to better results. Here, we present ResMatching, a novel CSR method that uses guided conditional flow matching to learn such improved data-priors. We evaluate ResMatching on 4 diverse biological structures from the BioSR dataset and compare its results against 7 baselines. ResMatching consistently achieves competitive results, demonstrating in all cases the best trade-off between data fidelity and perceptual realism. We observe that CSR using ResMatching is particularly effective in cases where a strong prior is hard to learn, e.g. when the given low-resolution images contain a lot of noise. Additionally, we show that ResMatching can be used to sample from an implicitly learned posterior distribution and that this distribution is calibrated for all tested use-cases, enabling our method to deliver a pixel-wise data-uncertainty term that can guide future users to reject uncertain predictions.
comment: 5 pages, 4 figures
☆ Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems
While Multi-Agent Systems (MAS) excel at complex tasks, their growing autonomy with operational complexity often leads to critical inefficiencies, such as excessive token consumption and failures arising from misinformation. Existing methods primarily focus on post-hoc failure attribution, lacking proactive, real-time interventions to enhance robustness and efficiency. To this end, we introduce SupervisorAgent, a lightweight and modular framework for runtime, adaptive supervision that operates without altering the base agent's architecture. Triggered by an LLM-free adaptive filter, SupervisorAgent intervenes at critical junctures to proactively correct errors, guide inefficient behaviors, and purify observations. On the challenging GAIA benchmark, SupervisorAgent reduces the token consumption of the Smolagent framework by an average of 29.45% without compromising its success rate. Extensive experiments across five additional benchmarks (math reasoning, code generation, and question answering) and various SoTA foundation models validate the broad applicability and robustness of our approach. The code is available at https://github.com/LINs-lab/SupervisorAgent.
☆ InfoFlow: Reinforcing Search Agent Via Reward Density Optimization
Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for enhancing agentic deep search. However, its application is often hindered by low \textbf{Reward Density} in deep search scenarios, where agents expend significant exploratory costs for infrequent and often null final rewards. In this paper, we formalize this challenge as the \textbf{Reward Density Optimization} problem, which aims to improve the reward obtained per unit of exploration cost. This paper introduce \textbf{InfoFlow}, a systematic framework that tackles this problem from three aspects. 1) \textbf{Subproblem decomposition}: breaking down long-range tasks to assign process rewards, thereby providing denser learning signals. 2) \textbf{Failure-guided hints}: injecting corrective guidance into stalled trajectories to increase the probability of successful outcomes. 3) \textbf{Dual-agent refinement}: employing a dual-agent architecture to offload the cognitive burden of deep exploration. A refiner agent synthesizes the search history, which effectively compresses the researcher's perceived trajectory, thereby reducing exploration cost and increasing the overall reward density. We evaluate InfoFlow on multiple agentic search benchmarks, where it significantly outperforms strong baselines, enabling lightweight LLMs to achieve performance comparable to advanced proprietary LLMs.
☆ Multiclass Local Calibration With the Jensen-Shannon Distance
Developing trustworthy Machine Learning (ML) models requires their predicted probabilities to be well-calibrated, meaning they should reflect true-class frequencies. Among calibration notions in multiclass classification, strong calibration is the most stringent, as it requires all predicted probabilities to be simultaneously calibrated across all classes. However, existing approaches to multiclass calibration lack a notion of distance among inputs, which makes them vulnerable to proximity bias: predictions in sparse regions of the feature space are systematically miscalibrated. This is especially relevant in high-stakes settings, such as healthcare, where the sparse instances are exactly those most at risk of biased treatment. In this work, we address this main shortcoming by introducing a local perspective on multiclass calibration. First, we formally define multiclass local calibration and establish its relationship with strong calibration. Second, we theoretically analyze the pitfalls of existing evaluation metrics when applied to multiclass local calibration. Third, we propose a practical method for enhancing local calibration in Neural Networks, which enforces alignment between predicted probabilities and local estimates of class frequencies using the Jensen-Shannon distance. Finally, we empirically validate our approach against existing multiclass calibration techniques.
☆ Adaptive Inverse Kinematics Framework for Learning Variable-Length Tool Manipulation in Robotics
Conventional robots possess a limited understanding of their kinematics and are confined to preprogrammed tasks, hindering their ability to leverage tools efficiently. Driven by the essential components of tool usage - grasping the desired outcome, selecting the most suitable tool, determining optimal tool orientation, and executing precise manipulations - we introduce a pioneering framework. Our novel approach expands the capabilities of the robot's inverse kinematics solver, empowering it to acquire a sequential repertoire of actions using tools of varying lengths. By integrating a simulation-learned action trajectory with the tool, we showcase the practicality of transferring acquired skills from simulation to real-world scenarios through comprehensive experimentation. Remarkably, our extended inverse kinematics solver demonstrates an impressive error rate of less than 1 cm. Furthermore, our trained policy achieves a mean error of 8 cm in simulation. Noteworthy, our model achieves virtually indistinguishable performance when employing two distinct tools of different lengths. This research provides an indication of potential advances in the exploration of all four fundamental aspects of tool usage, enabling robots to master the intricate art of tool manipulation across diverse tasks.
comment: 10 pages, 5 figures. Demonstrates a reinforcement learning framework for adaptive tool manipulation with variable-length extensions
☆ EdgeRunner 20B: Military Task Parity with GPT-5 while Running on the Edge
We present EdgeRunner 20B, a fine-tuned version of gpt-oss-20b optimized for military tasks. EdgeRunner 20B was trained on 1.6M high-quality records curated from military documentation and websites. We also present four new tests sets: (a) combat arms, (b) combat medic, (c) cyber operations, and (d) mil-bench-5k (general military knowledge). On these military test sets, EdgeRunner 20B matches or exceeds GPT-5 task performance with 95%+ statistical significance, except for the high reasoning setting on the combat medic test set and the low reasoning setting on the mil-bench-5k test set. Versus gpt-oss-20b, there is no statistically-significant regression on general-purpose benchmarks like ARC-C, GPQA Diamond, GSM8k, IFEval, MMLU Pro, or TruthfulQA, except for GSM8k in the low reasoning setting. We also present analyses on hyperparameter settings, cost, and throughput. These findings show that small, locally-hosted models are ideal solutions for data-sensitive operations such as in the military domain, allowing for deployment in air-gapped edge devices.
comment: 19 pages
☆ The Structure of Relation Decoding Linear Operators in Large Language Models NeurIPS 2025
This paper investigates the structure of linear operators introduced in Hernandez et al. [2023] that decode specific relational facts in transformer language models. We extend their single-relation findings to a collection of relations and systematically chart their organization. We show that such collections of relation decoders can be highly compressed by simple order-3 tensor networks without significant loss in decoding accuracy. To explain this surprising redundancy, we develop a cross-evaluation protocol, in which we apply each linear decoder operator to the subjects of every other relation. Our results reveal that these linear maps do not encode distinct relations, but extract recurring, coarse-grained semantic properties (e.g., country of capital city and country of food are both in the country-of-X property). This property-centric structure clarifies both the operators' compressibility and highlights why they generalize only to new relations that are semantically close. Our findings thus interpret linear relational decoding in transformer language models as primarily property-based, rather than relation-specific.
comment: NeurIPS 2025 (Spotlight)
☆ Human-AI Complementarity: A Goal for Amplified Oversight
Human feedback is critical for aligning AI systems to human values. As AI capabilities improve and AI is used to tackle more challenging tasks, verifying quality and safety becomes increasingly challenging. This paper explores how we can leverage AI to improve the quality of human oversight. We focus on an important safety problem that is already challenging for humans: fact-verification of AI outputs. We find that combining AI ratings and human ratings based on AI rater confidence is better than relying on either alone. Giving humans an AI fact-verification assistant further improves their accuracy, but the type of assistance matters. Displaying AI explanation, confidence, and labels leads to over-reliance, but just showing search results and evidence fosters more appropriate trust. These results have implications for Amplified Oversight -- the challenge of combining humans and AI to supervise AI systems even as they surpass human expert performance.
☆ Inside CORE-KG: Evaluating Structured Prompting and Coreference Resolution for Knowledge Graphs ICDM 2025
Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer critical insights but are often unstructured, lexically dense, and filled with ambiguous or shifting references, which pose significant challenges for automated knowledge graph (KG) construction. While recent LLM-based approaches improve over static templates, they still generate noisy, fragmented graphs with duplicate nodes due to the absence of guided extraction and coreference resolution. The recently proposed CORE-KG framework addresses these limitations by integrating a type-aware coreference module and domain-guided structured prompts, significantly reducing node duplication and legal noise. In this work, we present a systematic ablation study of CORE-KG to quantify the individual contributions of its two key components. Our results show that removing coreference resolution results in a 28.32% increase in node duplication and a 4.32% increase in noisy nodes, while removing structured prompts leads to a 4.34% increase in node duplication and a 73.33% increase in noisy nodes. These findings offer empirical insights for designing robust LLM-based pipelines for extracting structured representations from complex legal texts.
comment: ICDM 2025 Workshop
☆ Simulating and Experimenting with Social Media Mobilization Using LLM Agents
Online social networks have transformed the ways in which political mobilization messages are disseminated, raising new questions about how peer influence operates at scale. Building on the landmark 61-million-person Facebook experiment \citep{bond201261}, we develop an agent-based simulation framework that integrates real U.S. Census demographic distributions, authentic Twitter network topology, and heterogeneous large language model (LLM) agents to examine the effect of mobilization messages on voter turnout. Each simulated agent is assigned demographic attributes, a personal political stance, and an LLM variant (\texttt{GPT-4.1}, \texttt{GPT-4.1-Mini}, or \texttt{GPT-4.1-Nano}) reflecting its political sophistication. Agents interact over realistic social network structures, receiving personalized feeds and dynamically updating their engagement behaviors and voting intentions. Experimental conditions replicate the informational and social mobilization treatments of the original Facebook study. Across scenarios, the simulator reproduces qualitative patterns observed in field experiments, including stronger mobilization effects under social message treatments and measurable peer spillovers. Our framework provides a controlled, reproducible environment for testing counterfactual designs and sensitivity analyses in political mobilization research, offering a bridge between high-validity field experiments and flexible computational modeling.\footnote{Code and data available at https://github.com/CausalMP/LLM-SocioPol}
☆ Context Engineering 2.0: The Context of Context Engineering
Karl Marx once wrote that ``the human essence is the ensemble of social relations'', suggesting that individuals are not isolated entities but are fundamentally shaped by their interactions with other entities, within which contexts play a constitutive and essential role. With the advent of computers and artificial intelligence, these contexts are no longer limited to purely human--human interactions: human--machine interactions are included as well. Then a central question emerges: How can machines better understand our situations and purposes? To address this challenge, researchers have recently introduced the concept of context engineering. Although it is often regarded as a recent innovation of the agent era, we argue that related practices can be traced back more than twenty years. Since the early 1990s, the field has evolved through distinct historical phases, each shaped by the intelligence level of machines: from early human--computer interaction frameworks built around primitive computers, to today's human--agent interaction paradigms driven by intelligent agents, and potentially to human--level or superhuman intelligence in the future. In this paper, we situate context engineering, provide a systematic definition, outline its historical and conceptual landscape, and examine key design considerations for practice. By addressing these questions, we aim to offer a conceptual foundation for context engineering and sketch its promising future. This paper is a stepping stone for a broader community effort toward systematic context engineering in AI systems.
☆ LINK-KG: LLM-Driven Coreference-Resolved Knowledge Graphs for Human Smuggling Networks
Human smuggling networks are complex and constantly evolving, making them difficult to analyze comprehensively. Legal case documents offer rich factual and procedural insights into these networks but are often long, unstructured, and filled with ambiguous or shifting references, posing significant challenges for automated knowledge graph (KG) construction. Existing methods either overlook coreference resolution or fail to scale beyond short text spans, leading to fragmented graphs and inconsistent entity linking. We propose LINK-KG, a modular framework that integrates a three-stage, LLM-guided coreference resolution pipeline with downstream KG extraction. At the core of our approach is a type-specific Prompt Cache, which consistently tracks and resolves references across document chunks, enabling clean and disambiguated narratives for structured knowledge graph construction from both short and long legal texts. LINK-KG reduces average node duplication by 45.21% and noisy nodes by 32.22% compared to baseline methods, resulting in cleaner and more coherent graph structures. These improvements establish LINK-KG as a strong foundation for analyzing complex criminal networks.
comment: Accepted in ICKG 2025 Conference, 8 Pages, 2 Figures
☆ Bayesian Network Fusion of Large Language Models for Sentiment Analysis
Large language models (LLMs) continue to advance, with an increasing number of domain-specific variants tailored for specialised tasks. However, these models often lack transparency and explainability, can be costly to fine-tune, require substantial prompt engineering, yield inconsistent results across domains, and impose significant adverse environmental impact due to their high computational demands. To address these challenges, we propose the Bayesian network LLM fusion (BNLF) framework, which integrates predictions from three LLMs, including FinBERT, RoBERTa, and BERTweet, through a probabilistic mechanism for sentiment analysis. BNLF performs late fusion by modelling the sentiment predictions from multiple LLMs as probabilistic nodes within a Bayesian network. Evaluated across three human-annotated financial corpora with distinct linguistic and contextual characteristics, BNLF demonstrates consistent gains of about six percent in accuracy over the baseline LLMs, underscoring its robustness to dataset variability and the effectiveness of probabilistic fusion for interpretable sentiment classification.
☆ Who Has The Final Say? Conformity Dynamics in ChatGPT's Selections
Large language models (LLMs) such as ChatGPT are increasingly integrated into high-stakes decision-making, yet little is known about their susceptibility to social influence. We conducted three preregistered conformity experiments with GPT-4o in a hiring context. In a baseline study, GPT consistently favored the same candidate (Profile C), reported moderate expertise (M = 3.01) and high certainty (M = 3.89), and rarely changed its choice. In Study 1 (GPT + 8), GPT faced unanimous opposition from eight simulated partners and almost always conformed (99.9%), reporting lower certainty and significantly elevated self-reported informational and normative conformity (p < .001). In Study 2 (GPT + 1), GPT interacted with a single partner and still conformed in 40.2% of disagreement trials, reporting less certainty and more normative conformity. Across studies, results demonstrate that GPT does not act as an independent observer but adapts to perceived social consensus. These findings highlight risks of treating LLMs as neutral decision aids and underline the need to elicit AI judgments prior to exposing them to human opinions.
comment: 5 pages, 5 figures, HAI 2025: Workshop on Socially Aware and Cooperative Intelligent Systems
☆ Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (i.e., tail data). This leads to an imbalanced optimization that drives the model to prioritize simple reasoning skills, while hindering its ability to tackle more complex reasoning tasks. Over iterations, this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew effect"--which ultimately hinders further model improvement and leads to performance bottlenecks. To counteract this challenge, we introduce four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks demonstrate that our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
comment: Preprint
☆ SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning ICSE 2026
Identifying and addressing security issues during the early phase of the development lifecycle is critical for mitigating the long-term negative impacts on software systems. Code review serves as an effective practice that enables developers to check their teammates' code before integration into the codebase. To streamline the generation of review comments, various automated code review approaches have been proposed, where LLM-based methods have significantly advanced the capabilities of automated review generation. However, existing models primarily focus on general-purpose code review, their effectiveness in identifying and addressing security-related issues remains underexplored. Moreover, adapting existing code review approaches to target security issues faces substantial challenges, including data scarcity and inadequate evaluation metrics. To address these limitations, we propose SecureReviewer, a new approach designed for enhancing LLMs' ability to identify and resolve security-related issues during code review. Specifically, we first construct a dataset tailored for training and evaluating secure code review capabilities. Leveraging this dataset, we fine-tune LLMs to generate code review comments that can effectively identify security issues and provide fix suggestions with our proposed secure-aware fine-tuning strategy. To mitigate hallucination in LLMs and enhance the reliability of their outputs, we integrate the RAG technique, which grounds the generated comments in domain-specific security knowledge. Additionally, we introduce SecureBLEU, a new evaluation metric designed to assess the effectiveness of review comments in addressing security issues. Experimental results demonstrate that SecureReviewer outperforms state-of-the-art baselines in both security issue detection accuracy and the overall quality and practical utility of generated review comments.
comment: Accepted by ICSE 2026. Code and data: https://github.com/SIMIAO515/SecureReviewer
☆ Robust Graph Condensation via Classification Complexity Mitigation
Graph condensation (GC) has gained significant attention for its ability to synthesize smaller yet informative graphs. However, existing studies often overlook the robustness of GC in scenarios where the original graph is corrupted. In such cases, we observe that the performance of GC deteriorates significantly, while existing robust graph learning technologies offer only limited effectiveness. Through both empirical investigation and theoretical analysis, we reveal that GC is inherently an intrinsic-dimension-reducing process, synthesizing a condensed graph with lower classification complexity. Although this property is critical for effective GC performance, it remains highly vulnerable to adversarial perturbations. To tackle this vulnerability and improve GC robustness, we adopt the geometry perspective of graph data manifold and propose a novel Manifold-constrained Robust Graph Condensation framework named MRGC. Specifically, we introduce three graph data manifold learning modules that guide the condensed graph to lie within a smooth, low-dimensional manifold with minimal class ambiguity, thereby preserving the classification complexity reduction capability of GC and ensuring robust performance under universal adversarial attacks. Extensive experiments demonstrate the robustness of \ModelName\ across diverse attack scenarios.
☆ Personalized Treatment Outcome Prediction from Scarce Data via Dual-Channel Knowledge Distillation and Adaptive Fusion
Personalized treatment outcome prediction based on trial data for small-sample and rare patient groups is critical in precision medicine. However, the costly trial data limit the prediction performance. To address this issue, we propose a cross-fidelity knowledge distillation and adaptive fusion network (CFKD-AFN), which leverages abundant but low-fidelity simulation data to enhance predictions on scarce but high-fidelity trial data. CFKD-AFN incorporates a dual-channel knowledge distillation module to extract complementary knowledge from the low-fidelity model, along with an attention-guided fusion module to dynamically integrate multi-source information. Experiments on treatment outcome prediction for the chronic obstructive pulmonary disease demonstrates significant improvements of CFKD-AFN over state-of-the-art methods in prediction accuracy, ranging from 6.67\% to 74.55\%, and strong robustness to varying high-fidelity dataset sizes. Furthermore, we extend CFKD-AFN to an interpretable variant, enabling the exploration of latent medical semantics to support clinical decision-making.
☆ SSCL-BW: Sample-Specific Clean-Label Backdoor Watermarking for Dataset Ownership Verification
The rapid advancement of deep neural networks (DNNs) heavily relies on large-scale, high-quality datasets. However, unauthorized commercial use of these datasets severely violates the intellectual property rights of dataset owners. Existing backdoor-based dataset ownership verification methods suffer from inherent limitations: poison-label watermarks are easily detectable due to label inconsistencies, while clean-label watermarks face high technical complexity and failure on high-resolution images. Moreover, both approaches employ static watermark patterns that are vulnerable to detection and removal. To address these issues, this paper proposes a sample-specific clean-label backdoor watermarking (i.e., SSCL-BW). By training a U-Net-based watermarked sample generator, this method generates unique watermarks for each sample, fundamentally overcoming the vulnerability of static watermark patterns. The core innovation lies in designing a composite loss function with three components: target sample loss ensures watermark effectiveness, non-target sample loss guarantees trigger reliability, and perceptual similarity loss maintains visual imperceptibility. During ownership verification, black-box testing is employed to check whether suspicious models exhibit predefined backdoor behaviors. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed method and its robustness against potential watermark removal attacks.
comment: 8 pages,9 figures
☆ Chain-of-Thought Hijacking
Large reasoning models (LRMs) achieve higher task performance by allocating more inference-time compute, and prior works suggest this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models. The attack pads harmful requests with long sequences of harmless puzzle reasoning. Across HarmBench, CoT Hijacking reaches a 99%, 94%, 100%, and 94% attack success rate (ASR) on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively - far exceeding prior jailbreak methods for LRMs. To understand the effectiveness of our attack, we turn to a mechanistic analysis, which shows that mid layers encode the strength of safety checking, while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning - explicit CoT - can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.
☆ LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation
Recently text-to-video generation has made impressive progress in producing short, high-quality clips, but evaluating long-form outputs remains a major challenge especially when processing complex prompts. Existing benchmarks mostly rely on simplified prompts and focus on low-level metrics, overlooking fine-grained alignment with prompts and abstract dimensions such as narrative coherence and thematic expression. To address these gaps, we propose LoCoT2V-Bench, a benchmark specifically designed for long video generation (LVG) under complex input conditions. Based on various real-world videos, LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating elements like scene transitions and event dynamics. Moreover, it constructs a multi-dimensional evaluation framework that includes our newly proposed metrics such as event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD) that focuses on more abstract attributes like narrative flow, emotional response, and character development. Using this framework, we conduct a comprehensive evaluation of nine representative LVG models, finding that while current methods perform well on basic visual and temporal aspects, they struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence, etc. Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for evaluating long-form complex text-to-video generation and highlights critical directions for future method improvement.
☆ MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders
Artificial intelligence in healthcare requires models that are accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyzes, and automated neuron naming via the MedGEMMA foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations.
☆ Human-in-the-loop Online Rejection Sampling for Robotic Manipulation
Reinforcement learning (RL) is widely used to produce robust robotic manipulation policies, but fine-tuning vision-language-action (VLA) models with RL can be unstable due to inaccurate value estimates and sparse supervision at intermediate steps. In contrast, imitation learning (IL) is easy to train but often underperforms due to its offline nature. In this paper, we propose Hi-ORS, a simple yet effective post-training method that utilizes rejection sampling to achieve both training stability and high robustness. Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning, and adopts a reward-weighted supervised training objective to provide dense intermediate-step supervision. For systematic study, we develop an asynchronous inference-training framework that supports flexible online human-in-the-loop corrections, which serve as explicit guidance for learning error-recovery behaviors. Across three real-world tasks and two embodiments, Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training, outperforming RL and IL baselines by a substantial margin in both effectiveness and efficiency. Notably, the fine-tuned policy exhibits strong test-time scalability by reliably executing complex error-recovery behaviors to achieve better performance.
comment: 8 pages
☆ Autograder+: A Multi-Faceted AI Framework for Rich Pedagogical Feedback in Programming Education
The rapid growth of programming education has outpaced traditional assessment tools, leaving faculty with limited means to provide meaningful, scalable feedback. Conventional autograders, while efficient, act as black-box systems that simply return pass/fail results, offering little insight into student thinking or learning needs. Autograder+ is designed to shift autograding from a purely summative process to a formative learning experience. It introduces two key capabilities: automated feedback generation using a fine-tuned Large Language Model, and visualization of student code submissions to uncover learning patterns. The model is fine-tuned on curated student code and expert feedback to ensure pedagogically aligned, context-aware guidance. In evaluation across 600 student submissions from multiple programming tasks, the system produced feedback with strong semantic alignment to instructor comments. For visualization, contrastively learned code embeddings trained on 1,000 annotated submissions enable grouping solutions into meaningful clusters based on functionality and approach. The system also supports prompt-pooling, allowing instructors to guide feedback style through selected prompt templates. By integrating AI-driven feedback, semantic clustering, and interactive visualization, Autograder+ reduces instructor workload while supporting targeted instruction and promoting stronger learning outcomes.
☆ A Pragmatic View of AI Personhood
The emergence of agentic Artificial Intelligence (AI) is set to trigger a "Cambrian explosion" of new kinds of personhood. This paper proposes a pragmatic framework for navigating this diversification by treating personhood not as a metaphysical property to be discovered, but as a flexible bundle of obligations (rights and responsibilities) that societies confer upon entities for a variety of reasons, especially to solve concrete governance problems. We argue that this traditional bundle can be unbundled, creating bespoke solutions for different contexts. This will allow for the creation of practical tools -- such as facilitating AI contracting by creating a target "individual" that can be sanctioned -- without needing to resolve intractable debates about an AI's consciousness or rationality. We explore how individuals fit in to social roles and discuss the use of decentralized digital identity technology, examining both "personhood as a problem", where design choices can create "dark patterns" that exploit human social heuristics, and "personhood as a solution", where conferring a bundle of obligations is necessary to ensure accountability or prevent conflict. By rejecting foundationalist quests for a single, essential definition of personhood, this paper offers a more pragmatic and flexible way to think about integrating AI agents into our society.
comment: 40 pages
☆ SPG-CDENet: Spatial Prior-Guided Cross Dual Encoder Network for Multi-Organ Segmentation
Multi-organ segmentation is a critical task in computer-aided diagnosis. While recent deep learning methods have achieved remarkable success in image segmentation, huge variations in organ size and shape challenge their effectiveness in multi-organ segmentation. To address these challenges, we propose a Spatial Prior-Guided Cross Dual Encoder Network (SPG-CDENet), a novel two-stage segmentation paradigm designed to improve multi-organ segmentation accuracy. Our SPG-CDENet consists of two key components: a spatial prior network and a cross dual encoder network. The prior network generates coarse localization maps that delineate the approximate ROI, serving as spatial guidance for the dual encoder network. The cross dual encoder network comprises four essential components: a global encoder, a local encoder, a symmetric cross-attention module, and a flow-based decoder. The global encoder captures global semantic features from the entire image, while the local encoder focuses on features from the prior network. To enhance the interaction between the global and local encoders, a symmetric cross-attention module is proposed across all layers of the encoders to fuse and refine features. Furthermore, the flow-based decoder directly propagates high-level semantic features from the final encoder layer to all decoder layers, maximizing feature preservation and utilization. Extensive qualitative and quantitative experiments on two public datasets demonstrate the superior performance of SPG-CDENet compared to existing segmentation methods. Furthermore, ablation studies further validate the effectiveness of the proposed modules in improving segmentation accuracy.
☆ Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings
The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmarking items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks (`cold-start'), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we challenge this paradigm and propose a item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves, rather than on model-specific failure patterns. We instantiate this item-centric efficient benchmarking approach via a novel method, Scales++, where data selection is based on the cognitive demands of the benchmark samples. Empirically, we show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. On the Open LLM Leaderboard, using just a 0.5\% data subset, we predict full benchmark scores with a 2.9% mean absolute error. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation, while also providing better cold-start performance and more interpretable benchmarking.
comment: 9 pages, 2 figures, 4 tables
☆ AI Mathematician as a Partner in Advancing Mathematical Discovery - A Case Study in Homogenization Theory
Artificial intelligence (AI) has demonstrated impressive progress in mathematical reasoning, yet its integration into the practice of mathematical research remains limited. In this study, we investigate how the AI Mathematician (AIM) system can operate as a research partner rather than a mere problem solver. Focusing on a challenging problem in homogenization theory, we analyze the autonomous reasoning trajectories of AIM and incorporate targeted human interventions to structure the discovery process. Through iterative decomposition of the problem into tractable subgoals, selection of appropriate analytical methods, and validation of intermediate results, we reveal how human intuition and machine computation can complement one another. This collaborative paradigm enhances the reliability, transparency, and interpretability of the resulting proofs, while retaining human oversight for formal rigor and correctness. The approach leads to a complete and verifiable proof, and more broadly, demonstrates how systematic human-AI co-reasoning can advance the frontier of mathematical discovery.
comment: 52 pages, 1 figure
☆ BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning
Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce \textbf{BOTS}, a unified framework for \textbf{B}ayesian \textbf{O}nline \textbf{T}ask \textbf{S}election in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates \emph{explicit evidence} from direct evaluations of selected tasks and \emph{implicit evidence} inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of unevaluated tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT.
☆ The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition. However, forming optimal teams is a significant challenge, as the inherent opacity of most models obscures the internal characteristics necessary for effective collaboration. In this paper, we propose an interaction-centric framework for automatic team composition that does not require any prior knowledge including their internal architectures, training data, or task performances. Our method constructs a "language model graph" that maps relationships between models from the semantic coherence of pairwise conversations, and then applies community detection to identify synergistic model clusters. Our experiments with diverse LLMs demonstrate that the proposed method discovers functionally coherent groups that reflect their latent specializations. Priming conversations with specific topics identified synergistic teams which outperform random baselines on downstream benchmarks and achieve comparable accuracy to that of manually-curated teams based on known model specializations. Our findings provide a new basis for the automated design of collaborative multi-agent LLM teams.
☆ Reinforcement Learning for Pollution Detection in a Randomized, Sparse and Nonstationary Environment with an Autonomous Underwater Vehicle
Reinforcement learning (RL) algorithms are designed to optimize problem-solving by learning actions that maximize rewards, a task that becomes particularly challenging in random and nonstationary environments. Even advanced RL algorithms are often limited in their ability to solve problems in these conditions. In applications such as searching for underwater pollution clouds with autonomous underwater vehicles (AUVs), RL algorithms must navigate reward-sparse environments, where actions frequently result in a zero reward. This paper aims to address these challenges by revisiting and modifying classical RL approaches to efficiently operate in sparse, randomized, and nonstationary environments. We systematically study a large number of modifications, including hierarchical algorithm changes, multigoal learning, and the integration of a location memory as an external output filter to prevent state revisits. Our results demonstrate that a modified Monte Carlo-based approach significantly outperforms traditional Q-learning and two exhaustive search patterns, illustrating its potential in adapting RL to complex environments. These findings suggest that reinforcement learning approaches can be effectively adapted for use in random, nonstationary, and reward-sparse environments.
☆ Discovering State Equivalences in UCT Search Trees By Action Pruning
One approach to enhance Monte Carlo Tree Search (MCTS) is to improve its sample efficiency by grouping/abstracting states or state-action pairs and sharing statistics within a group. Though state-action pair abstractions are mostly easy to find in algorithms such as On the Go Abstractions in Upper Confidence bounds applied to Trees (OGA-UCT), nearly no state abstractions are found in either noisy or large action space settings due to constraining conditions. We provide theoretical and empirical evidence for this claim, and we slightly alleviate this state abstraction problem by proposing a weaker state abstraction condition that trades a minor loss in accuracy for finding many more abstractions. We name this technique Ideal Pruning Abstractions in UCT (IPA-UCT), which outperforms OGA-UCT (and any of its derivatives) across a large range of test domains and iteration budgets as experimentally validated. IPA-UCT uses a different abstraction framework from Abstraction of State-Action Pairs (ASAP) which is the one used by OGA-UCT, which we name IPA. Furthermore, we show that both IPA and ASAP are special cases of a more general framework that we call p-ASAP which itself is a special case of the ASASAP framework.
☆ MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data
Health-related misinformation is very prevalent and potentially harmful. It is difficult to identify, especially when claims distort or misinterpret scientific findings. We investigate the impact of synthetic data generation and lightweight fine-tuning techniques on the ability of large language models (LLMs) to recognize fallacious arguments using the MISSCI dataset and framework. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples, which are then used to fine-tune an LLM model. Our results show substantial accuracy gains with fine-tuned models compared to vanilla baselines. For instance, the LLaMA 3.1 8B fine-tuned model achieved an over 35% F1-score absolute improvement on the MISSCI test split over its vanilla baseline. We demonstrate that introducing synthetic fallacy data to augment limited annotated resources can significantly enhance zero-shot LLM classification performance on real-world scientific misinformation tasks, even with limited computational resources. The code and synthetic dataset are available on https://github.com/mxpoliakov/MisSynth.
☆ Linear Causal Discovery with Interventional Constraints
Incorporating causal knowledge and mechanisms is essential for refining causal models and improving downstream tasks such as designing new treatments. In this paper, we introduce a novel concept in causal discovery, termed interventional constraints, which differs fundamentally from interventional data. While interventional data require direct perturbations of variables, interventional constraints encode high-level causal knowledge in the form of inequality constraints on causal effects. For instance, in the Sachs dataset (Sachs et al.\ 2005), Akt has been shown to be activated by PIP3, meaning PIP3 exerts a positive causal effect on Akt. Existing causal discovery methods allow enforcing structural constraints (for example, requiring a causal path from PIP3 to Akt), but they may still produce incorrect causal conclusions such as learning that "PIP3 inhibits Akt". Interventional constraints bridge this gap by explicitly constraining the total causal effect between variable pairs, ensuring learned models respect known causal influences. To formalize interventional constraints, we propose a metric to quantify total causal effects for linear causal models and formulate the problem as a constrained optimization task, solved using a two-stage constrained optimization method. We evaluate our approach on real-world datasets and demonstrate that integrating interventional constraints not only improves model accuracy and ensures consistency with established findings, making models more explainable, but also facilitates the discovery of new causal relationships that would otherwise be costly to identify.
☆ GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model? ICLR 2026
Image super-resolution(SR) is fundamental to many vision system-from surveillance and autonomy to document analysis and retail analytics-because recovering high-frequency details, especially scene-text, enables reliable downstream perception. Scene-text, i.e., text embedded in natural images such as signs, product labels, and storefronts, often carries the most actionable information; when characters are blurred or hallucinated, optical character recognition(OCR) and subsequent decisions fail even if the rest of the image appears sharp. Yet previous SR research has often been tuned to distortion (PSNR/SSIM) or learned perceptual metrics (LIPIS, MANIQA, CLIP-IQA, MUSIQ) that are largely insensitive to character-level errors. Furthermore, studies that do address text SR often focus on simplified benchmarks with isolated characters, overlooking the challenges of text within complex natural scenes. As a result, scene-text is effectively treated as generic texture. For SR to be effective in practical deployments, it is therefore essential to explicitly optimize for both text legibility and perceptual quality. We present GLYPH-SR, a vision-language-guided diffusion framework that aims to achieve both objectives jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet(TS-ControlNet) guided by OCR data, and a ping-pong scheduler that alternates between text- and scene-centric guidance. To enable targeted text restoration, we train these components on a synthetic corpus while keeping the main SR branch frozen. Across SVT, SCUT-CTW1500, and CUTE80 at x4, and x8, GLYPH-SR improves OCR F1 by up to +15.18 percentage points over diffusion/GAN baseline (SVT x8, OpenOCR) while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ. GLYPH-SR is designed to satisfy both objectives simultaneously-high readability and high visual realism-delivering SR that looks right and reds right.
comment: 11 pages, 6 figures. Includes supplementary material. Under review as a conference paper at ICLR 2026
☆ From Amateur to Master: Infusing Knowledge into LLMs via Automated Curriculum Learning
Large Language Models (LLMs) excel at general tasks but underperform in specialized domains like economics and psychology, which require deep, principled understanding. To address this, we introduce ACER (Automated Curriculum-Enhanced Regimen) that transforms generalist models into domain experts without sacrificing their broad capabilities. ACER first synthesizes a comprehensive, textbook-style curriculum by generating a table of contents for a subject and then creating question-answer (QA) pairs guided by Bloom's taxonomy. This ensures systematic topic coverage and progressively increasing difficulty. The resulting synthetic corpus is used for continual pretraining with an interleaved curriculum schedule, aligning learning across both content and cognitive dimensions. Experiments with Llama 3.2 (1B and 3B) show significant gains in specialized MMLU subsets. In challenging domains like microeconomics, where baselines struggle, ACER boosts accuracy by 5 percentage points. Across all target domains, we observe a consistent macro-average improvement of 3 percentage points. Notably, ACER not only prevents catastrophic forgetting but also facilitates positive cross-domain knowledge transfer, improving performance on non-target domains by 0.7 points. Beyond MMLU, ACER enhances performance on knowledge-intensive benchmarks like ARC and GPQA by over 2 absolute points, while maintaining stable performance on general reasoning tasks. Our results demonstrate that ACER offers a scalable and effective recipe for closing critical domain gaps in LLMs.
☆ Posterior Sampling by Combining Diffusion Models with Annealed Langevin Dynamics NeurIPS 2025
Given a noisy linear measurement $y = Ax + \xi$ of a distribution $p(x)$, and a good approximation to the prior $p(x)$, when can we sample from the posterior $p(x \mid y)$? Posterior sampling provides an accurate and fair framework for tasks such as inpainting, deblurring, and MRI reconstruction, and several heuristics attempt to approximate it. Unfortunately, approximate posterior sampling is computationally intractable in general. To sidestep this hardness, we focus on (local or global) log-concave distributions $p(x)$. In this regime, Langevin dynamics yields posterior samples when the exact scores of $p(x)$ are available, but it is brittle to score--estimation error, requiring an MGF bound (sub-exponential error). By contrast, in the unconditional setting, diffusion models succeed with only an $L^2$ bound on the score error. We prove that combining diffusion models with an annealed variant of Langevin dynamics achieves conditional sampling in polynomial time using merely an $L^4$ bound on the score error.
comment: NeurIPS 2025
☆ GraphCompliance: Aligning Policy and Context Graphs for LLM-Based Regulatory Compliance
Compliance at web scale poses practical challenges: each request may require a regulatory assessment. Regulatory texts (e.g., the General Data Protection Regulation, GDPR) are cross-referential and normative, while runtime contexts are expressed in unstructured natural language. This setting motivates us to align semantic information in unstructured text with the structured, normative elements of regulations. To this end, we introduce GraphCompliance, a framework that represents regulatory texts as a Policy Graph and runtime contexts as a Context Graph, and aligns them. In this formulation, the policy graph encodes normative structure and cross-references, whereas the context graph formalizes events as subject-action-object (SAO) and entity-relation triples. This alignment anchors the reasoning of a judge large language model (LLM) in structured information and helps reduce the burden of regulatory interpretation and event parsing, enabling a focus on the core reasoning step. In experiments on 300 GDPR-derived real-world scenarios spanning five evaluation tasks, GraphCompliance yields 4.1-7.2 percentage points (pp) higher micro-F1 than LLM-only and RAG baselines, with fewer under- and over-predictions, resulting in higher recall and lower false positive rates. Ablation studies indicate contributions from each graph component, suggesting that structured representations and a judge LLM are complementary for normative reasoning.
comment: Under review at The Web Conference 2026 (Semantics & Knowledge track). Code will be released upon acceptance. This arXiv v1 contains no repository links to preserve double-blind review
☆ Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime
Adam [Kingma and Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with $\ell_\infty$-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and we show that its bias can deviate from the full-batch behavior. To illustrate this, we construct a class of structured datasets where incremental Adam provably converges to the $\ell_2$-max-margin classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch Adam. For general datasets, we develop a proxy algorithm that captures the limiting behavior of incremental Adam as $\beta_2 \to 1$ and we characterize its convergence direction via a data-dependent dual fixed-point formulation. Finally, we prove that, unlike Adam, Signum [Bernstein et al., 2018] converges to the $\ell_\infty$-max-margin classifier for any batch size by taking $\beta$ close enough to 1. Overall, our results highlight that the implicit bias of Adam crucially depends on both the batching scheme and the dataset, while Signum remains invariant.
comment: 50 pages
☆ Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens
Contrastive Language-Image Pre-training (CLIP) delivers strong cross modal generalization by aligning images and texts in a shared embedding space, yet it persistently fails at compositional reasoning over objects, attributes, and relations often behaving like a bag-of-words matcher. Prior causal accounts typically model text as a single vector, obscuring token-level structure and leaving core phenomena-such as prompt sensitivity and failures on hard negatives unexplained. We address this gap with a token-aware causal representation learning (CRL) framework grounded in a sequential, language-token SCM. Our theory extends block identifiability to tokenized text, proving that CLIP's contrastive objective can recover the modal-invariant latent variable under both sentence-level and token-level SCMs. Crucially, token granularity yields the first principled explanation of CLIP's compositional brittleness: composition nonidentifiability. We show the existence of pseudo-optimal text encoders that achieve perfect modal-invariant alignment yet are provably insensitive to SWAP, REPLACE, and ADD operations over atomic concepts, thereby failing to distinguish correct captions from hard negatives despite optimizing the same training objective as true-optimal encoders. The analysis further links language-side nonidentifiability to visual-side failures via the modality gap and shows how iterated composition operators compound hardness, motivating improved negative mining strategies.
☆ Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games
OpenAI's ChatGPT Atlas introduces new capabilities for web interaction, enabling the model to analyze webpages, process user intents, and execute cursor and keyboard inputs directly within the browser. While its capacity for information retrieval tasks has been demonstrated, its performance in dynamic, interactive environments remains less explored. In this study, we conduct an early evaluation of Atlas's web interaction capabilities using browser-based games as test scenarios, including Google's T-Rex Runner, Sudoku, Flappy Bird, and Stein.world. We employ in-game performance scores as quantitative metrics to assess performance across different task types. Our results show that Atlas performs strongly in logical reasoning tasks like Sudoku, completing puzzles significantly faster than human baselines, but struggles substantially in real-time games requiring precise timing and motor control, often failing to progress beyond initial obstacles. These findings suggest that while Atlas demonstrates capable analytical processing, there remain notable limitations in dynamic web environments requiring real-time interaction. The website of our project can be found at https://atlas-game-eval.github.io.
☆ Unravelling the Mechanisms of Manipulating Numbers in Language Models
Recent work has shown that different large language models (LLMs) converge to similar and accurate input embedding representations for numbers. These findings conflict with the documented propensity of LLMs to produce erroneous outputs when dealing with numeric information. In this work, we aim to explain this conflict by exploring how language models manipulate numbers and quantify the lower bounds of accuracy of these mechanisms. We find that despite surfacing errors, different language models learn interchangeable representations of numbers that are systematic, highly accurate and universal across their hidden states and the types of input contexts. This allows us to create universal probes for each LLM and to trace information -- including the causes of output errors -- to specific layers. Our results lay a fundamental understanding of how pre-trained LLMs manipulate numbers and outline the potential of more accurate probing techniques in addressed refinements of LLMs' architectures.
☆ Distributional Multi-objective Black-box Optimization for Diffusion-model Inference-time Multi-Target Generation
Diffusion models have been successful in learning complex data distributions. This capability has driven their application to high-dimensional multi-objective black-box optimization problem. Existing approaches often employ an external optimization loop, such as an evolutionary algorithm, to the diffusion model. However, these approaches treat the diffusion model as a black-box refiner, which overlooks the internal distribution transition of the diffusion generation process, limiting their efficiency. To address these challenges, we propose the Inference-time Multi-target Generation (IMG) algorithm, which optimizes the diffusion process at inference-time to generate samples that simultaneously satisfy multiple objectives. Specifically, our IMG performs weighted resampling during the diffusion generation process according to the expected aggregated multi-objective values. This weighted resampling strategy ensures the diffusion-generated samples are distributed according to our desired multi-target Boltzmann distribution. We further derive that the multi-target Boltzmann distribution has an interesting log-likelihood interpretation, where it is the optimal solution to the distributional multi-objective optimization problem. We implemented IMG for a multi-objective molecule generation task. Experiments show that IMG, requiring only a single generation pass, achieves a significantly higher hypervolume than baseline optimization algorithms that often require hundreds of diffusion generations. Notably, our algorithm can be viewed as an optimized diffusion process and can be integrated into existing methods to further improve their performance.
☆ A Research Roadmap for Augmenting Software Engineering Processes and Software Products with Generative AI
Generative AI (GenAI) is rapidly transforming software engineering (SE) practices, influencing how SE processes are executed, as well as how software systems are developed, operated, and evolved. This paper applies design science research to build a roadmap for GenAI-augmented SE. The process consists of three cycles that incrementally integrate multiple sources of evidence, including collaborative discussions from the FSE 2025 "Software Engineering 2030" workshop, rapid literature reviews, and external feedback sessions involving peers. McLuhan's tetrads were used as a conceptual instrument to systematically capture the transforming effects of GenAI on SE processes and software products.The resulting roadmap identifies four fundamental forms of GenAI augmentation in SE and systematically characterizes their related research challenges and opportunities. These insights are then consolidated into a set of future research directions. By grounding the roadmap in a rigorous multi-cycle process and cross-validating it among independent author teams and peers, the study provides a transparent and reproducible foundation for analyzing how GenAI affects SE processes, methods and tools, and for framing future research within this rapidly evolving area. Based on these findings, the article finally makes ten predictions for SE in the year 2030.
☆ Graph-Enhanced Policy Optimization in LLM Agent Training
Group based reinforcement learning (RL) has shown impressive results on complex reasoning and mathematical tasks. Yet, when applied to train multi-turn, interactive LLM agents, these methods often suffer from structural blindness-the inability to exploit the underlying connectivity of the environment. This manifests in three critical challenges: (1) inefficient, unguided exploration, (2) imprecise credit assignment due to overlooking pivotal states, and (3) myopic planning caused by static reward discounting. We address these issues with Graph-Enhanced Policy Optimization (GEPO), which dynamically constructs a state-transition graph from agent experience and employs graph-theoretic centrality to provide three synergistic learning signals: (1)structured intrinsic rewards that guide exploration toward high-impact states, (2) a graph-enhanced advantage function for topology-aware credit assignment, and (3) a dynamic discount factor adapted to each state's strategic value. On the ALFWorld, WebShop, and a proprietary Workbench benchmarks, GEPO demonstrates strong performance, achieving absolute success rate gains of +4.1%, +5.3%, and +10.9% over competitive baselines. These results highlight that explicitly modeling environmental structure is a robust, generalizable strategy for advancing LLM agent training.
comment: Under review as a conference paper
☆ Angular Steering: Behavior Control via Rotation in Activation Space NeurIPS 2025
Controlling specific behaviors in large language models while preserving their general capabilities is a central challenge for safe and reliable artificial intelligence deployment. Current steering methods, such as vector addition and directional ablation, are constrained within a two-dimensional subspace defined by the activation and feature direction, making them sensitive to chosen parameters and potentially affecting unrelated features due to unintended interactions in activation space. We introduce Angular Steering, a novel and flexible method for behavior modulation that operates by rotating activations within a fixed two-dimensional subspace. By formulating steering as a geometric rotation toward or away from a target behavior direction, Angular Steering provides continuous, fine-grained control over behaviors such as refusal and compliance. We demonstrate this method using refusal steering emotion steering as use cases. Additionally, we propose Adaptive Angular Steering, a selective variant that rotates only activations aligned with the target feature, further enhancing stability and coherence. Angular Steering generalizes existing addition and orthogonalization techniques under a unified geometric rotation framework, simplifying parameter selection and maintaining model stability across a broader range of adjustments. Experiments across multiple model families and sizes show that Angular Steering achieves robust behavioral control while maintaining general language modeling performance, underscoring its flexibility, generalization, and robustness compared to prior approaches. Code and artifacts are available at https://github.com/lone17/angular-steering/.
comment: NeurIPS 2025 (Spotlight)
☆ Retrieval Augmented Generation-Enhanced Distributed LLM Agents for Generalizable Traffic Signal Control with Emergency Vehicles
With increasing urban traffic complexity, Traffic Signal Control (TSC) is essential for optimizing traffic flow and improving road safety. Large Language Models (LLMs) emerge as promising approaches for TSC. However, they are prone to hallucinations in emergencies, leading to unreliable decisions that may cause substantial delays for emergency vehicles. Moreover, diverse intersection types present substantial challenges for traffic state encoding and cross-intersection training, limiting generalization across heterogeneous intersections. Therefore, this paper proposes Retrieval Augmented Generation (RAG)-enhanced distributed LLM agents with Emergency response for Generalizable TSC (REG-TSC). Firstly, this paper presents an emergency-aware reasoning framework, which dynamically adjusts reasoning depth based on the emergency scenario and is equipped with a novel Reviewer-based Emergency RAG (RERAG) to distill specific knowledge and guidance from historical cases, enhancing the reliability and rationality of agents' emergency decisions. Secondly, this paper designs a type-agnostic traffic representation and proposes a Reward-guided Reinforced Refinement (R3) for heterogeneous intersections. R3 adaptively samples training experience from diverse intersections with environment feedback-based priority and fine-tunes LLM agents with a designed reward-weighted likelihood loss, guiding REG-TSC toward high-reward policies across heterogeneous intersections. On three real-world road networks with 17 to 177 heterogeneous intersections, extensive experiments show that REG-TSC reduces travel time by 42.00%, queue length by 62.31%, and emergency vehicle waiting time by 83.16%, outperforming other state-of-the-art methods.
☆ Questionnaire meets LLM: A Benchmark and Empirical Study of Structural Skills for Understanding Questions and Responses
Millions of people take surveys every day, from market polls and academic studies to medical questionnaires and customer feedback forms. These datasets capture valuable insights, but their scale and structure present a unique challenge for large language models (LLMs), which otherwise excel at few-shot reasoning over open-ended text. Yet, their ability to process questionnaire data or lists of questions crossed with hundreds of respondent rows remains underexplored. Current retrieval and survey analysis tools (e.g., Qualtrics, SPSS, REDCap) are typically designed for humans in the workflow, limiting such data integration with LLM and AI-empowered automation. This gap leaves scientists, surveyors, and everyday users without evidence-based guidance on how to best represent questionnaires for LLM consumption. We address this by introducing QASU (Questionnaire Analysis and Structural Understanding), a benchmark that probes six structural skills, including answer lookup, respondent count, and multi-hop inference, across six serialization formats and multiple prompt strategies. Experiments on contemporary LLMs show that choosing an effective format and prompt combination can improve accuracy by up to 8.8% points compared to suboptimal formats. For specific tasks, carefully adding a lightweight structural hint through self-augmented prompting can yield further improvements of 3-4% points on average. By systematically isolating format and prompting effects, our open source benchmark offers a simple yet versatile foundation for advancing both research and real-world practice in LLM-based questionnaire analysis.
comment: 14 pages, 3 figures, 8 tables
☆ MPRU: Modular Projection-Redistribution Unlearning as Output Filter for Classification Pipelines
As a new and promising approach, existing machine unlearning (MU) works typically emphasize theoretical formulations or optimization objectives to achieve knowledge removal. However, when deployed in real-world scenarios, such solutions typically face scalability issues and have to address practical requirements such as full access to original datasets and model. In contrast to the existing approaches, we regard classification training as a sequential process where classes are learned sequentially, which we call \emph{inductive approach}. Unlearning can then be done by reversing the last training sequence. This is implemented by appending a projection-redistribution layer in the end of the model. Such an approach does not require full access to the original dataset or the model, addressing the challenges of existing methods. This enables modular and model-agnostic deployment as an output filter into existing classification pipelines with minimal alterations. We conducted multiple experiments across multiple datasets including image (CIFAR-10/100 using CNN-based model) and tabular datasets (Covertype using tree-based model). Experiment results show consistently similar output to a fully retrained model with a high computational cost reduction. This demonstrates the applicability, scalability, and system compatibility of our solution while maintaining the performance of the output in a more practical setting.
comment: 10 pages, 6 figures
☆ Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space
Test-time alignment of large language models (LLMs) attracts attention because fine-tuning LLMs requires high computational costs. In this paper, we propose a new test-time alignment method called adaptive importance sampling on pre-logits (AISP) on the basis of the sampling-based model predictive control with the stochastic control input. AISP applies the Gaussian perturbation into pre-logits, which are outputs of the penultimate layer, so as to maximize expected rewards with respect to the mean of the perturbation. We demonstrate that the optimal mean is obtained by importance sampling with sampled rewards. AISP outperforms best-of-n sampling in terms of rewards over the number of used samples and achieves higher rewards than other reward-based test-time alignment methods.
comment: 21 pages, 8 figures
☆ Hybrid LLM and Higher-Order Quantum Approximate Optimization for CSA Collateral Management
We address finance-native collateral optimization under ISDA Credit Support Annexes (CSAs), where integer lots, Schedule A haircuts, RA/MTA gating, and issuer/currency/class caps create rugged, legally bounded search spaces. We introduce a certifiable hybrid pipeline purpose-built for this domain: (i) an evidence-gated LLM that extracts CSA terms to a normalized JSON (abstain-by-default, span-cited); (ii) a quantum-inspired explorer that interleaves simulated annealing with micro higher order QAOA (HO-QAOA) on binding sub-QUBOs (subset size n <= 16, order k <= 4) to coordinate multi-asset moves across caps and RA-induced discreteness; (iii) a weighted risk-aware objective (Movement, CVaR, funding-priced overshoot) with an explicit coverage window U <= Reff+B; and (iv) CP-SAT as single arbiter to certify feasibility and gaps, including a U-cap pre-check that reports the minimal feasible buffer B*. Encoding caps/rounding as higher-order terms lets HO-QAOA target the domain couplings that defeat local swaps. On government bond datasets and multi-CSA inputs, the hybrid improves a strong classical baseline (BL-3) by 9.1%, 9.6%, and 10.7% across representative harnesses, delivering better cost-movement-tail frontiers under governance settings. We release governance grade artifacts-span citations, valuation matrix audit, weight provenance, QUBO manifests, and CP-SAT traces-to make results auditable and reproducible.
comment: 6 pages
☆ Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning
Retrieval-augmented generation (RAG) has emerged as a leading approach to reducing hallucinations in large language models (LLMs). Current RAG evaluation benchmarks primarily focus on what we call local RAG: retrieving relevant chunks from a small subset of documents to answer queries that require only localized understanding within specific text chunks. However, many real-world applications require a fundamentally different capability -- global RAG -- which involves aggregating and analyzing information across entire document collections to derive corpus-level insights (for example, "What are the top 10 most cited papers in 2023?"). In this paper, we introduce GlobalQA -- the first benchmark specifically designed to evaluate global RAG capabilities, covering four core task types: counting, extremum queries, sorting, and top-k extraction. Through systematic evaluation across different models and baselines, we find that existing RAG methods perform poorly on global tasks, with the strongest baseline achieving only 1.51 F1 score. To address these challenges, we propose GlobalRAG, a multi-tool collaborative framework that preserves structural coherence through chunk-level retrieval, incorporates LLM-driven intelligent filters to eliminate noisy documents, and integrates aggregation modules for precise symbolic computation. On the Qwen2.5-14B model, GlobalRAG achieves 6.63 F1 compared to the strongest baseline's 1.51 F1, validating the effectiveness of our method.
☆ What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over certain attributes (e.g., length or sycophancy), automatically extracting relevant features without pre-specifying hypotheses remains challenging. We introduce What's In My Human Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders. WIMHF characterizes both (1) the preferences a dataset is capable of measuring and (2) the preferences that the annotators actually express. Across 7 datasets, WIMHF identifies a small number of human-interpretable features that account for the majority of the preference prediction signal achieved by black-box models. These features reveal a wide diversity in what humans prefer, and the role of dataset-level context: for example, users on Reddit prefer informality and jokes, while annotators in HH-RLHF and PRISM disprefer them. WIMHF also surfaces potentially unsafe preferences, such as that LMArena users tend to vote against refusals, often in favor of toxic content. The learned features enable effective data curation: re-labeling the harmful examples in Arena yields large safety gains (+37%) with no cost to general performance. They also allow fine-grained personalization: on the Community Alignment dataset, we learn annotator-specific weights over subjective features that improve preference prediction. WIMHF provides a human-centered analysis method for practitioners to better understand and use preference data.
comment: Code: https://github.com/rmovva/wimhf
☆ Don't Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation NeurIPS 2025
While diffusion language models (DLMs) enable fine-grained refinement, their practical controllability remains fragile. We identify and formally characterize a central failure mode called update forgetting, in which uniform and context agnostic updates induce token level fluctuations across timesteps, erasing earlier semantic edits and disrupting the cumulative refinement process, thereby degrading fluency and coherence. As this failure originates in uniform and context agnostic updates, effective control demands explicit token ordering. We propose Token Timestep Allocation (TTA), which realizes soft and semantic token ordering via per token timestep schedules: critical tokens are frozen early, while uncertain tokens receive continued refinement. This timestep based ordering can be instantiated as either a fixed policy or an adaptive policy driven by task signals, thereby supporting a broad spectrum of refinement strategies. Because it operates purely at inference time, it applies uniformly across various DLMs and naturally extends to diverse supervision sources. Empirically, TTA improves controllability and fluency: on sentiment control, it yields more than 20 percent higher accuracy and nearly halves perplexity using less than one fifth the steps; in detoxification, it lowers maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0). Together, these results demonstrate that softened ordering via timestep allocation is the critical lever for mitigating update forgetting and achieving stable and controllable diffusion text generation.
comment: Accepted in NeurIPS 2025
☆ Predicting All-Cause Hospital Readmissions from Medical Claims Data of Hospitalised Patients
Reducing preventable hospital readmissions is a national priority for payers, providers, and policymakers seeking to improve health care and lower costs. The rate of readmission is being used as a benchmark to determine the quality of healthcare provided by the hospitals. In thisproject, we have used machine learning techniques like Logistic Regression, Random Forest and Support Vector Machines to analyze the health claims data and identify demographic and medical factors that play a crucial role in predicting all-cause readmissions. As the health claims data is high dimensional, we have used Principal Component Analysis as a dimension reduction technique and used the results for building regression models. We compared and evaluated these models based on the Area Under Curve (AUC) metric. Random Forest model gave the highest performance followed by Logistic Regression and Support Vector Machine models. These models can be used to identify the crucial factors causing readmissions and help identify patients to focus on to reduce the chances of readmission, ultimately bringing down the cost and increasing the quality of healthcare provided to the patients.
comment: NCMLAI 2018
☆ ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts NeurIPS 2025
Dataset bias, where data points are skewed to certain concepts, is ubiquitous in machine learning datasets. Yet, systematically identifying these biases is challenging without costly, fine-grained attribute annotations. We present ConceptScope, a scalable and automated framework for analyzing visual datasets by discovering and quantifying human-interpretable concepts using Sparse Autoencoders trained on representations from vision foundation models. ConceptScope categorizes concepts into target, context, and bias types based on their semantic relevance and statistical correlation to class labels, enabling class-level dataset characterization, bias identification, and robustness evaluation through concept-based subgrouping. We validate that ConceptScope captures a wide range of visual concepts, including objects, textures, backgrounds, facial attributes, emotions, and actions, through comparisons with annotated datasets. Furthermore, we show that concept activations produce spatial attributions that align with semantically meaningful image regions. ConceptScope reliably detects known biases (e.g., background bias in Waterbirds) and uncovers previously unannotated ones (e.g, co-occurring objects in ImageNet), offering a practical tool for dataset auditing and model diagnostics.
comment: Published in the Thirty-Ninth Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ Accumulative SGD Influence Estimation for Data Attribution
Modern data-centric AI needs precise per-sample influence. Standard SGD-IE approximates leave-one-out effects by summing per-epoch surrogates and ignores cross-epoch compounding, which misranks critical examples. We propose ACC-SGD-IE, a trajectory-aware estimator that propagates the leave-one-out perturbation across training and updates an accumulative influence state at each step. In smooth strongly convex settings it achieves geometric error contraction and, in smooth non-convex regimes, it tightens error bounds; larger mini-batches further reduce constants. Empirically, on Adult, 20 Newsgroups, and MNIST under clean and corrupted data and both convex and non-convex training, ACC-SGD-IE yields more accurate influence estimates, especially over long epochs. For downstream data cleansing it more reliably flags noisy samples, producing models trained on ACC-SGD-IE cleaned data that outperform those cleaned with SGD-IE.
☆ Linking Heterogeneous Data with Coordinated Agent Flows for Social Media Analysis
Social media platforms generate massive volumes of heterogeneous data, capturing user behaviors, textual content, temporal dynamics, and network structures. Analyzing such data is crucial for understanding phenomena such as opinion dynamics, community formation, and information diffusion. However, discovering insights from this complex landscape is exploratory, conceptually challenging, and requires expertise in social media mining and visualization. Existing automated approaches, though increasingly leveraging large language models (LLMs), remain largely confined to structured tabular data and cannot adequately address the heterogeneity of social media analysis. We present SIA (Social Insight Agents), an LLM agent system that links heterogeneous multi-modal data -- including raw inputs (e.g., text, network, and behavioral data), intermediate outputs, mined analytical results, and visualization artifacts -- through coordinated agent flows. Guided by a bottom-up taxonomy that connects insight types with suitable mining and visualization techniques, SIA enables agents to plan and execute coherent analysis strategies. To ensure multi-modal integration, it incorporates a data coordinator that unifies tabular, textual, and network data into a consistent flow. Its interactive interface provides a transparent workflow where users can trace, validate, and refine the agent's reasoning, supporting both adaptability and trustworthiness. Through expert-centered case studies and quantitative evaluation, we show that SIA effectively discovers diverse and meaningful insights from social media while supporting human-agent collaboration in complex analytical tasks.
☆ One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning
Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique tasks that supports reinforcement learning with verifiable feedback. To evaluate tool-use RMs, we also introduce TRBench$_{BFCL}$, a benchmark built on the agentic evaluation suite BFCL. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward judgments. Beyond training objectives, ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling and reducing output token usage by over 66%. We release data and model checkpoints to facilitate future research.
☆ Learning to Manage Investment Portfolios beyond Simple Utility Functions
While investment funds publicly disclose their objectives in broad terms, their managers optimize for complex combinations of competing goals that go beyond simple risk-return trade-offs. Traditional approaches attempt to model this through multi-objective utility functions, but face fundamental challenges in specification and parameterization. We propose a generative framework that learns latent representations of fund manager strategies without requiring explicit utility specification. Our approach directly models the conditional probability of a fund's portfolio weights, given stock characteristics, historical returns, previous weights, and a latent variable representing the fund's strategy. Unlike methods based on reinforcement learning or imitation learning, which require specified rewards or labeled expert objectives, our GAN-based architecture learns directly from the joint distribution of observed holdings and market data. We validate our framework on a dataset of 1436 U.S. equity mutual funds. The learned representations successfully capture known investment styles, such as "growth" and "value," while also revealing implicit manager objectives. For instance, we find that while many funds exhibit characteristics of Markowitz-like optimization, they do so with heterogeneous realizations for turnover, concentration, and latent factors. To analyze and interpret the end-to-end model, we develop a series of tests that explain the model, and we show that the benchmark's expert labeling are contained in our model's encoding in a linear interpretable way. Our framework provides a data-driven approach for characterizing investment strategies for applications in market simulation, strategy attribution, and regulatory oversight.
comment: 6th ACM International Conference on AI in Finance, November 15-18, 2025, Singapore
☆ Segmentation over Complexity: Evaluating Ensemble and Hybrid Approaches for Anomaly Detection in Industrial Time Series
In this study, we investigate the effectiveness of advanced feature engineering and hybrid model architectures for anomaly detection in a multivariate industrial time series, focusing on a steam turbine system. We evaluate the impact of change point-derived statistical features, clustering-based substructure representations, and hybrid learning strategies on detection performance. Despite their theoretical appeal, these complex approaches consistently underperformed compared to a simple Random Forest + XGBoost ensemble trained on segmented data. The ensemble achieved an AUC-ROC of 0.976, F1-score of 0.41, and 100% early detection within the defined time window. Our findings highlight that, in scenarios with highly imbalanced and temporally uncertain data, model simplicity combined with optimized segmentation can outperform more sophisticated architectures, offering greater robustness, interpretability, and operational utility.
comment: This paper is currently under review for presentation at the IEEE SAMI 2026 Conference
☆ Bridging the Gap Between Molecule and Textual Descriptions via Substructure-aware Alignment EMNLP 2025
Molecule and text representation learning has gained increasing interest due to its potential for enhancing the understanding of chemical information. However, existing models often struggle to capture subtle differences between molecules and their descriptions, as they lack the ability to learn fine-grained alignments between molecular substructures and chemical phrases. To address this limitation, we introduce MolBridge, a novel molecule-text learning framework based on substructure-aware alignments. Specifically, we augment the original molecule-description pairs with additional alignment signals derived from molecular substructures and chemical phrases. To effectively learn from these enriched alignments, MolBridge employs substructure-aware contrastive learning, coupled with a self-refinement mechanism that filters out noisy alignment signals. Experimental results show that MolBridge effectively captures fine-grained correspondences and outperforms state-of-the-art baselines on a wide range of molecular benchmarks, highlighting the significance of substructure-aware alignment in molecule-text learning.
comment: EMNLP 2025 (main)
☆ MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction ICCV 2025
Large annotated datasets are essential for training robust Computer-Aided Diagnosis (CAD) models for breast cancer detection or risk prediction. However, acquiring such datasets with fine-detailed annotation is both costly and time-consuming. Vision-Language Models (VLMs), such as CLIP, which are pre-trained on large image-text pairs, offer a promising solution by enhancing robustness and data efficiency in medical imaging tasks. This paper introduces a novel Multi-View Mammography and Language Model for breast cancer classification and risk prediction, trained on a dataset of paired mammogram images and synthetic radiology reports. Our MV-MLM leverages multi-view supervision to learn rich representations from extensive radiology data by employing cross-modal self-supervision across image-text pairs. This includes multiple views and the corresponding pseudo-radiology reports. We propose a novel joint visual-textual learning strategy to enhance generalization and accuracy performance over different data types and tasks to distinguish breast tissues or cancer characteristics(calcification, mass) and utilize these patterns to understand mammography images and predict cancer risk. We evaluated our method on both private and publicly available datasets, demonstrating that the proposed model achieves state-of-the-art performance in three classification tasks: (1) malignancy classification, (2) subtype classification, and (3) image-based cancer risk prediction. Furthermore, the model exhibits strong data efficiency, outperforming existing fully supervised or VLM baselines while trained on synthetic text reports and without the need for actual radiology reports.
comment: Accepted to Computer Vision for Automated Medical Diagnosis (CVAMD) Workshop at ICCV 2025
☆ The FM Agent
Large language models (LLMs) are catalyzing the development of autonomous AI research agents for scientific and engineering discovery. We present FM Agent, a novel and general-purpose multi-agent framework that leverages a synergistic combination of LLM-based reasoning and large-scale evolutionary search to address complex real-world challenges. The core of FM Agent integrates several key innovations: 1) a cold-start initialization phase incorporating expert guidance, 2) a novel evolutionary sampling strategy for iterative optimization, 3) domain-specific evaluators that combine correctness, effectiveness, and LLM-supervised feedback, and 4) a distributed, asynchronous execution infrastructure built on Ray. Demonstrating broad applicability, our system has been evaluated across diverse domains, including operations research, machine learning, GPU kernel optimization, and classical mathematical problems. FM Agent reaches state-of-the-art results autonomously, without human interpretation or tuning -- 1976.3 on ALE-Bench (+5.2\%), 43.56\% on MLE-Bench (+4.0pp), up to 20x speedups on KernelBench, and establishes new state-of-the-art(SOTA) results on several classical mathematical problems. Beyond academic benchmarks, FM Agent shows considerable promise for both large-scale enterprise R\&D workflows and fundamental scientific research, where it can accelerate innovation, automate complex discovery processes, and deliver substantial engineering and scientific advances with broader societal impact.
☆ Reasoning Curriculum: Bootstrapping Broad LLM Reasoning from Math
Reinforcement learning (RL) can elicit strong reasoning in large language models (LLMs), yet most open efforts focus on math and code. We propose Reasoning Curriculum, a simple two-stage curriculum that first elicits reasoning skills in pretraining-aligned domains such as math, then adapts and refines these skills across other domains via joint RL. Stage 1 performs a brief cold start and then math-only RL with verifiable rewards to develop reasoning skills. Stage 2 runs joint RL on mixed-domain data to transfer and consolidate these skills. The curriculum is minimal and backbone-agnostic, requiring no specialized reward models beyond standard verifiability checks. Evaluated on Qwen3-4B and Llama-3.1-8B over a multi-domain suite, reasoning curriculum yields consistent gains. Ablations and a cognitive-skill analysis indicate that both stages are necessary and that math-first elicitation increases cognitive behaviors important for solving complex problems. Reasoning Curriculum provides a compact, easy-to-adopt recipe for general reasoning.
comment: 9 pages
☆ Beyond Benchmarks: The Economics of AI Inference
The inference cost of Large Language Models (LLMs) has become a critical factor in determining their commercial viability and widespread adoption. This paper introduces a quantitative ``economics of inference'' framework, treating the LLM inference process as a compute-driven intelligent production activity. We analyze its marginal cost, economies of scale, and quality of output under various performance configurations. Based on empirical data from WiNEval-3.0, we construct the first ``LLM Inference Production Frontier,'' revealing three principles: diminishing marginal cost, diminishing returns to scale, and an optimal cost-effectiveness zone. This paper not only provides an economic basis for model deployment decisions but also lays an empirical foundation for the future market-based pricing and optimization of AI inference resources.
☆ Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation
Large language models (LLMs) have advanced code generation at the function level, yet their ability to produce correct class-level implementations in authentic software projects remains poorly understood. This work introduces a novel benchmark derived from open-source repositories, comprising real-world classes divided into seen and unseen partitions to evaluate generalization under practical conditions. The evaluation examines multiple LLMs under varied input specifications, retrieval-augmented configurations, and documentation completeness levels. Results reveal a stark performance disparity: LLMs achieve 84% to 89% correctness on established synthetic benchmarks but only 25% to 34% on real-world class tasks, with negligible differences between familiar and novel codebases. Comprehensive docstrings yield modest gains of 1% to 3% in functional accuracy, though statistical significance is rare. Retrieval-augmented generation proves most effective with partial documentation, improving correctness by 4% to 7% by supplying concrete implementation patterns absent from specifications. Error profiling identifies AttributeError, TypeError, and AssertionError as dominant failure modes (84% of cases), with synthetic tests overemphasizing assertion issues and real-world scenarios highlighting type and attribute mismatches. Retrieval augmentation reduces logical flaws but can introduce dependency conflicts. The benchmark and analysis expose critical limitations in current LLM capabilities for class-level engineering, offering actionable insights for enhancing context modelling, documentation strategies, and retrieval integration in production code assistance tools.
comment: Pre-print prepared for journal submission
☆ WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
Vision-based end-to-end (E2E) driving has garnered significant interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios, failing to adequately test the true potential of these systems. Furthermore, existing open-loop evaluation metrics often fall short in capturing the multi-modal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that that are rare in daily life with an occurring frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes the high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate the E2E driving performance on these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional metrics that measure the distance between predicted way points and the logs, RFS measures how closely the predicted trajectory matches rater-annotated trajectory preference labels. We have released rater preference labels for all WOD-E2E validation set segments, while the held out test set labels have been used for the 2025 WOD-E2E Challenge. Through our work, we aim to foster state of the art research into generalizable, robust, and safe end-to-end autonomous driving agents capable of handling complex real-world situations.
☆ EgoExo-Con: Exploring View-Invariant Video Temporal Understanding
Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performances. (2) When naively finetuned with synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. For improvements, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates its superiority over naive SFT and GRPO, especially for improving cross-view consistency. All resources will be made publicly available.
comment: project page: \url{https://minjoong507.github.io/projects/EgoExo-Con/}
☆ Security Risk of Misalignment between Text and Image in Multi-modal Model
Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigations reveal that the alignment between textual and Image modalities in existing diffusion models is inadequate. This misalignment presents significant risks, especially in the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To this end, we propose a novel attack called Prompt-Restricted Multi-modal Attack (PReMA) to manipulate the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs by solely creating adversarial images, distinguishing itself from prior methods that primarily generate adversarial prompts to produce NSFW content. Consequently, PReMA poses a novel threat to the integrity of multi-modal diffusion models, particularly in image-editing applications that operate with fixed prompts. Comprehensive evaluations conducted on image inpainting and style transfer tasks across various models confirm the potent efficacy of PReMA.
☆ SAFE: A Novel Approach to AI Weather Evaluation through Stratified Assessments of Forecasts over Earth
The dominant paradigm in machine learning is to assess model performance based on average loss across all samples in some test set. This amounts to averaging performance geospatially across the Earth in weather and climate settings, failing to account for the non-uniform distribution of human development and geography. We introduce Stratified Assessments of Forecasts over Earth (SAFE), a package for elucidating the stratified performance of a set of predictions made over Earth. SAFE integrates various data domains to stratify by different attributes associated with geospatial gridpoints: territory (usually country), global subregion, income, and landcover (land or water). This allows us to examine the performance of models for each individual stratum of the different attributes (e.g., the accuracy in every individual country). To demonstrate its importance, we utilize SAFE to benchmark a zoo of state-of-the-art AI-based weather prediction models, finding that they all exhibit disparities in forecasting skill across every attribute. We use this to seed a benchmark of model forecast fairness through stratification at different lead times for various climatic variables. By moving beyond globally-averaged metrics, we for the first time ask: where do models perform best or worst, and which models are most fair? To support further work in this direction, the SAFE package is open source and available at https://github.com/N-Masi/safe
☆ GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks
Large vision language models (VLMs) have advanced graphical user interface (GUI) task automation but still lag behind humans. We hypothesize this gap stems from missing core GUI knowledge, which existing training schemes (such as supervised fine tuning and reinforcement learning) alone cannot fully address. By analyzing common failure patterns in GUI task execution, we distill GUI knowledge into three dimensions: (1) interface perception, knowledge about recognizing widgets and system states; (2) interaction prediction, knowledge about reasoning action state transitions; and (3) instruction understanding, knowledge about planning, verifying, and assessing task completion progress. We further introduce GUI Knowledge Bench, a benchmark with multiple choice and yes/no questions across six platforms (Web, Android, MacOS, Windows, Linux, IOS) and 292 applications. Our evaluation shows that current VLMs identify widget functions but struggle with perceiving system states, predicting actions, and verifying task completion. Experiments on real world GUI tasks further validate the close link between GUI knowledge and task success. By providing a structured framework for assessing GUI knowledge, our work supports the selection of VLMs with greater potential prior to downstream training and provides insights for building more capable GUI agents.
☆ Lean4Physics: Comprehensive Reasoning Framework for College-level Physics in Lean4
We present **Lean4PHYS**, a comprehensive reasoning framework for college-level physics problems in Lean4. **Lean4PHYS** includes *LeanPhysBench*, a college-level benchmark for formal physics reasoning in Lean4, which contains 200 hand-crafted and peer-reviewed statements derived from university textbooks and physics competition problems. To establish a solid foundation for formal reasoning in physics, we also introduce *PhysLib*, a community-driven repository containing fundamental unit systems and theorems essential for formal physics reasoning. Based on the benchmark and Lean4 repository we composed in **Lean4PHYS**, we report baseline results using major expert Math Lean4 provers and state-of-the-art closed-source models, with the best performance of DeepSeek-Prover-V2-7B achieving only 16% and Claude-Sonnet-4 achieving 35%. We also conduct a detailed analysis showing that our *PhysLib* can achieve an average improvement of 11.75% in model performance. This demonstrates the challenging nature of our *LeanPhysBench* and the effectiveness of *PhysLib*. To the best of our knowledge, this is the first study to provide a physics benchmark in Lean4.
☆ Network-Constrained Policy Optimization for Adaptive Multi-agent Vehicle Routing
Traffic congestion in urban road networks leads to longer trip times and higher emissions, especially during peak periods. While the Shortest Path First (SPF) algorithm is optimal for a single vehicle in a static network, it performs poorly in dynamic, multi-vehicle settings, often worsening congestion by routing all vehicles along identical paths. We address dynamic vehicle routing through a multi-agent reinforcement learning (MARL) framework for coordinated, network-aware fleet navigation. We first propose Adaptive Navigation (AN), a decentralized MARL model where each intersection agent provides routing guidance based on (i) local traffic and (ii) neighborhood state modeled using Graph Attention Networks (GAT). To improve scalability in large networks, we further propose Hierarchical Hub-based Adaptive Navigation (HHAN), an extension of AN that assigns agents only to key intersections (hubs). Vehicles are routed hub-to-hub under agent control, while SPF handles micro-routing within each hub region. For hub coordination, HHAN adopts centralized training with decentralized execution (CTDE) under the Attentive Q-Mixing (A-QMIX) framework, which aggregates asynchronous vehicle decisions via attention. Hub agents use flow-aware state features that combine local congestion and predictive dynamics for proactive routing. Experiments on synthetic grids and real urban maps (Toronto, Manhattan) show that AN reduces average travel time versus SPF and learning baselines, maintaining 100% routing success. HHAN scales to networks with hundreds of intersections, achieving up to 15.9% improvement under heavy traffic. These findings highlight the potential of network-constrained MARL for scalable, coordinated, and congestion-aware routing in intelligent transportation systems.
comment: 29 pages, 12 figures. Fazel Arasteh and Arian Haghparast contributed equally to this research. Submitted to ACM Transactions on Spatial Algorithms and Systems (TSAS). The code for this work is publicly available at https://github.com/Arianhgh/HHAN
☆ Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism
Specialized Generalist Models (SGMs) aim to preserve broad capabilities while achieving expert-level performance in target domains. However, traditional LLM structures including Transformer, Linear Attention, and hybrid models do not employ specialized memory mechanism guided by task information. In this paper, we present Nirvana, an SGM with specialized memory mechanism, linear time complexity, and test-time task information extraction. Besides, we propose the Task-Aware Memory Trigger ($\textit{Trigger}$) that flexibly adjusts memory mechanism based on the current task's requirements. In Trigger, each incoming sample is treated as a self-supervised fine-tuning task, enabling Nirvana to adapt its task-related parameters on the fly to domain shifts. We also design the Specialized Memory Updater ($\textit{Updater}$) that dynamically memorizes the context guided by Trigger. We conduct experiments on both general language tasks and specialized medical tasks. On a variety of natural language modeling benchmarks, Nirvana achieves competitive or superior results compared to the existing LLM structures. To prove the effectiveness of Trigger on specialized tasks, we test Nirvana's performance on a challenging medical task, i.e., Magnetic Resonance Imaging (MRI). We post-train frozen Nirvana backbone with lightweight codecs on paired electromagnetic signals and MRI images. Despite the frozen Nirvana backbone, Trigger guides the model to adapt to the MRI domain with the change of task-related parameters. Nirvana achieves higher-quality MRI reconstruction compared to conventional MRI models as well as the models with traditional LLMs' backbone, and can also generate accurate preliminary clinical reports accordingly.
☆ Learning Geometry: A Framework for Building Adaptive Manifold Models through Metric Optimization
This paper proposes a novel paradigm for machine learning that moves beyond traditional parameter optimization. Unlike conventional approaches that search for optimal parameters within a fixed geometric space, our core idea is to treat the model itself as a malleable geometric entity. Specifically, we optimize the metric tensor field on a manifold with a predefined topology, thereby dynamically shaping the geometric structure of the model space. To achieve this, we construct a variational framework whose loss function carefully balances data fidelity against the intrinsic geometric complexity of the manifold. The former ensures the model effectively explains observed data, while the latter acts as a regularizer, penalizing overly curved or irregular geometries to encourage simpler models and prevent overfitting. To address the computational challenges of this infinite-dimensional optimization problem, we introduce a practical method based on discrete differential geometry: the continuous manifold is discretized into a triangular mesh, and the metric tensor is parameterized by edge lengths, enabling efficient optimization using automatic differentiation tools. Theoretical analysis reveals a profound analogy between our framework and the Einstein-Hilbert action in general relativity, providing an elegant physical interpretation for the concept of "data-driven geometry". We further argue that even with fixed topology, metric optimization offers significantly greater expressive power than models with fixed geometry. This work lays a solid foundation for constructing fully dynamic "meta-learners" capable of autonomously evolving their geometry and topology, and it points to broad application prospects in areas such as scientific model discovery and robust representation learning.
comment: 9 pages
☆ Data-driven Projection Generation for Efficiently Solving Heterogeneous Quadratic Programming Problems
We propose a data-driven framework for efficiently solving quadratic programming (QP) problems by reducing the number of variables in high-dimensional QPs using instance-specific projection. A graph neural network-based model is designed to generate projections tailored to each QP instance, enabling us to produce high-quality solutions even for previously unseen problems. The model is trained on heterogeneous QPs to minimize the expected objective value evaluated on the projected solutions. This is formulated as a bilevel optimization problem; the inner optimization solves the QP under a given projection using a QP solver, while the outer optimization updates the model parameters. We develop an efficient algorithm to solve this bilevel optimization problem, which computes parameter gradients without backpropagating through the solver. We provide a theoretical analysis of the generalization ability of solving QPs with projection matrices generated by neural networks. Experimental results demonstrate that our method produces high-quality feasible solutions with reduced computation time, outperforming existing methods.
☆ Can AI be Accountable?
The AI we use is powerful, and its power is increasing rapidly. If this powerful AI is to serve the needs of consumers, voters, and decision makers, then it is imperative that the AI is accountable. In general, an agent is accountable to a forum if the forum can request information from the agent about its actions, if the forum and the agent can discuss this information, and if the forum can sanction the agent. Unfortunately, in too many cases today's AI is not accountable -- we cannot question it, enter into a discussion with it, let alone sanction it. In this chapter we relate the general definition of accountability to AI, we illustrate what it means for AI to be accountable and unaccountable, and we explore approaches that can improve our chances of living in a world where all AI is accountable to those who are affected by it.
comment: To be published as a chapter in Daniele Quercia and Marios Constantinides (Eds.). Operationalizing Responsible AI. Cambridge University Press. Forthcoming
☆ Dynamic VLM-Guided Negative Prompting for Diffusion Models NeurIPS 2025
We propose a novel approach for dynamic negative prompting in diffusion models that leverages Vision-Language Models (VLMs) to adaptively generate negative prompts during the denoising process. Unlike traditional Negative Prompting methods that use fixed negative prompts, our method generates intermediate image predictions at specific denoising steps and queries a VLM to produce contextually appropriate negative prompts. We evaluate our approach on various benchmark datasets and demonstrate the trade-offs between negative guidance strength and text-image alignment.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: The First Workshop on Generative and Protective AI for Content Creation
☆ Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
Knowledge distillation (KD) is an effective method for model compression and transferring knowledge between models. However, its effect on model's robustness against spurious correlations that degrade performance on out-of-distribution data remains underexplored. This study investigates the effect of knowledge distillation on the transferability of ``debiasing'' capabilities from teacher models to student models on natural language inference (NLI) and image classification tasks. Through extensive experiments, we illustrate several key findings: (i) overall the debiasing capability of a model is undermined post-KD; (ii) training a debiased model does not benefit from injecting teacher knowledge; (iii) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases; and (iv) we pin-point the internal attention pattern and circuit that causes the distinct behavior post-KD. Given the above findings, we propose three effective solutions to improve the distillability of debiasing methods: developing high quality data for augmentation, implementing iterative knowledge distillation, and initializing student models with weights obtained from teacher models. To the best of our knowledge, this is the first study on the effect of KD on debiasing and its interenal mechanism at scale. Our findings provide understandings on how KD works and how to design better debiasing methods.
☆ SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning
The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic two-step process that starts with an agent definition and generates diverse seed test cases that cover various risk outcomes, tool-use trajectories, and risk sources. Then, it iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of former attempts. To optimize the red-teaming cost, we present a model distillation approach that leverages structured forms of a teacher model's reasoning to train smaller models that are equally effective. Across diverse evaluation agent settings, our seed test case generation approach yields 2 -- 2.5x boost to the coverage of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer model improves attack success rate by 100%, surpassing the 671B Deepseek-R1 model. Our ablations and analyses validate the effectiveness of the iterative framework, structured reasoning, and the generalization of our red-teamer models.
☆ Artificial Intelligence-Enabled Analysis of Radiology Reports: Epidemiology and Consequences of Incidental Thyroid Findings
Importance Incidental thyroid findings (ITFs) are increasingly detected on imaging performed for non-thyroid indications. Their prevalence, features, and clinical consequences remain undefined. Objective To develop, validate, and deploy a natural language processing (NLP) pipeline to identify ITFs in radiology reports and assess their prevalence, features, and clinical outcomes. Design, Setting, and Participants Retrospective cohort of adults without prior thyroid disease undergoing thyroid-capturing imaging at Mayo Clinic sites from July 1, 2017, to September 30, 2023. A transformer-based NLP pipeline identified ITFs and extracted nodule characteristics from image reports from multiple modalities and body regions. Main Outcomes and Measures Prevalence of ITFs, downstream thyroid ultrasound, biopsy, thyroidectomy, and thyroid cancer diagnosis. Logistic regression identified demographic and imaging-related factors. Results Among 115,683 patients (mean age, 56.8 [SD 17.2] years; 52.9% women), 9,077 (7.8%) had an ITF, of which 92.9% were nodules. ITFs were more likely in women, older adults, those with higher BMI, and when imaging was ordered by oncology or internal medicine. Compared with chest CT, ITFs were more likely via neck CT, PET, and nuclear medicine scans. Nodule characteristics were poorly documented, with size reported in 44% and other features in fewer than 15% (e.g. calcifications). Compared with patients without ITFs, those with ITFs had higher odds of thyroid nodule diagnosis, biopsy, thyroidectomy and thyroid cancer diagnosis. Most cancers were papillary, and larger when detected after ITFs vs no ITF. Conclusions ITFs were common and strongly associated with cascades leading to the detection of small, low-risk cancers. These findings underscore the role of ITFs in thyroid cancer overdiagnosis and the need for standardized reporting and more selective follow-up.
☆ AI Mathematician as a Partner in Advancing Mathematical Discovery -- A Case Study in Homogenization Theory
Artificial intelligence (AI) has demonstrated impressive progress in mathematical reasoning, yet its integration into the practice of mathematical research remains limited. In this study, we investigate how the AI Mathematician (AIM) system can operate as a research partner rather than a mere problem solver. Focusing on a challenging problem in homogenization theory, we analyze the autonomous reasoning trajectories of AIM and incorporate targeted human interventions to structure the discovery process. Through iterative decomposition of the problem into tractable subgoals, selection of appropriate analytical methods, and validation of intermediate results, we reveal how human intuition and machine computation can complement one another. This collaborative paradigm enhances the reliability, transparency, and interpretability of the resulting proofs, while retaining human oversight for formal rigor and correctness. The approach leads to a complete and verifiable proof, and more broadly, demonstrates how systematic human-AI co-reasoning can advance the frontier of mathematical discovery.
comment: 52 pages, 1 figure
☆ Detecting Data Contamination in LLMs via In-Context Learning
We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora. The method is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations.
☆ Adaptive Data Flywheel: Applying MAPE Control Loops to AI Agent Improvement
Enterprise AI agents must continuously adapt to maintain accuracy, reduce latency, and remain aligned with user needs. We present a practical implementation of a data flywheel in NVInfo AI, NVIDIA's Mixture-of-Experts (MoE) Knowledge Assistant serving over 30,000 employees. By operationalizing a MAPE-driven data flywheel, we built a closed-loop system that systematically addresses failures in retrieval-augmented generation (RAG) pipelines and enables continuous learning. Over a 3-month post-deployment period, we monitored feedback and collected 495 negative samples. Analysis revealed two major failure modes: routing errors (5.25\%) and query rephrasal errors (3.2\%). Using NVIDIA NeMo microservices, we implemented targeted improvements through fine-tuning. For routing, we replaced a Llama 3.1 70B model with a fine-tuned 8B variant, achieving 96\% accuracy, a 10x reduction in model size, and 70\% latency improvement. For query rephrasal, fine-tuning yielded a 3.7\% gain in accuracy and a 40\% latency reduction. Our approach demonstrates how human-in-the-loop (HITL) feedback, when structured within a data flywheel, transforms enterprise AI agents into self-improving systems. Key learnings include approaches to ensure agent robustness despite limited user feedback, navigating privacy constraints, and executing staged rollouts in production. This work offers a repeatable blueprint for building robust, adaptive enterprise AI agents capable of learning from real-world usage at scale.
comment: 20 pages, 5 figures, 5 tables. Presents MAPE-K control loop application to enterprise AI agent improvement with experimental validation on NVIDIA's NVInfo AI system
☆ e1: Learning Adaptive Control of Reasoning Effort
Increasing the thinking budget of AI models can significantly improve accuracy, but not all questions warrant the same amount of reasoning. Users may prefer to allocate different amounts of reasoning effort depending on how they value output quality versus latency and cost. To leverage this tradeoff effectively, users need fine-grained control over the amount of thinking used for a particular query, but few approaches enable such control. Existing methods require users to specify the absolute number of desired tokens, but this requires knowing the difficulty of the problem beforehand to appropriately set the token budget for a query. To address these issues, we propose Adaptive Effort Control, a self-adaptive reinforcement learning method that trains models to use a user-specified fraction of tokens relative to the current average chain-of-thought length for each query. This approach eliminates dataset- and phase-specific tuning while producing better cost-accuracy tradeoff curves compared to standard methods. Users can dynamically adjust the cost-accuracy trade-off through a continuous effort parameter specified at inference time. We observe that the model automatically learns to allocate resources proportionally to the task difficulty and, across model scales ranging from 1.5B to 32B parameters, our approach enables approximately 3x reduction in chain-of-thought length while maintaining or improving performance relative to the base model used for RL training.
Dataset Creation and Baseline Models for Sexism Detection in Hausa
Sexism reinforces gender inequality and social exclusion by perpetuating stereotypes, bias, and discriminatory norms. Noting how online platforms enable various forms of sexism to thrive, there is a growing need for effective sexism detection and mitigation strategies. While computational approaches to sexism detection are widespread in high-resource languages, progress remains limited in low-resource languages where limited linguistic resources and cultural differences affect how sexism is expressed and perceived. This study introduces the first Hausa sexism detection dataset, developed through community engagement, qualitative coding, and data augmentation. For cultural nuances and linguistic representation, we conducted a two-stage user study (n=66) involving native speakers to explore how sexism is defined and articulated in everyday discourse. We further experiment with both traditional machine learning classifiers and pre-trained multilingual language models and evaluating the effectiveness few-shot learning in detecting sexism in Hausa. Our findings highlight challenges in capturing cultural nuance, particularly with clarification-seeking and idiomatic expressions, and reveal a tendency for many false positives in such cases.
comment: 9 pages, 1 figure, 4 tables
☆ Elastic Architecture Search for Efficient Language Models ICME 2025
As large pre-trained language models become increasingly critical to natural language understanding (NLU) tasks, their substantial computational and memory requirements have raised significant economic and environmental concerns. Addressing these challenges, this paper introduces the Elastic Language Model (ELM), a novel neural architecture search (NAS) method optimized for compact language models. ELM extends existing NAS approaches by introducing a flexible search space with efficient transformer blocks and dynamic modules for dimension and head number adjustment. These innovations enhance the efficiency and flexibility of the search process, which facilitates more thorough and effective exploration of model architectures. We also introduce novel knowledge distillation losses that preserve the unique characteristics of each block, in order to improve the discrimination between architectural choices during the search process. Experiments on masked language modeling and causal language modeling tasks demonstrate that models discovered by ELM significantly outperform existing methods.
comment: ICME 2025
☆ A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics
Visual reasoning, particularly spatial reasoning, is a challenging cognitive task that requires understanding object relationships and their interactions within complex environments, especially in robotics domain. Existing vision_language models (VLMs) excel at perception tasks but struggle with fine-grained spatial reasoning due to their implicit, correlation-driven reasoning and reliance solely on images. We propose a novel neuro_symbolic framework that integrates both panoramic-image and 3D point cloud information, combining neural perception with symbolic reasoning to explicitly model spatial and logical relationships. Our framework consists of a perception module for detecting entities and extracting attributes, and a reasoning module that constructs a structured scene graph to support precise, interpretable queries. Evaluated on the JRDB-Reasoning dataset, our approach demonstrates superior performance and reliability in crowded, human_built environments while maintaining a lightweight design suitable for robotics and embodied AI applications.
☆ Causal Masking on Spatial Data: An Information-Theoretic Case for Learning Spatial Datasets with Unimodal Language Models NeurIPS 2025
Language models are traditionally designed around causal masking. In domains with spatial or relational structure, causal masking is often viewed as inappropriate, and sequential linearizations are instead used. Yet the question of whether it is viable to accept the information loss introduced by causal masking on nonsequential data has received little direct study, in part because few domains offer both spatial and sequential representations of the same dataset. In this work, we investigate this issue in the domain of chess, which naturally supports both representations. We train language models with bidirectional and causal self-attention mechanisms on both spatial (board-based) and sequential (move-based) data. Our results show that models trained on spatial board states - \textit{even with causal masking} - consistently achieve stronger playing strength than models trained on sequential data. While our experiments are conducted on chess, our results are methodological and may have broader implications: applying causal masking to spatial data is a viable procedure for training unimodal LLMs on spatial data, and in some domains is even preferable to sequentialization.
comment: 8 pages, NeurIPS 2025
☆ Jasmine: A Simple, Performant and Scalable JAX-based World Modeling Codebase
While world models are increasingly positioned as a pathway to overcoming data scarcity in domains such as robotics, open training infrastructure for world modeling remains nascent. We introduce Jasmine, a performant JAX-based world modeling codebase that scales from single hosts to hundreds of accelerators with minimal code changes. Jasmine achieves an order-of-magnitude faster reproduction of the CoinRun case study compared to prior open implementations, enabled by performance optimizations across data loading, training and checkpointing. The codebase guarantees fully reproducible training and supports diverse sharding configurations. By pairing Jasmine with curated large-scale datasets, we establish infrastructure for rigorous benchmarking pipelines across model families and architectural ablations.
comment: Blog post: https://pdoom.org/jasmine.html
☆ A Framework for Fair Evaluation of Variance-Aware Bandit Algorithms
Multi-armed bandit (MAB) problems serve as a fundamental building block for more complex reinforcement learning algorithms. However, evaluating and comparing MAB algorithms remains challenging due to the lack of standardized conditions and replicability. This is particularly problematic for variance-aware extensions of classical methods like UCB, whose performance can heavily depend on the underlying environment. In this study, we address how performance differences between bandit algorithms can be reliably observed, and under what conditions variance-aware algorithms outperform classical ones. We present a reproducible evaluation designed to systematically compare eight classical and variance-aware MAB algorithms. The evaluation framework, implemented in our Bandit Playground codebase, features clearly defined experimental setups, multiple performance metrics (reward, regret, reward distribution, value-at-risk, and action optimality), and an interactive evaluation interface that supports consistent and transparent analysis. We show that variance-aware algorithms can offer advantages in settings with high uncertainty where the difficulty arises from subtle differences between arm rewards. In contrast, classical algorithms often perform equally well or better in more separable scenarios or if fine-tuned extensively. Our contributions are twofold: (1) a framework for systematic evaluation of MAB algorithms, and (2) insights into the conditions under which variance-aware approaches outperform their classical counterparts.
comment: Preprint of a paper presented at GI Skill 2025. The final version will appear in the conference proceedings
☆ AIOT based Smart Education System: A Dual Layer Authentication and Context-Aware Tutoring Framework for Learning Environments
The AIoT-Based Smart Education System integrates Artificial Intelligence and IoT to address persistent challenges in contemporary classrooms: attendance fraud, lack of personalization, student disengagement, and inefficient resource use. The unified platform combines four core modules: (1) a dual-factor authentication system leveraging RFID-based ID scans and WiFi verification for secure, fraud-resistant attendance; (2) an AI-powered assistant that provides real-time, context-aware support and dynamic quiz generation based on instructor-supplied materials; (3) automated test generators to streamline adaptive assessment and reduce administrative overhead; and (4) the EcoSmart Campus module, which autonomously regulates classroom lighting, air quality, and temperature using IoT sensors and actuators. Simulated evaluations demonstrate the system's effectiveness in delivering robust real-time monitoring, fostering inclusive engagement, preventing fraudulent practices, and supporting operational scalability. Collectively, the AIoT-Based Smart Education System offers a secure, adaptive, and efficient learning environment, providing a scalable blueprint for future educational innovation and improved student outcomes through the synergistic application of artificial intelligence and IoT technologies.
☆ LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval
Large language models (LLMs) excel at numerical estimation but struggle to correctly quantify uncertainty. We study how well LLMs construct confidence intervals around their own answers and find that they are systematically overconfident. To evaluate this behavior, we introduce FermiEval, a benchmark of Fermi-style estimation questions with a rigorous scoring rule for confidence interval coverage and sharpness. Across several modern models, nominal 99\% intervals cover the true answer only 65\% of the time on average. With a conformal prediction based approach that adjusts the intervals, we obtain accurate 99\% observed coverage, and the Winkler interval score decreases by 54\%. We also propose direct log-probability elicitation and quantile adjustment methods, which further reduce overconfidence at high confidence levels. Finally, we develop a perception-tunnel theory explaining why LLMs exhibit overconfidence: when reasoning under uncertainty, they act as if sampling from a truncated region of their inferred distribution, neglecting its tails.
comment: 8 pages
☆ SUSTAINABLE Platform: Seamless Smart Farming Integration Towards Agronomy Automation SC2
The global agricultural sector is undergoing a transformative shift, driven by increasing food demands, climate variability and the need for sustainable practices. SUSTAINABLE is a smart farming platform designed to integrate IoT, AI, satellite imaging, and role-based task orchestration to enable efficient, traceable, and sustainable agriculture with a pilot usecase in viticulture. This paper explores current smart agriculture solutions, presents a comparative evaluation, and introduces SUSTAINABLE's key features, including satellite index integration, real-time environmental data, and role-aware task management tailored to Mediterranean vineyards.
comment: Accepted for presentation to 11th IEEE International Smart Cities Conference (ISC2 2025)
☆ Fine-Grained Iterative Adversarial Attacks with Limited Computation Budget
This work tackles a critical challenge in AI safety research under limited compute: given a fixed computation budget, how can one maximize the strength of iterative adversarial attacks? Coarsely reducing the number of attack iterations lowers cost but substantially weakens effectiveness. To fulfill the attainable attack efficacy within a constrained budget, we propose a fine-grained control mechanism that selectively recomputes layer activations across both iteration-wise and layer-wise levels. Extensive experiments show that our method consistently outperforms existing baselines at equal cost. Moreover, when integrated into adversarial training, it attains comparable performance with only 30% of the original budget.
Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations
Clinical documentation increasingly uses automatic speech recognition and summarization, yet converting conversations into actionable medical orders for Electronic Health Records remains unexplored. A solution to this problem can significantly reduce the documentation burden of clinicians and directly impact downstream patient care. We introduce the MEDIQA-OE 2025 shared task, the first challenge on extracting medical orders from doctor-patient conversations. Six teams participated in the shared task and experimented with a broad range of approaches, and both closed- and open-weight large language models (LLMs). In this paper, we describe the MEDIQA-OE task, dataset, final leaderboard ranking, and participants' solutions.
☆ Frame Semantic Patterns for Identifying Underreporting of Notifiable Events in Healthcare: The Case of Gender-Based Violence
We introduce a methodology for the identification of notifiable events in the domain of healthcare. The methodology harnesses semantic frames to define fine-grained patterns and search them in unstructured data, namely, open-text fields in e-medical records. We apply the methodology to the problem of underreporting of gender-based violence (GBV) in e-medical records produced during patients' visits to primary care units. A total of eight patterns are defined and searched on a corpus of 21 million sentences in Brazilian Portuguese extracted from e-SUS APS. The results are manually evaluated by linguists and the precision of each pattern measured. Our findings reveal that the methodology effectively identifies reports of violence with a precision of 0.726, confirming its robustness. Designed as a transparent, efficient, low-carbon, and language-agnostic pipeline, the approach can be easily adapted to other health surveillance contexts, contributing to the broader, ethical, and explainable use of NLP in public health systems.
☆ Using Salient Object Detection to Identify Manipulative Cookie Banners that Circumvent GDPR AAAI
The main goal of this paper is to study how often cookie banners that comply with the General Data Protection Regulation (GDPR) contain aesthetic manipulation, a design tactic to draw users' attention to the button that permits personal data sharing. As a byproduct of this goal, we also evaluate how frequently the banners comply with GDPR and the recommendations of national data protection authorities regarding banner designs. We visited 2,579 websites and identified the type of cookie banner implemented. Although 45% of the relevant websites have fully compliant banners, we found aesthetic manipulation on 38% of the compliant banners. Unlike prior studies of aesthetic manipulation, we use a computer vision model for salient object detection to measure how salient (i.e., attention-drawing) each banner element is. This enables the discovery of new types of aesthetic manipulation (e.g., button placement), and leads us to conclude that aesthetic manipulation is more common than previously reported (38% vs 27% of banners). To study the effects of user and/or website location on cookie banner design, we include websites within the European Union (EU), where privacy regulation enforcement is more stringent, and websites outside the EU. We visited websites from IP addresses in the EU and from IP addresses in the United States (US). We find that 13.9% of EU websites change their banner design when the user is from the US, and EU websites are roughly 48.3% more likely to use aesthetic manipulation than non-EU websites, highlighting their innovative responses to privacy regulation.
comment: Accepted to International AAAI Conference on Web and Social Media 2026 (ICWSM'26)
☆ Can machines think efficiently?
The Turing Test is no longer adequate for distinguishing human and machine intelligence. With advanced artificial intelligence systems already passing the original Turing Test and contributing to serious ethical and environmental concerns, we urgently need to update the test. This work expands upon the original imitation game by accounting for an additional factor: the energy spent answering the questions. By adding the constraint of energy, the new test forces us to evaluate intelligence through the lens of efficiency, connecting the abstract problem of thinking to the concrete reality of finite resources. Further, this proposed new test ensures the evaluation of intelligence has a measurable, practical finish line that the original test lacks. This additional constraint compels society to weigh the time savings of using artificial intelligence against its total resource cost.
☆ LLM-based Multi-class Attack Analysis and Mitigation Framework in IoT/IIoT Networks
The Internet of Things has expanded rapidly, transforming communication and operations across industries but also increasing the attack surface and security breaches. Artificial Intelligence plays a key role in securing IoT, enabling attack detection, attack behavior analysis, and mitigation suggestion. Despite advancements, evaluations remain purely qualitative, and the lack of a standardized, objective benchmark for quantitatively measuring AI-based attack analysis and mitigation hinders consistent assessment of model effectiveness. In this work, we propose a hybrid framework combining Machine Learning (ML) for multi-class attack detection with Large Language Models (LLMs) for attack behavior analysis and mitigation suggestion. After benchmarking several ML and Deep Learning (DL) classifiers on the Edge-IIoTset and CICIoT2023 datasets, we applied structured role-play prompt engineering with Retrieval-Augmented Generation (RAG) to guide ChatGPT-o3 and DeepSeek-R1 in producing detailed, context-aware responses. We introduce novel evaluation metrics for quantitative assessment to guide us and an ensemble of judge LLMs, namely ChatGPT-4o, DeepSeek-V3, Mixtral 8x7B Instruct, Gemini 2.5 Flash, Meta Llama 4, TII Falcon H1 34B Instruct, xAI Grok 3, and Claude 4 Sonnet, to independently evaluate the responses. Results show that Random Forest has the best detection model, and ChatGPT-o3 outperformed DeepSeek-R1 in attack analysis and mitigation.
☆ Mind the Gaps: Auditing and Reducing Group Inequity in Large-Scale Mobility Prediction
Next location prediction underpins a growing number of mobility, retail, and public-health applications, yet its societal impacts remain largely unexplored. In this paper, we audit state-of-the-art mobility prediction models trained on a large-scale dataset, highlighting hidden disparities based on user demographics. Drawing from aggregate census data, we compute the difference in predictive performance on racial and ethnic user groups and show a systematic disparity resulting from the underlying dataset, resulting in large differences in accuracy based on location and user groups. To address this, we propose Fairness-Guided Incremental Sampling (FGIS), a group-aware sampling strategy designed for incremental data collection settings. Because individual-level demographic labels are unavailable, we introduce Size-Aware K-Means (SAKM), a clustering method that partitions users in latent mobility space while enforcing census-derived group proportions. This yields proxy racial labels for the four largest groups in the state: Asian, Black, Hispanic, and White. Built on these labels, our sampling algorithm prioritizes users based on expected performance gains and current group representation. This method incrementally constructs training datasets that reduce demographic performance gaps while preserving overall accuracy. Our method reduces total disparity between groups by up to 40\% with minimal accuracy trade-offs, as evaluated on a state-of-art MetaPath2Vec model and a transformer-encoder model. Improvements are most significant in early sampling stages, highlighting the potential for fairness-aware strategies to deliver meaningful gains even in low-resource settings. Our findings expose structural inequities in mobility prediction pipelines and demonstrate how lightweight, data-centric interventions can improve fairness with little added complexity, especially for low-data applications.
☆ RepV: Safety-Separable Latent Spaces for Scalable Neurosymbolic Plan Verification
As AI systems migrate to safety-critical domains, verifying that their actions comply with well-defined rules remains a challenge. Formal methods provide provable guarantees but demand hand-crafted temporal-logic specifications, offering limited expressiveness and accessibility. Deep learning approaches enable evaluation of plans against natural-language constraints, yet their opaque decision process invites misclassifications with potentially severe consequences. We introduce RepV, a neurosymbolic verifier that unifies both views by learning a latent space where safe and unsafe plans are linearly separable. Starting from a modest seed set of plans labeled by an off-the-shelf model checker, RepV trains a lightweight projector that embeds each plan, together with a language model-generated rationale, into a low-dimensional space; a frozen linear boundary then verifies compliance for unseen natural-language rules in a single forward pass. Beyond binary classification, RepV provides a probabilistic guarantee on the likelihood of correct verification based on its position in the latent space. This guarantee enables a guarantee-driven refinement of the planner, improving rule compliance without human annotations. Empirical evaluations show that RepV improves compliance prediction accuracy by up to 15% compared to baseline methods while adding fewer than 0.2M parameters. Furthermore, our refinement framework outperforms ordinary fine-tuning baselines across various planning domains. These results show that safety-separable latent spaces offer a scalable, plug-and-play primitive for reliable neurosymbolic plan verification. Code and data are available at: https://repv-project.github.io/.
comment: Code and data are available at: https://repv-project.github.io/
☆ Scale-Aware Curriculum Learning for Ddata-Efficient Lung Nodule Detection with YOLOv11
Lung nodule detection in chest CT is crucial for early lung cancer diagnosis, yet existing deep learning approaches face challenges when deployed in clinical settings with limited annotated data. While curriculum learning has shown promise in improving model training, traditional static curriculum strategies fail in data-scarce scenarios. We propose Scale Adaptive Curriculum Learning (SACL), a novel training strategy that dynamically adjusts curriculum design based on available data scale. SACL introduces three key mechanisms:(1) adaptive epoch scheduling, (2) hard sample injection, and (3) scale-aware optimization. We evaluate SACL on the LUNA25 dataset using YOLOv11 as the base detector. Experimental results demonstrate that while SACL achieves comparable performance to static curriculum learning on the full dataset in mAP50, it shows significant advantages under data-limited conditions with 4.6%, 3.5%, and 2.0% improvements over baseline at 10%, 20%, and 50% of training data respectively. By enabling robust training across varying data scales without architectural modifications, SACL provides a practical solution for healthcare institutions to develop effective lung nodule detection systems despite limited annotation resources.
comment: 5 pages, 2 figures
☆ Heterogeneous Robot Collaboration in Unstructured Environments with Grounded Generative Intelligence
Heterogeneous robot teams operating in realistic settings often must accomplish complex missions requiring collaboration and adaptation to information acquired online. Because robot teams frequently operate in unstructured environments -- uncertain, open-world settings without prior maps -- subtasks must be grounded in robot capabilities and the physical world. While heterogeneous teams have typically been designed for fixed specifications, generative intelligence opens the possibility of teams that can accomplish a wide range of missions described in natural language. However, current large language model (LLM)-enabled teaming methods typically assume well-structured and known environments, limiting deployment in unstructured environments. We present SPINE-HT, a framework that addresses these limitations by grounding the reasoning abilities of LLMs in the context of a heterogeneous robot team through a three-stage process. Given language specifications describing mission goals and team capabilities, an LLM generates grounded subtasks which are validated for feasibility. Subtasks are then assigned to robots based on capabilities such as traversability or perception and refined given feedback collected during online operation. In simulation experiments with closed-loop perception and control, our framework achieves nearly twice the success rate compared to prior LLM-enabled heterogeneous teaming approaches. In real-world experiments with a Clearpath Jackal, a Clearpath Husky, a Boston Dynamics Spot, and a high-altitude UAV, our method achieves an 87\% success rate in missions requiring reasoning about robot capabilities and refining subtasks with online feedback. More information is provided at https://zacravichandran.github.io/SPINE-HT.
☆ Cognition Envelopes for Bounded AI Reasoning in Autonomous UAS Operations
Cyber-physical systems increasingly rely on Foundational Models such as Large Language Models (LLMs) and Vision-Language Models (VLMs) to increase autonomy through enhanced perception, inference, and planning. However, these models also introduce new types of errors, such as hallucinations, overgeneralizations, and context misalignments, resulting in incorrect and flawed decisions. To address this, we introduce the concept of Cognition Envelopes, designed to establish reasoning boundaries that constrain AI-generated decisions while complementing the use of meta-cognition and traditional safety envelopes. As with safety envelopes, Cognition Envelopes require practical guidelines and systematic processes for their definition, validation, and assurance.
comment: 10.5 pages, 9 figures
☆ How Similar Are Grokipedia and Wikipedia? A Multi-Dimensional Textual and Structural Comparison
The launch of Grokipedia, an AI-generated encyclopedia developed by Elon Musk's xAI, was presented as a response to perceived ideological and structural biases in Wikipedia, aiming to produce "truthful" entries via the large language model Grok. Yet whether an AI-driven alternative can escape the biases and limitations of human-edited platforms remains unclear. This study undertakes a large-scale computational comparison of 382 matched article pairs between Grokipedia and Wikipedia. Using metrics across lexical richness, readability, structural organization, reference density, and semantic similarity, we assess how closely the two platforms align in form and substance. The results show that while Grokipedia exhibits strong semantic and stylistic alignment with Wikipedia, it typically produces longer but less lexically diverse articles, with fewer references per word and more variable structural depth. These findings suggest that AI-generated encyclopedic content currently mirrors Wikipedia's informational scope but diverges in editorial norms, favoring narrative expansion over citation-based verification. The implications highlight new tensions around transparency, provenance, and the governance of knowledge in an era of automated text generation.
comment: 13 pages, 5 figures, 2 tables
☆ BI-DCGAN: A Theoretically Grounded Bayesian Framework for Efficient and Diverse GANs
Generative Adversarial Networks (GANs) are proficient at generating synthetic data but continue to suffer from mode collapse, where the generator produces a narrow range of outputs that fool the discriminator but fail to capture the full data distribution. This limitation is particularly problematic, as generative models are increasingly deployed in real-world applications that demand both diversity and uncertainty awareness. In response, we introduce BI-DCGAN, a Bayesian extension of DCGAN that incorporates model uncertainty into the generative process while maintaining computational efficiency. BI-DCGAN integrates Bayes by Backprop to learn a distribution over network weights and employs mean-field variational inference to efficiently approximate the posterior distribution during GAN training. We establishes the first theoretical proof, based on covariance matrix analysis, that Bayesian modeling enhances sample diversity in GANs. We validate this theoretical result through extensive experiments on standard generative benchmarks, demonstrating that BI-DCGAN produces more diverse and robust outputs than conventional DCGANs, while maintaining training efficiency. These findings position BI-DCGAN as a scalable and timely solution for applications where both diversity and uncertainty are critical, and where modern alternatives like diffusion models remain too resource-intensive.
☆ The Denario project: Deep knowledge AI agents for scientific discovery
We present Denario, an AI multi-agent system designed to serve as a scientific research assistant. Denario can perform many different tasks, such as generating ideas, checking the literature, developing research plans, writing and executing code, making plots, and drafting and reviewing a scientific paper. The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or carrying out end-to-end scientific analysis using Cmbagent as a deep-research backend. In this work, we describe in detail Denario and its modules, and illustrate its capabilities by presenting multiple AI-generated papers generated by it in many different scientific disciplines such as astrophysics, biology, biophysics, biomedical informatics, chemistry, material science, mathematical physics, medicine, neuroscience and planetary science. Denario also excels at combining ideas from different disciplines, and we illustrate this by showing a paper that applies methods from quantum physics and machine learning to astrophysical data. We report the evaluations performed on these papers by domain experts, who provided both numerical scores and review-like feedback. We then highlight the strengths, weaknesses, and limitations of the current system. Finally, we discuss the ethical implications of AI-driven research and reflect on how such technology relates to the philosophy of science. We publicly release the code at https://github.com/AstroPilot-AI/Denario. A Denario demo can also be run directly on the web at https://huggingface.co/spaces/astropilot-ai/Denario, and the full app will be deployed on the cloud.
comment: 272 pages. Examples of 11 AI-generated paper drafts from different scientific disciplines. Code publicly available at https://github.com/AstroPilot-AI/Denario
☆ Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs) as we find in preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measurements, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation on popular proprietary and open-weight VLMs shows that even the strongest frontier VLMs struggle measurement reading in general. A consistent failure mode is indicator localization: models can read digits or labels but misidentify the key positions of pointers or alignments, leading to big numeric errors despite plausible textual reasoning. We have also conducted preliminary experiments with reinforcement learning over synthetic data, and find encouraging results on in-domain synthetic subset but less promising for real-world images. Our analysis highlights a fundamental limitation of current VLMs in fine-grained spatial grounding. We hope this resource can help future advances on visually grounded numeracy and precise spatial perception of VLMs, bridging the gap between recognizing numbers and measuring the world.
comment: Project page: https://flageval-baai.github.io/MeasureBenchPage/
☆ Leveraging Foundation Models for Enhancing Robot Perception and Action
This thesis investigates how foundation models can be systematically leveraged to enhance robotic capabilities, enabling more effective localization, interaction, and manipulation in unstructured environments. The work is structured around four core lines of inquiry, each addressing a fundamental challenge in robotics while collectively contributing to a cohesive framework for semantics-aware robotic intelligence.
comment: Doctoral thesis
☆ Inverse Knowledge Search over Verifiable Reasoning: Synthesizing a Scientific Encyclopedia from a Long Chains-of-Thought Knowledge Base
Most scientific materials compress reasoning, presenting conclusions while omitting the derivational chains that justify them. This compression hinders verification by lacking explicit, step-wise justifications and inhibits cross-domain links by collapsing the very pathways that establish the logical and causal connections between concepts. We introduce a scalable framework that decompresses scientific reasoning, constructing a verifiable Long Chain-of-Thought (LCoT) knowledge base and projecting it into an emergent encyclopedia, SciencePedia. Our pipeline operationalizes an endpoint-driven, reductionist strategy: a Socratic agent, guided by a curriculum of around 200 courses, generates approximately 3 million first-principles questions. To ensure high fidelity, multiple independent solver models generate LCoTs, which are then rigorously filtered by prompt sanitization and cross-model answer consensus, retaining only those with verifiable endpoints. This verified corpus powers the Brainstorm Search Engine, which performs inverse knowledge search -- retrieving diverse, first-principles derivations that culminate in a target concept. This engine, in turn, feeds the Plato synthesizer, which narrates these verified chains into coherent articles. The initial SciencePedia comprises approximately 200,000 fine-grained entries spanning mathematics, physics, chemistry, biology, engineering, and computation. In evaluations across six disciplines, Plato-synthesized articles (conditioned on retrieved LCoTs) exhibit substantially higher knowledge-point density and significantly lower factual error rates than an equally-prompted baseline without retrieval (as judged by an external LLM). Built on this verifiable LCoT knowledge base, this reasoning-centric approach enables trustworthy, cross-domain scientific synthesis at scale and establishes the foundation for an ever-expanding encyclopedia.
comment: 43 pages, 4 figures
☆ CATArena: Evaluation of LLM Agents through Iterative Tournament Competitions
Large Language Model (LLM) agents have evolved from basic text generation to autonomously completing complex tasks through interaction with external tools. However, current benchmarks mainly assess end-to-end performance in fixed scenarios, restricting evaluation to specific skills and suffering from score saturation and growing dependence on expert annotation as agent capabilities improve. In this work, we emphasize the importance of learning ability, including both self-improvement and peer-learning, as a core driver for agent evolution toward human-level intelligence. We propose an iterative, competitive peer-learning framework, which allows agents to refine and optimize their strategies through repeated interactions and feedback, thereby systematically evaluating their learning capabilities. To address the score saturation issue in current benchmarks, we introduce CATArena, a tournament-style evaluation platform featuring four diverse board and card games with open-ended scoring. By providing tasks without explicit upper score limits, CATArena enables continuous and dynamic evaluation of rapidly advancing agent capabilities. Experimental results and analyses involving both minimal and commercial code agents demonstrate that CATArena provides reliable, stable, and scalable benchmarking for core agent abilities, particularly learning ability and strategy coding.
☆ Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token
Large Language Models (LLMs) are susceptible to jailbreak attacks where malicious prompts are disguised using ciphers and character-level encodings to bypass safety guardrails. While these guardrails often fail to interpret the encoded content, the underlying models can still process the harmful instructions. We introduce CPT-Filtering, a novel, model-agnostic with negligible-costs and near-perfect accuracy guardrail technique that aims to mitigate these attacks by leveraging the intrinsic behavior of Byte-Pair Encoding (BPE) tokenizers. Our method is based on the principle that tokenizers, trained on natural language, represent out-of-distribution text, such as ciphers, using a significantly higher number of shorter tokens. Our technique uses a simple yet powerful artifact of using language models: the average number of Characters Per Token (CPT) in the text. This approach is motivated by the high compute cost of modern methods - relying on added modules such as dedicated LLMs or perplexity models. We validate our approach across a large dataset of over 100,000 prompts, testing numerous encoding schemes with several popular tokenizers. Our experiments demonstrate that a simple CPT threshold robustly identifies encoded text with high accuracy, even for very short inputs. CPT-Filtering provides a practical defense layer that can be immediately deployed for real-time text filtering and offline data curation.
comment: 16 pages, 9 figures
♻ ☆ UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection
The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and codes will be made publicly available at https://github.com/quanlin-wu/unisite.
♻ ☆ Completion $\neq$ Collaboration: Scaling Collaborative Effort with Agents
Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent's utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.
comment: 22 pages, 5 figures, 3 tables
♻ ☆ Controlling Thinking Speed in Reasoning Models NeurIPS 2025
Human cognition is theorized to operate in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. While current Large Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform fast thinking leads to high computational overhead and latency. In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance. For the first question, we identify the steering vector that governs slow-fast thinking transitions in LRMs' representation space. Using this vector, we achieve the first representation editing-based test-time scaling effect, outperforming existing prompt-based scaling methods. For the second question, we apply real-time difficulty estimation to signal reasoning segments of varying complexity. Combining these techniques, we propose the first reasoning strategy that enables fast processing of easy steps and deeper analysis for complex reasoning. Without any training or additional cost, our plug-in module delivers an average +1.3% accuracy with -8.6% token usage across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research.
comment: NeurIPS 2025 Spotlight
♻ ☆ RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models: https://huggingface.co/collections/nvidia/reward-models-10-2025
comment: Added link to access models: https://huggingface.co/collections/nvidia/reward-models-10-2025
♻ ☆ Curriculum Abductive Learning NeurIPS 2025
Abductive Learning (ABL) integrates machine learning with logical reasoning in a loop: a learning model predicts symbolic concept labels from raw inputs, which are revised through abduction using domain knowledge and then fed back for retraining. However, due to the nondeterminism of abduction, the training process often suffers from instability, especially when the knowledge base is large and complex, resulting in a prohibitively large abduction space. While prior works focus on improving candidate selection within this space, they typically treat the knowledge base as a static black box. In this work, we propose Curriculum Abductive Learning (C-ABL), a method that explicitly leverages the internal structure of the knowledge base to address the ABL training challenges. C-ABL partitions the knowledge base into a sequence of sub-bases, progressively introduced during training. This reduces the abduction space throughout training and enables the model to incorporate logic in a stepwise, smooth way. Experiments across multiple tasks show that C-ABL outperforms previous ABL implementations, significantly improves training stability, convergence speed, and final accuracy, especially under complex knowledge setting.
comment: Accepted by NeurIPS 2025, 22 pages, 6 figures
♻ ☆ Guided Model Merging for Hybrid Data Learning: Leveraging Centralized Data to Refine Decentralized Models WACV 2026
Current network training paradigms primarily focus on either centralized or decentralized data regimes. However, in practice, data availability often exhibits a hybrid nature, where both regimes coexist. This hybrid setting presents new opportunities for model training, as the two regimes offer complementary trade-offs: decentralized data is abundant but subject to heterogeneity and communication constraints, while centralized data, though limited in volume and potentially unrepresentative, enables better curation and high-throughput access. Despite its potential, effectively combining these paradigms remains challenging, and few frameworks are tailored to hybrid data regimes. To address this, we propose a novel framework that constructs a model atlas from decentralized models and leverages centralized data to refine a global model within this structured space. The refined model is then used to reinitialize the decentralized models. Our method synergizes federated learning (to exploit decentralized data) and model merging (to utilize centralized data), enabling effective training under hybrid data availability. Theoretically, we show that our approach achieves faster convergence than methods relying solely on decentralized data, due to variance reduction in the merging process. Extensive experiments demonstrate that our framework consistently outperforms purely centralized, purely decentralized, and existing hybrid-adaptable methods. Notably, our method remains robust even when the centralized and decentralized data domains differ or when decentralized data contains noise, significantly broadening its applicability.
comment: Accepted at WACV 2026
♻ ☆ Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning
Large Language Models (LLMs) have demonstrated remarkable progress through preference-based fine-tuning, which critically depends on the quality of the underlying training data. While human feedback is essential for improving data quality, it is costly and does not scale well. In this paper, we introduce Refine-n-Judge, an automated iterative approach that leverages a single LLM as both a refiner and a judge to enhance dataset quality. Unlike existing iterative refinement methods, Refine-n-Judge employs an LLM to both generate refinements and explicitly evaluate each improvement, ensuring that every iteration meaningfully enhances the dataset without requiring additional human annotation or a separate reward model. At each step, the LLM refines a response and judges whether the refinement is an improvement over the previous answer. This process continues until the LLM prefers the initial answer over the refinement, indicating no further improvements. This produces sequences of increasing quality, preference-labeled responses ideal for fine-tuning. We demonstrate the effectiveness of Refine-n-Judge across a range of public datasets spanning five corpora, targeting tasks such as coding, math, and conversation. Models (Llama 3.1-8B and Llama 3.3-70B) fine-tuned on Refine-n-Judge-enhanced datasets were preferred by LLM judges in over 74% of comparisons against models tuned on the original dataset by GPT-4. Additionally, we report performance gains: +5% on AlpacaEval and AlpacaEval 2.0, and +19% on MT-Bench. Our results indicate that Refine-n-Judge produces high-quality datasets and scalable model improvements.
♻ ☆ CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting ISWC 2025
Language interpretation is a compositional process, in which the meaning of more complex linguistic structures is inferred from the meaning of their parts. Large language models possess remarkable language interpretation capabilities and have been successfully applied to interpret questions by mapping them to SPARQL queries. An open question is how systematic this interpretation process is. Toward this question, in this paper, we propose a benchmark for investigating to what extent the abilities of LLMs to interpret questions are actually compositional. For this, we generate three datasets of varying difficulty based on graph patterns in DBpedia, relying on Lemon lexica for verbalization. Our datasets are created in a very controlled fashion in order to test the ability of LLMs to interpret structurally complex questions, given that they have seen the atomic building blocks. This allows us to evaluate to what degree LLMs are able to interpret complex questions for which they "understand" the atomic parts. We conduct experiments with models of different sizes using both various prompt and few-shot optimization techniques as well as fine-tuning. Our results show that performance in terms of macro $F_1$ degrades from $0.45$ over $0.26$ down to $0.09$ with increasing deviation from the samples optimized on. Even when all necessary information was provided to the model in the input, the $F_1$ scores do not exceed $0.57$ for the dataset of lowest complexity. We thus conclude that LLMs struggle to systematically and compositionally interpret questions and map them into SPARQL queries.
comment: Research Track, 24th International Semantic Web Conference (ISWC 2025), November 2-6, 2025, Nara, Japan
♻ ☆ Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media
On social media, many individuals experiencing suicidal ideation (SI) do not disclose their distress explicitly. Instead, signs may surface indirectly through everyday posts or peer interactions. Detecting such implicit signals early is critical but remains challenging. We frame early and implicit SI as a forward-looking prediction task and develop a computational framework that models a user's information environment, consisting of both their longitudinal posting histories as well as the discourse of their socially proximal peers. We adopted a composite network centrality measure to identify top neighbors of a user, and temporally aligned the user's and neighbors' interactions -- integrating the multi-layered signals in a fine-tuned DeBERTa-v3 model. In a Reddit study of 1,000 (500 Case and 500 Control) users, our approach improves early and implicit SI detection by 15% over individual-only baselines. These findings highlight that peer interactions offer valuable predictive signals and carry broader implications for designing early detection systems that capture indirect as well as masked expressions of risk in online environments.
♻ ☆ Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient
Extracting features from the speech is the most critical process in speech signal processing. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used features in the majority of the speaker and speech recognition applications, as the filtering in this feature is similar to the filtering taking place in the human ear. But the main drawback of this feature is that it provides only the frequency information of the signal but does not provide the information about at what time which frequency is present. The wavelet transform, with its flexible time-frequency window, provides time and frequency information of the signal and is an appropriate tool for the analysis of non-stationary signals like speech. On the other hand, because of its uniform frequency scaling, a typical wavelet transform may be less effective in analysing speech signals, have poorer frequency resolution in low frequencies, and be less in line with human auditory perception. Hence, it is necessary to develop a feature that incorporates the merits of both MFCC and wavelet transform. A great deal of studies are trying to combine both these features. The present Wavelet Transform based Mel-scaled feature extraction methods require more computation when a wavelet transform is applied on top of Mel-scale filtering, since it adds extra processing steps. Here we are proposing a method to extract Mel scale features in time domain combining the concept of wavelet transform, thus reducing the computational burden of time-frequency conversion and the complexity of wavelet extraction. Combining our proposed Time domain Mel frequency Wavelet Coefficient(TMFWC) technique with the reservoir computing methodology has significantly improved the efficiency of audio signal processing.
♻ ☆ MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.
comment: 20 pages, 8 figures
♻ ☆ LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback
Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.
♻ ☆ SignalLLM: A General-Purpose LLM Agent Framework for Automated Signal Processing
Modern signal processing (SP) pipelines, whether model-based or data-driven, often constrained by complex and fragmented workflow, rely heavily on expert knowledge and manual engineering, and struggle with adaptability and generalization under limited data. In contrast, Large Language Models (LLMs) offer strong reasoning capabilities, broad general-purpose knowledge, in-context learning, and cross-modal transfer abilities, positioning them as powerful tools for automating and generalizing SP workflows. Motivated by these potentials, we introduce SignalLLM, the first general-purpose LLM-based agent framework for general SP tasks. Unlike prior LLM-based SP approaches that are limited to narrow applications or tricky prompting, SignalLLM introduces a principled, modular architecture. It decomposes high-level SP goals into structured subtasks via in-context learning and domain-specific retrieval, followed by hierarchical planning through adaptive retrieval-augmented generation (RAG) and refinement; these subtasks are then executed through prompt-based reasoning, cross-modal reasoning, code synthesis, model invocation, or data-driven LLM-assisted modeling. Its generalizable design enables the flexible selection of problem solving strategies across different signal modalities, task types, and data conditions. We demonstrate the versatility and effectiveness of SignalLLM through five representative tasks in communication and sensing, such as radar target detection, human activity recognition, and text compression. Experimental results show superior performance over traditional and existing LLM-based methods, particularly in few-shot and zero-shot settings.
comment: 11 pages
♻ ☆ AI's Social Forcefield: Reshaping Distributed Cognition in Human-AI Teams
AI is not only a neutral tool in team settings; it actively reshapes the social and cognitive fabric of collaboration. We advance a unified framework of alignment in distributed cognition in human-AI teams -- a process through which linguistic, cognitive, and social coordination emerge as human and AI agents co-construct a shared representational space. Across two studies, we show that exposure to AI-generated language shapes not only how people speak, but also how they think, what they attend to, and how they relate to each other. Together, these findings reveal how AI participation reorganizes the distributed cognitive architecture of teams: AI systems function as implicit social forcefields. Our findings highlight the double-edged impact of AI: the same mechanisms that enable efficient collaboration can also erode epistemic diversity and undermine natural alignment processes. We argue for rethinking AI in teams as a socially influential actor and call for new design paradigms that foreground transparency, controllability, and group-level dynamics to foster responsible, productive human-AI collaboration.
♻ ☆ Epistemic Diversity and Knowledge Collapse in Large Language Models
Large language models (LLMs) tend to generate lexically, semantically, and stylistically homogenous texts. This poses a risk of knowledge collapse, where homogenous LLMs mediate a shrinking in the range of accessible information over time. Existing works on homogenization are limited by a focus on closed-ended multiple-choice setups or fuzzy semantic features, and do not look at trends across time and cultural contexts. To overcome this, we present a new methodology to measure epistemic diversity, i.e., variation in real-world claims in LLM outputs, which we use to perform a broad empirical study of LLM knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations sourced from real user chats. For the topics in our study, we show that while newer models tend to generate more diverse claims, nearly all models are less epistemically diverse than a basic web search. We find that model size has a negative impact on epistemic diversity, while retrieval-augmented generation (RAG) has a positive impact, though the improvement from RAG varies by the cultural context. Finally, compared to a traditional knowledge source (Wikipedia), we find that country-specific claims reflect the English language more than the local one, highlighting a gap in epistemic representation
comment: 16 pages; 8 figures, 4 tables; v2 changelog: Fixed the modeling for table 3, random effect is the model version; v3 changelog: Fixed minor formatting issues in tables 2 and 3; v4 changelog: Fixed some typos and model description
♻ ☆ Incentivizing LLMs to Self-Verify Their Answers
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks through both post-training and test-time scaling laws. While prevalent test-time scaling approaches are often realized by using external reward models to guide the model generation process, we find that only marginal gains can be acquired when scaling a model post-trained on specific reasoning tasks. We identify that the limited improvement stems from distribution discrepancies between the specific post-trained generator and the general reward model. To address this, we propose a framework that incentivizes LLMs to self-verify their own answers. By unifying answer generation and verification within a single reinforcement learning (RL) process, we train models that can effectively assess the correctness of their own solutions. The trained model can further scale its performance at inference time by verifying its generations, without the need for external verifiers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B, demonstrating their capabilities across varying reasoning context lengths. Experiments on multiple mathematical reasoning benchmarks show that our models can not only improve post-training performance but also enable effective test-time scaling.
♻ ☆ Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach
While several high profile video games have served as testbeds for Deep Reinforcement Learning (DRL), this technique has rarely been employed by the game industry for crafting authentic AI behaviors. Previous research focuses on training super-human agents with large models, which is impractical for game studios with limited resources aiming for human-like agents. This paper proposes a sample-efficient DRL method tailored for training and fine-tuning agents in industrial settings such as the video game industry. Our method improves sample efficiency of value-based DRL by leveraging pre-collected data and increasing network plasticity. We evaluate our method training a goalkeeper agent in EA SPORTS FC 25, one of the best-selling football simulations today. Our agent outperforms the game's built-in AI by 10% in ball saving rate. Ablation studies show that our method trains agents 50% faster compared to standard DRL methods. Finally, qualitative evaluation from domain experts indicates that our approach creates more human-like gameplay compared to hand-crafted agents. As a testament to the impact of the approach, the method has been adopted for use in the most recent release of the series.
♻ ☆ A mathematical certification for positivity conditions in Neural Networks with applications to partial monotonicity and Trustworthy AI
Artificial Neural Networks (ANNs) have become a powerful tool for modeling complex relationships in large-scale datasets. However, their black-box nature poses trustworthiness challenges. In certain situations, ensuring trust in predictions might require following specific partial monotonicity constraints. However, certifying if an already-trained ANN is partially monotonic is challenging. Therefore, ANNs are often disregarded in some critical applications, such as credit scoring, where partial monotonicity is required. To address this challenge, this paper presents a novel algorithm (LipVor) that certifies if a black-box model, such as an ANN, is positive based on a finite number of evaluations. Consequently, since partial monotonicity can be expressed as a positivity condition on partial derivatives, LipVor can certify whether an ANN is partially monotonic. To do so, for every positively evaluated point, the Lipschitzianity of the black-box model is used to construct a specific neighborhood where the function remains positive. Next, based on the Voronoi diagram of the evaluated points, a sufficient condition is stated to certify if the function is positive in the domain. Unlike prior methods, our approach certifies partial monotonicity without constrained architectures or piece-wise linear activations. Therefore, LipVor could open up the possibility of using unconstrained ANN in some critical fields. Moreover, some other properties of an ANN, such as convexity, can be posed as positivity conditions, and therefore, LipVor could also be applied.
comment: 16 pages, 4 figures
♻ ☆ Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism
We propose Collab-REC, a multi-agent framework designed to counteract popularity bias and enhance diversity in tourism recommendations. In our setting, three LLM-based agents -- Personalization, Popularity, and Sustainability generate city suggestions from complementary perspectives. A non-LLM moderator then merges and refines these proposals via multi-round negotiation, ensuring each agent's viewpoint is incorporated while penalizing spurious or repeated responses. Experiments on European city queries show that Collab-REC improves diversity and overall relevance compared to a single-agent baseline, surfacing lesser-visited locales that often remain overlooked. This balanced, context-aware approach addresses over-tourism and better aligns with constraints provided by the user, highlighting the promise of multi-stakeholder collaboration in LLM-driven recommender systems.
♻ ☆ UV-Attack: Physical-World Adversarial Attacks for Person Detection via Dynamic-NeRF-based UV Mapping ICLR2025
In recent research, adversarial attacks on person detectors using patches or static 3D model-based texture modifications have struggled with low success rates due to the flexible nature of human movement. Modeling the 3D deformations caused by various actions has been a major challenge. Fortunately, advancements in Neural Radiance Fields (NeRF) for dynamic human modeling offer new possibilities. In this paper, we introduce UV-Attack, a groundbreaking approach that achieves high success rates even with extensive and unseen human actions. We address the challenge above by leveraging dynamic-NeRF-based UV mapping. UV-Attack can generate human images across diverse actions and viewpoints, and even create novel actions by sampling from the SMPL parameter space. While dynamic NeRF models are capable of modeling human bodies, modifying clothing textures is challenging because they are embedded in neural network parameters. To tackle this, UV-Attack generates UV maps instead of RGB images and modifies the texture stacks. This approach enables real-time texture edits and makes the attack more practical. We also propose a novel Expectation over Pose Transformation loss (EoPT) to improve the evasion success rate on unseen poses and views. Our experiments show that UV-Attack achieves a 92.7% attack success rate against the FastRCNN model across varied poses in dynamic video settings, significantly outperforming the state-of-the-art AdvCamou attack, which only had a 28.5% ASR. Moreover, we achieve 49.5% ASR on the latest YOLOv8 detector in black-box settings. This work highlights the potential of dynamic NeRF-based UV mapping for creating more effective adversarial attacks on person detectors, addressing key challenges in modeling human movement and texture modification. The code is available at https://github.com/PolyLiYJ/UV-Attack.
comment: 23 pages, 22 figures, accepted by ICLR2025
♻ ☆ Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling NeurIPS 2025
Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS, simultaneously influencing (1) reasoning performance and (2) compute efficiency, due to the quality and computational cost of verification. In this work, we challenge the conventional paradigms of verification, and make the first attempt toward systematically investigating the impact of verification granularity-that is, how frequently the verifier is invoked during generation, beyond verifying only the final output or individual generation steps. To this end, we introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g. Extensive experiments with VG-Search under varying compute budgets, generator-verifier configurations, and task attributes reveal that dynamically selecting g can improve the compute efficiency and scaling behavior. Building on these findings, we propose adaptive VG-Search strategies that achieve accuracy gains of up to 3.1\% over Beam Search and 3.6\% over Best-of-N, while reducing FLOPs by over 52\%. We will open-source the code to support future research.
comment: Accepted at NeurIPS 2025
♻ ☆ More of the Same: Persistent Representational Harms Under Increased Representation NeurIPS 2025
To recognize and mitigate the harms of generative AI systems, it is crucial to consider who is represented in the outputs of generative AI systems and how people are represented. A critical gap emerges when naively improving who is represented, as this does not imply bias mitigation efforts have been applied to address how people are represented. We critically examined this by investigating gender representation in occupation across state-of-the-art large language models. We first show evidence suggesting that over time there have been interventions to models altering the resulting gender distribution, and we find that women are more represented than men when models are prompted to generate biographies or personas. We then demonstrate that representational biases persist in how different genders are represented by examining statistically significant word differences across genders. This results in a proliferation of representational harms, stereotypes, and neoliberalism ideals that, despite existing interventions to increase female representation, reinforce existing systems of oppression.
comment: Accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) as a poster paper; 39 pages, 7 figures, 15 tables
♻ ☆ VerifIoU - Robustness of Object Detection to Perturbations
We introduce a novel Interval Bound Propagation (IBP) approach for the formal verification of object detection models, specifically targeting the Intersection over Union (IoU) metric. The approach has been implemented in an open source code, named IBP IoU, compatible with popular abstract interpretation based verification tools. The resulting verifier is evaluated on landing approach runway detection and handwritten digit recognition case studies. Comparisons against a baseline (Vanilla IBP IoU) highlight the superior performance of IBP IoU in ensuring accuracy and stability, contributing to more secure and robust machine learning applications.
♻ ☆ MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks NeurIPS 2025
The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task-specific, evidence-based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi-agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open-sourced at https://medagentboard.netlify.app/.
comment: Accepted by NeurIPS 2025 Datasets & Benchmarks Track
♻ ☆ Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning ICASSP 2025
Recent studies suggest that the deeper layers of Large Language Models (LLMs) contribute little to representation learning and can often be removed without significant performance loss. However, such claims are typically drawn from narrow evaluations and may overlook important aspects of model behavior. In this work, we present a systematic study of depth utilization across diverse dimensions, including evaluation protocols, task categories, and model architectures. Our analysis confirms that very deep layers are generally less effective than earlier ones, but their contributions vary substantially with the evaluation setting. Under likelihood-based metrics without generation, pruning most layers preserves performance, with only the initial few being critical. By contrast, generation-based evaluation uncovers indispensable roles for middle and deeper layers in enabling reasoning and maintaining long-range coherence. We further find that knowledge and retrieval are concentrated in shallow components, whereas reasoning accuracy relies heavily on deeper layers -- yet can be reshaped through distillation. These results highlight that depth usage in LLMs is highly heterogeneous and context-dependent, underscoring the need for task-, metric-, and model-aware perspectives in both interpreting and compressing large models.
comment: ICASSP 2025
♻ ☆ StyleGuard: Preventing Text-to-Image-Model-based Style Mimicry Attacks by Style Perturbations NIPS2025
Recently, text-to-image diffusion models have been widely used for style mimicry and personalized customization through methods such as DreamBooth and Textual Inversion. This has raised concerns about intellectual property protection and the generation of deceptive content. Recent studies, such as Glaze and Anti-DreamBooth, have proposed using adversarial noise to protect images from these attacks. However, recent purification-based methods, such as DiffPure and Noise Upscaling, have successfully attacked these latest defenses, showing the vulnerabilities of these methods. Moreover, present methods show limited transferability across models, making them less effective against unknown text-to-image models. To address these issues, we propose a novel anti-mimicry method, StyleGuard. We propose a novel style loss that optimizes the style-related features in the latent space to make it deviate from the original image, which improves model-agnostic transferability. Additionally, to enhance the perturbation's ability to bypass diffusion-based purification, we designed a novel upscale loss that involves ensemble purifiers and upscalers during training. Extensive experiments on the WikiArt and CelebA datasets demonstrate that StyleGuard outperforms existing methods in robustness against various transformations and purifications, effectively countering style mimicry in various models. Moreover, StyleGuard is effective on different style mimicry methods, including DreamBooth and Textual Inversion. The code is available at https://github.com/PolyLiYJ/StyleGuard.
comment: Accepted by NIPS2025
♻ ☆ Toward a Public and Secure Generative AI: A Comparative Analysis of Open and Closed LLMs
Generative artificial intelligence (Gen AI) systems represent a critical technology with far-reaching implications across multiple domains of society. However, their deployment entails a range of risks and challenges that require careful evaluation. To date, there has been a lack of comprehensive, interdisciplinary studies offering a systematic comparison between open-source and proprietary (closed) generative AI systems, particularly regarding their respective advantages and drawbacks. This study aims to: i) critically evaluate and compare the characteristics, opportunities, and challenges of open and closed generative AI models; and ii) propose foundational elements for the development of an Open, Public, and Safe Gen AI framework. As a methodology, we adopted a combined approach that integrates three methods: literature review, critical analysis, and comparative analysis. The proposed framework outlines key dimensions, openness, public governance, and security, as essential pillars for shaping the future of trustworthy and inclusive Gen AI. Our findings reveal that open models offer greater transparency, auditability, and flexibility, enabling independent scrutiny and bias mitigation. In contrast, closed systems often provide better technical support and ease of implementation, but at the cost of unequal access, accountability, and ethical oversight. The research also highlights the importance of multi-stakeholder governance, environmental sustainability, and regulatory frameworks in ensuring responsible development.
♻ ☆ Monopoly Deal: A Benchmark Environment for Bounded One-Sided Response Games
Card games are widely used to study sequential decision-making under uncertainty, with real-world analogues in negotiation, finance, and cybersecurity. These games typically fall into three categories based on the flow of control: strictly sequential (players alternate single actions), deterministic response (some actions trigger a fixed outcome), and unbounded reciprocal response (alternating counterplays are permitted). A less-explored but strategically rich structure is the bounded one-sided response, where a player's action briefly transfers control to the opponent, who must satisfy a fixed condition through one or more moves before the turn resolves. We term games featuring this mechanism Bounded One-Sided Response Games (BORGs). We introduce a modified version of Monopoly Deal as a benchmark environment that isolates this dynamic, where a Rent action forces the opponent to choose payment assets. The gold-standard algorithm, Counterfactual Regret Minimization (CFR), converges on effective strategies without novel algorithmic extensions. A lightweight full-stack research platform unifies the environment, a parallelized CFR runtime, and a human-playable web interface. The trained CFR agent and source code are available at https://monopolydeal.ai.
comment: 24 pages, 7 figures
♻ ☆ Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs
Scientific Large Language Models (Sci-LLMs) have emerged as a promising frontier for accelerating biological discovery. However, these models face a fundamental challenge when processing raw biomolecular sequences: the tokenization dilemma. Whether treating sequences as a specialized language, risking the loss of functional motif information, or as a separate modality, introducing formidable alignment challenges, current strategies fundamentally limit their reasoning capacity. We challenge this sequence-centric paradigm by positing that a more effective strategy is to provide Sci-LLMs with high-level structured context derived from established bioinformatics tools, thereby bypassing the need to interpret low-level noisy sequence data directly. Through a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we tested three input modes: sequence-only, context-only, and a combination of both. Our findings are striking: the context-only approach consistently and substantially outperforms all other modes. Even more revealing, the inclusion of the raw sequence alongside its high-level context consistently degrades performance, indicating that raw sequences act as informational noise, even for models with specialized tokenization schemes. These results suggest that the primary strength of existing Sci-LLMs lies not in their nascent ability to interpret biomolecular syntax from scratch, but in their profound capacity for reasoning over structured, human-readable knowledge. Therefore, we argue for reframing Sci-LLMs not as sequence decoders, but as powerful reasoning engines over expert knowledge. This work lays the foundation for a new class of hybrid scientific AI agents, repositioning the developmental focus from direct sequence interpretation towards high-level knowledge synthesis. The code is available at https://github.com/opendatalab-raiser/CoKE.
comment: 38 pages, under review
♻ ☆ In Defence of Post-hoc Explainability NeurIPS 2024
This position paper defends post-hoc explainability methods as legitimate tools for scientific knowledge production in machine learning. Addressing criticism of these methods' reliability and epistemic status, we develop a philosophical framework grounded in mediated understanding and bounded factivity. We argue that scientific insights can emerge through structured interpretation of model behaviour without requiring complete mechanistic transparency, provided explanations acknowledge their approximative nature and undergo rigorous empirical validation. Through analysis of recent biomedical ML applications, we demonstrate how post-hoc methods, when properly integrated into scientific practice, generate novel hypotheses and advance phenomenal understanding.
comment: v1 presented at the Interpretable AI: Past, Present, and Future Workshop at NeurIPS 2024 (non-archival)
♻ ☆ On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
Open-vocabulary object detection (OVD) models offer remarkable flexibility by detecting objects from arbitrary text queries. However, their zero-shot performance in specialized domains like Remote Sensing (RS) is often compromised by the inherent ambiguity of natural language, limiting critical downstream applications. For instance, an OVD model may struggle to distinguish between fine-grained classes such as "fishing boat" and "yacht" since their embeddings are similar and often inseparable. This can hamper specific user goals, such as monitoring illegal fishing, by producing irrelevant detections. To address this, we propose a cascaded approach that couples the broad generalization of a large pre-trained OVD model with a lightweight few-shot classifier. Our method first employs the zero-shot model to generate high-recall object proposals. These proposals are then refined for high precision by a compact classifier trained in real-time on only a handful of user-annotated examples - drastically reducing the high costs of RS imagery annotation.The core of our framework is FLAME, a one-step active learning strategy that selects the most informative samples for training. FLAME identifies, on the fly, uncertain marginal candidates near the decision boundary using density estimation, followed by clustering to ensure sample diversity. This efficient sampling technique achieves high accuracy without costly full-model fine-tuning and enables instant adaptation, within less then a minute, which is significantly faster than state-of-the-art alternatives.Our method consistently surpasses state-of-the-art performance on RS benchmarks, establishing a practical and resource-efficient framework for adapting foundation models to specific user needs.
♻ ☆ Tunable-Generalization Diffusion Powered by Self-Supervised Contextual Sub-Data for Low-Dose CT Reconstruction
Current models based on deep learning for low-dose CT denoising rely heavily on paired data and generalize poorly. Even the more concerned diffusion models need to learn the distribution of clean data for reconstruction, which is difficult to satisfy in medical clinical applications. At the same time, self-supervised-based methods face the challenge of significant degradation of generalizability of models pre-trained for the current dose to expand to other doses. To address these issues, this work proposes a novel method of TUnable-geneRalizatioN Diffusion (TurnDiff) powered by self-supervised contextual sub-data for low-dose CT reconstruction. Firstly, a contextual subdata self-enhancing similarity strategy is designed for denoising centered on the LDCT projection domain, which provides an initial prior for the subsequent progress. Subsequently, the initial prior is used to combine knowledge distillation with a deep combination of latent diffusion models for optimizing image details. The pre-trained model is used for inference reconstruction, and the pixel-level self-correcting fusion technique is proposed for fine-grained reconstruction of the image domain to enhance the image fidelity, using the initial prior and the LDCT image as a guide. In addition, the technique is flexibly applied to the generalization of upper and lower doses or even unseen doses. Dual-domain strategy cascade for self-supervised LDCT denoising, TurnDiff requires only LDCT projection domain data for training and testing. Comprehensive evaluation on both benchmark datasets and real-world data demonstrates that TurnDiff consistently outperforms state-of-the-art methods in both reconstruction and generalization.
♻ ☆ Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts
In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity. For example, only about 600 transcripts are available in AphasiaBank yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when such are sparse. Therefore, this study constructs and validates two methods to generate synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method leverages a procedural programming approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The methods generate transcripts across four severity levels (Mild, Moderate, Severe, Very Severe) through word dropping, filler insertion, and paraphasia substitution. Overall, we found, compared to human-elicited transcripts, Mistral 7b Instruct best captures key aspects of linguistic degradation observed in aphasia, showing realistic directional changes in NDW, word count, and word length amongst the synthetic generation methods. Based on the results, future work should plan to create a larger dataset, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of the synthetic transcripts.
comment: 19 pages, 1 figure, 7 tables
♻ ☆ Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i)Visual Quality-semantic alignment with human posters, (ii)Textual Coherence-language fluency, (iii)Holistic Assessment-six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv)PaperQuiz-the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a)Parser distills the paper into a structured asset library; the (b)Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c)Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs-though visually appealing at first glance-often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g. based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster - all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.
comment: Project Page: https://github.com/Paper2Poster/Paper2Poster
♻ ☆ BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains
The rapid advancement of large language models(LLMs) has intensified the need for domain and culture specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain and language specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content compared to Hindi across all domains. Subdomain-level analysis shows that areas such as Cyber Law, International Finance perform relatively well, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India's diverse knowledge domains. It enables assessment of models' ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.
♻ ☆ MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning? NeurIPS'25
Large foundation models face challenges in acquiring transferable, structured thinking abilities, especially when supervised with rigid templates or crowd-annotated instruction datasets. Unlike prior approaches, we focus on a thinking-centric data synthesis paradigm that enables models to evolve through self-generated, cognitively guided data. We propose MindGYM, a structured and scalable framework for question synthesis, composed of: (1) Cognitive Thinking Process Injection, which infuses high-level reasoning objectives to shape the model's synthesis behavior; (2) Seed Single-Hop Question Synthesis, generating atomic questions from diverse semantic types to encourage broader thinking; and (3) Challenging Multi-Hop QA Synthesis, composing more complex multi-hop questions based on QA seeds for deeper reasoning. Detailed analysis shows that synthetic data generated by our method achieves 16.7% higher average quality and 67.91% lower quality variance compared to baseline sources, highlighting that both high-quality and self-contained data are essential for effective, thinking-oriented fine-tuning. MindGYM improves performance on six reasoning benchmarks, achieving gains of up to 16% on MathVision using only 400 data samples, and generalizable improvements across different model sizes and architectures. MindGYM underscores the viability of self-challenging mechanisms in refining large model capabilities while minimizing human intervention and resource demands. Code and data are released to promote data-centric research into self-evolving foundation models driven by their internal reasoning capabilities.
comment: Accepted by NeurIPS'25. 30 pages, 2 figures, 13 tables
♻ ☆ AIMeter: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads
The rapid advancement of AI, particularly large language models (LLMs), has raised significant concerns about the energy use and carbon emissions associated with model training and inference. However, existing tools for measuring and reporting such impacts are often fragmented, lacking systematic metric integration and offering limited support for correlation analysis among them. This paper presents AIMeter, a comprehensive software toolkit for the measurement, analysis, and visualization of energy use, power draw, hardware performance, and carbon emissions across AI workloads. By seamlessly integrating with existing AI frameworks, AIMeter offers standardized reports and exports fine-grained time-series data to support benchmarking and reproducibility in a lightweight manner. It further enables in-depth correlation analysis between hardware metrics and model performance and thus facilitates bottleneck identification and performance enhancement. By addressing critical limitations in existing tools, AIMeter encourages the research community to weigh environmental impact alongside raw performance of AI workloads and advances the shift toward more sustainable "Green AI" practices. The code is available at https://github.com/SusCom-Lab/AIMeter.
comment: 11 pages, 7 figures and 5 tables
♻ ☆ UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in Omni Models
Multimodal Large Languages models have been progressing from uni-modal understanding toward unifying visual, audio and language modalities, collectively termed omni models. However, the correlation between uni-modal and omni-modal remains unclear, which requires comprehensive evaluation to drive omni model's intelligence evolution. In this work, we introduce a novel, high-quality, and UNified Omni model benchmark, UNO-Bench. This benchmark is designed to effectively evaluate both UNi-modal and Omni-modal capabilities under a unified ability taxonomy, spanning 44 task types and 5 modality combinations. It includes 1250 human curated samples for omni-modal with 98% cross-modality solvability, and 2480 enhanced uni-modal samples. The human-generated dataset is well-suited to real-world scenarios, particularly within the Chinese context, whereas the automatically compressed dataset offers a 90% increase in speed and maintains 98% consistency across 18 public benchmarks. In addition to traditional multi-choice questions, we propose an innovative multi-step open-ended question format to assess complex reasoning. A general scoring model is incorporated, supporting 6 question types for automated evaluation with 95% accuracy. Experimental result shows the Compositional Law between omni-modal and uni-modal performance and the omni-modal capability manifests as a bottleneck effect on weak models, while exhibiting synergistic promotion on strong models.
comment: v3: Switch the paper template. Work in progress. Github: https://github.com/meituan-longcat/UNO-Bench Hugging Face: https://huggingface.co/datasets/meituan-longcat/UNO-Bench
♻ ☆ Reflection on Data Storytelling Tools in the Generative AI Era from the Human-AI Collaboration Perspective IEEE VIS 25
Human-AI collaborative tools attract attentions from the data storytelling community to lower the expertise barrier and streamline the workflow. The recent advance in large-scale generative AI techniques, e.g., large language models (LLMs) and text-to-image models, has the potential to enhance data storytelling with their power in visual and narration generation. After two years since these techniques were publicly available, it is important to reflect our progress of applying them and have an outlook for future opportunities. To achieve the goal, we compare the collaboration patterns of the latest tools with those of earlier ones using a dedicated framework for understanding human-AI collaboration in data storytelling. Through comparison, we identify consistently widely studied patterns, e.g., human-creator + AI-assistant, and newly explored or emerging ones, e.g., AI-creator + human-reviewer. The benefits of these AI techniques and implications to human-AI collaboration are also revealed. We further propose future directions to hopefully ignite innovations.
comment: This paper is a sequel to the CHI 24 paper "Where Are We So Far? Understanding Data Storytelling Tools from the Perspective of Human-AI Collaboration (https://doi.org/10.1145/3613904.3642726), aiming to refresh our understanding with the latest advancements. It is accepted at IEEE VIS 25
♻ ☆ Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data NeurIPS'25
Fine-tuning large language models (LLMs) using diverse datasets is crucial for enhancing their overall performance across various domains. In practical scenarios, existing methods based on modeling the mixture proportions of data composition often struggle with data whose domain labels are missing, imprecise or non-normalized, while methods based on data selection usually encounter difficulties in balancing multi-domain performance. To address these challenges, in this work, we investigate the role of data diversity in enhancing the overall abilities of LLMs by empirically constructing contrastive data pools and theoretically deriving explanations. Building upon the insights gained, we propose a new method that gives the LLM a dual identity: an output model to cognitively probe and select data based on diversity reward, as well as an input model to be tuned with the selected data. Extensive experiments show that the proposed method notably boosts performance across domain-undetermined data and a series of foundational downstream tasks when applied to various advanced LLMs. We release our code and hope this study can shed light on the understanding of data diversity and advance feedback-driven data-model co-design for LLMs.
comment: Accepted by NeurIPS'25 main track. 47 pages, 21 figures, 32 tables
♻ ☆ Model-Document Protocol for AI Search
AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.
comment: 10 pages
♻ ☆ Chaos-based reinforcement learning with TD3
Chaos-based reinforcement learning (CBRL) is a method in which the agent's internal chaotic dynamics drives exploration. However, the learning algorithms in CBRL have not been thoroughly developed in previous studies, nor have they incorporated recent advances in reinforcement learning. This study introduced Twin Delayed Deep Deterministic Policy Gradients (TD3), which is one of the state-of-the-art deep reinforcement learning algorithms that can treat deterministic and continuous action spaces, to CBRL. The validation results provide several insights. First, TD3 works as a learning algorithm for CBRL in a simple goal-reaching task. Second, CBRL agents with TD3 can autonomously suppress their exploratory behavior as learning progresses and resume exploration when the environment changes. Finally, examining the effect of the agent's chaoticity on learning shows that there exists a suitable range of chaos strength in the agent's model to flexibly switch between exploration and exploitation and adapt to environmental changes.
comment: Accepted for publication in Neural Networks
♻ ☆ Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training NeurIPS 2025
As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets-such as cosine learning rate schedules-are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the "river" structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this behavior, we conduct a theoretical and empirical analysis of SF dynamics, revealing that it implicitly performs weight averaging without memory overhead. Guided by this analysis, we propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes, addressing key limitations of the original method. Together, these results establish SF as a practical, scalable, and theoretically grounded approach for language model training.
comment: Published at NeurIPS 2025
♻ ☆ A Convexity-dependent Two-Phase Training Algorithm for Deep Neural Networks
The key task of machine learning is to minimize the loss function that measures the model fit to the training data. The numerical methods to do this efficiently depend on the properties of the loss function. The most decisive among these properties is the convexity or non-convexity of the loss function. The fact that the loss function can have, and frequently has, non-convex regions has led to a widespread commitment to non-convex methods such as Adam. However, a local minimum implies that, in some environment around it, the function is convex. In this environment, second-order minimizing methods such as the Conjugate Gradient (CG) give a guaranteed superlinear convergence. We propose a novel framework grounded in the hypothesis that loss functions in real-world tasks swap from initial non-convexity to convexity towards the optimum. This is a property we leverage to design an innovative two-phase optimization algorithm. The presented algorithm detects the swap point by observing the gradient norm dependence on the loss. In these regions, non-convex (Adam) and convex (CG) algorithms are used, respectively. Computing experiments confirm the hypothesis that this simple convexity structure is frequent enough to be practically exploited to substantially improve convergence and accuracy.
comment: Appeared on KDIR IC3K Conference 2025 (Best Paper Award). Published in "Proceedings of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1"
♻ ☆ FASL-Seg: Anatomy and Tool Segmentation of Surgical Scenes ECAI
The growing popularity of robotic minimally invasive surgeries has made deep learning-based surgical training a key area of research. A thorough understanding of the surgical scene components is crucial, which semantic segmentation models can help achieve. However, most existing work focuses on surgical tools and overlooks anatomical objects. Additionally, current state-of-the-art (SOTA) models struggle to balance capturing high-level contextual features and low-level edge features. We propose a Feature-Adaptive Spatial Localization model (FASL-Seg), designed to capture features at multiple levels of detail through two distinct processing streams, namely a Low-Level Feature Projection (LLFP) and a High-Level Feature Projection (HLFP) stream, for varying feature resolutions - enabling precise segmentation of anatomy and surgical instruments. We evaluated FASL-Seg on surgical segmentation benchmark datasets EndoVis18 and EndoVis17 on three use cases. The FASL-Seg model achieves a mean Intersection over Union (mIoU) of 72.71% on parts and anatomy segmentation in EndoVis18, improving on SOTA by 5%. It further achieves a mIoU of 85.61% and 72.78% in EndoVis18 and EndoVis17 tool type segmentation, respectively, outperforming SOTA overall performance, with comparable per-class SOTA results in both datasets and consistent performance in various classes for anatomy and instruments, demonstrating the effectiveness of distinct processing streams for varying feature resolutions.
comment: 8 pages, 6 figures, In Proceedings of European Conference on Artificial Intelligence (ECAI) 2025
♻ ☆ Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, a first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference. (2) Spatial-Aware Prompt (SAP) incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset Omni-VFX via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.
♻ ☆ SPARKE: Scalable Prompt-Aware Diversity and Novelty Guidance in Diffusion Models via RKE Score
Diffusion models have demonstrated remarkable success in high-fidelity image synthesis and prompt-guided generative modeling. However, ensuring adequate diversity in generated samples of prompt-guided diffusion models remains a challenge, particularly when the prompts span a broad semantic spectrum and the diversity of generated data needs to be evaluated in a prompt-aware fashion across semantically similar prompts. Recent methods have introduced guidance via diversity measures to encourage more varied generations. In this work, we extend the diversity measure-based approaches by proposing the Scalable Prompt-Aware R\'eny Kernel Entropy Diversity Guidance (SPARKE) method for prompt-aware diversity guidance. SPARKE utilizes conditional entropy for diversity guidance, which dynamically conditions diversity measurement on similar prompts and enables prompt-aware diversity control. While the entropy-based guidance approach enhances prompt-aware diversity, its reliance on the matrix-based entropy scores poses computational challenges in large-scale generation settings. To address this, we focus on the special case of Conditional latent RKE Score Guidance, reducing entropy computation and gradient-based optimization complexity from the $O(n^3)$ of general entropy measures to $O(n)$. The reduced computational complexity allows for diversity-guided sampling over potentially thousands of generation rounds on different prompts. We numerically test the SPARKE method on several text-to-image diffusion models, demonstrating that the proposed method improves the prompt-aware diversity of the generated data without incurring significant computational costs. We release our code on the project page: https://mjalali.github.io/SPARKE
Efficient Regression-Based Training of Normalizing Flows for Boltzmann Generators ICML
Simulation-free training frameworks have been at the forefront of the generative modelling revolution in continuous spaces, leading to large-scale diffusion and flow matching models. However, such modern generative models suffer from expensive inference, inhibiting their use in numerous scientific applications like Boltzmann Generators (BGs) for molecular conformations that require fast likelihood evaluation. In this paper, we revisit classical normalizing flows in the context of BGs that offer efficient sampling and likelihoods, but whose training via maximum likelihood is often unstable and computationally challenging. We propose Regression Training of Normalizing Flows (RegFlow), a novel and scalable regression-based training objective that bypasses the numerical instability and computational challenge of conventional maximum likelihood training in favour of a simple $\ell_2$-regression objective. Specifically, RegFlow maps prior samples under our flow to targets computed using optimal transport couplings or a pre-trained continuous normalizing flow (CNF). To enhance numerical stability, RegFlow employs effective regularization strategies such as a new forward-backward self-consistency loss that enjoys painless implementation. Empirically, we demonstrate that RegFlow unlocks a broader class of architectures that were previously intractable to train for BGs with maximum likelihood. We also show RegFlow exceeds the performance, computational cost, and stability of maximum likelihood training in equilibrium sampling in Cartesian coordinates of alanine dipeptide, tripeptide, and tetrapeptide, showcasing its potential in molecular systems.
comment: Preprint; ICML GenBio Best Paper Award 2025
♻ ☆ Nek Minit: Harnessing Pragmatic Metacognitive Prompting for Explainable Sarcasm Detection of Australian and Indian English ALT
Sarcasm is a challenge to sentiment analysis because of the incongruity between stated and implied sentiment. The challenge is exacerbated when the implication may be relevant to a specific country or geographical region. Pragmatic metacognitive prompting (PMP) is a cognition-inspired technique that has been used for pragmatic reasoning. In this paper, we harness PMP for explainable sarcasm detection for Australian and Indian English, alongside a benchmark dataset for standard English. We manually add sarcasm explanations to an existing sarcasm-labeled dataset for Australian and Indian English called BESSTIE, and compare the performance for explainable sarcasm detection for them with FLUTE, a standard English dataset containing sarcasm explanations. Our approach utilising PMP when evaluated on two open-weight LLMs (GEMMA and LLAMA) achieves statistically significant performance improvement across all tasks and datasets when compared with four alternative prompting strategies. We also find that alternative techniques such as agentic prompting mitigate context-related failures by enabling external knowledge retrieval. The focused contribution of our work is utilising PMP in generating sarcasm explanations for varieties of English.
comment: ALTA 2025 (Best Paper Honorable Mention). Camera-ready
♻ ☆ Learning to Insert for Constructive Neural Vehicle Routing Solver NeurIPS 2025
Neural Combinatorial Optimisation (NCO) is a promising learning-based approach for solving Vehicle Routing Problems (VRPs) without extensive manual design. While existing constructive NCO methods typically follow an appending-based paradigm that sequentially adds unvisited nodes to partial solutions, this rigid approach often leads to suboptimal results. To overcome this limitation, we explore the idea of insertion-based paradigm and propose Learning to Construct with Insertion-based Paradigm (L2C-Insert), a novel learning-based method for constructive NCO. Unlike traditional approaches, L2C-Insert builds solutions by strategically inserting unvisited nodes at any valid position in the current partial solution, which can significantly enhance the flexibility and solution quality. The proposed framework introduces three key components: a novel model architecture for precise insertion position prediction, an efficient training scheme for model optimization, and an advanced inference technique that fully exploits the insertion paradigm's flexibility. Extensive experiments on both synthetic and real-world instances of the Travelling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) demonstrate that L2C-Insert consistently achieves superior performance across various problem sizes.
comment: Accepted at NeurIPS 2025
Self-Evolving Curriculum for LLM Reasoning
Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains: planning, inductive reasoning, and mathematics, our experiments demonstrate that SEC significantly improves models' reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.
♻ ☆ SAFE: Multitask Failure Detection for Vision-Language-Action Models NeurIPS 2025
While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out of the box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while generalist VLAs require the detector to generalize and detect failures also in unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We test it on OpenVLA, $\pi_0$, and $\pi_0$-FAST in both simulated and real-world environments extensively. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction. More qualitative results and code can be found at the project webpage: https://vla-safe.github.io/
comment: NeurIPS 2025 camera ready. Project Page: https://vla-safe.github.io/
♻ ☆ FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs EMNLP
The accurate trust assessment of multimodal large language models (MLLMs) generated predictions, which can enable selective prediction and improve user confidence, is challenging due to the diverse multi-modal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs, that generates an uncertainty measure based on the equivalent and complementary input samplings. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA uses only input-output access of the model (black-box), and does not require ground truth (unsupervised). The experiments are conducted with various off-the-shelf multi-modal LLMs, on both visual and audio reasoning tasks. The proposed FESTA uncertainty estimate achieves significant improvement (33.3% relative improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) in selective prediction performance, based on area-under-receiver-operating-characteristic curve (AUROC) metric in detecting mispredictions. The code implementation is open-sourced.
comment: Accepted in the Findings of EMNLP, 2025
♻ ☆ Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.
♻ ☆ The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs
With the rapid advancement of artificial intelligence, Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP), including content generation, human-computer interaction, machine translation, and code generation. However, their widespread deployment has also raised significant safety concerns. In particular, LLM-generated content can exhibit unsafe behaviors such as toxicity, bias, or misinformation, especially in adversarial contexts, which has attracted increasing attention from both academia and industry. Although numerous studies have attempted to evaluate these risks, a comprehensive and systematic survey on safety evaluation of LLMs is still lacking. This work aims to fill this gap by presenting a structured overview of recent advances in safety evaluation of LLMs. Specifically, we propose a four-dimensional taxonomy: (i) Why to evaluate, which explores the background of safety evaluation of LLMs, how they differ from general LLMs evaluation, and the significance of such evaluation; (ii) What to evaluate, which examines and categorizes existing safety evaluation tasks based on key capabilities, including dimensions such as toxicity, robustness, ethics, bias and fairness, truthfulness, and related aspects; (iii) Where to evaluate, which summarizes the evaluation metrics, datasets and benchmarks currently used in safety evaluations; (iv) How to evaluate, which reviews existing mainstream evaluation methods based on the roles of the evaluators and some evaluation frameworks that integrate the entire evaluation pipeline. Finally, we identify the challenges in safety evaluation of LLMs and propose promising research directions to promote further advancement in this field. We emphasize the necessity of prioritizing safety evaluation to ensure the reliable and responsible deployment of LLMs in real-world applications.
comment: 20 pages, preprint
♻ ☆ Towards Predicting Any Human Trajectory In Context NeurIPS 2025
Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, the need to fine-tune for each new scenario is often impractical for deployment on edge devices. To address this challenge, we introduce \paper, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables adaptation without fine-tuning on the scenario-specific data at inference time without requiring weight updates. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. Project Page: https://fujiry0.github.io/TrajICL-project-page/.
comment: NeurIPS 2025
♻ ☆ SPLite Hand: Sparsity-Aware Lightweight 3D Hand Pose Estimation
With the increasing ubiquity of AR/VR devices, the deployment of deep learning models on edge devices has become a critical challenge. These devices require real-time inference, low power consumption, and minimal latency. Many framework designers face the conundrum of balancing efficiency and performance. We design a light framework that adopts an encoder-decoder architecture and introduces several key contributions aimed at improving both efficiency and accuracy. We apply sparse convolution on a ResNet-18 backbone to exploit the inherent sparsity in hand pose images, achieving a 42% end-to-end efficiency improvement. Moreover, we propose our SPLite decoder. This new architecture significantly boosts the decoding process's frame rate by 3.1x on the Raspberry Pi 5, while maintaining accuracy on par. To further optimize performance, we apply quantization-aware training, reducing memory usage while preserving accuracy (PA-MPJPE increases only marginally from 9.0 mm to 9.1 mm on FreiHAND). Overall, our system achieves a 2.98x speed-up on a Raspberry Pi 5 CPU (BCM2712 quad-core Arm A76 processor). Our method is also evaluated on compound benchmark datasets, demonstrating comparable accuracy to state-of-the-art approaches while significantly enhancing computational efficiency.
comment: Accepted to AICCC 2025
♻ ☆ LAFA: Agentic LLM-Driven Federated Analytics over Decentralized Data Sources
Large Language Models (LLMs) have shown great promise in automating data analytics tasks by interpreting natural language queries and generating multi-operation execution plans. However, existing LLM-agent-based analytics frameworks operate under the assumption of centralized data access, offering little to no privacy protection. In contrast, federated analytics (FA) enables privacy-preserving computation across distributed data sources, but lacks support for natural language input and requires structured, machine-readable queries. In this work, we present LAFA, the first system that integrates LLM-agent-based data analytics with FA. LAFA introduces a hierarchical multi-agent architecture that accepts natural language queries and transforms them into optimized, executable FA workflows. A coarse-grained planner first decomposes complex queries into sub-queries, while a fine-grained planner maps each subquery into a Directed Acyclic Graph of FA operations using prior structural knowledge. To improve execution efficiency, an optimizer agent rewrites and merges multiple DAGs, eliminating redundant operations and minimizing computational and communicational overhead. Our experiments demonstrate that LAFA consistently outperforms baseline prompting strategies by achieving higher execution plan success rates and reducing resource-intensive FA operations by a substantial margin. This work establishes a practical foundation for privacy-preserving, LLM-driven analytics that supports natural language input in the FA setting.
comment: This paper has been accepted by the 16th IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2025)
♻ ☆ Multi-Agent Evolve: LLM Self-Improve through Co-evolution
Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs). However, the success of RL for LLMs heavily relies on human-curated datasets and verifiable rewards, which limit their scalability and generality. Recent Self-Play RL methods, inspired by the success of the paradigm in games and Go, aim to enhance LLM reasoning capabilities without human-annotated data. However, their methods primarily depend on a grounded environment for feedback (e.g., a Python interpreter or a game engine); extending them to general domains remains challenging. To address these challenges, we propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A. The core design of MAE is based on a triplet of interacting agents (Proposer, Solver, Judge) that are instantiated from a single LLM, and applies reinforcement learning to optimize their behaviors. The Proposer generates questions, the Solver attempts solutions, and the Judge evaluates both while co-evolving. Experiments on Qwen2.5-3B-Instruct demonstrate that MAE achieves an average improvement of 4.54% on multiple benchmarks. These results highlight MAE as a scalable, data-efficient method for enhancing the general reasoning abilities of LLMs with minimal reliance on human-curated supervision.
comment: 29 pages, 4 figures
♻ ☆ Speak & Spell: LLM-Driven Controllable Phonetic Error Augmentation for Robust Dialogue State Tracking AACL
Dialogue State Tracking (DST) is a key part of task-oriented dialogue systems, identifying important information in conversations. However, its accuracy drops significantly in spoken dialogue environments due to named entity errors from Automatic Speech Recognition (ASR) systems. We introduce a simple yet effective data augmentation method that targets those entities to improve the robustness of DST model. Our novel method can control the placement of errors using keyword-highlighted prompts while introducing phonetically similar errors. As a result, our method generated sufficient error patterns on keywords, leading to improved accuracy in noised and low-accuracy ASR environments.
comment: Accepted to AACL-IJCNLP 2025
♻ ☆ TERAG: Token-Efficient Graph-Based Retrieval-Augmented Generation ICML
Graph-based Retrieval-augmented generation (RAG) has become a widely studied approach for improving the reasoning, accuracy, and factuality of Large Language Models (LLMs). However, many existing graph-based RAG systems overlook the high cost associated with LLM token usage during graph construction, hindering large-scale adoption. To address this, we propose TERAG, a simple yet effective framework designed to build informative graphs at a significantly lower cost. Inspired by HippoRAG, we incorporate Personalized PageRank (PPR) during the retrieval phase, and we achieve at least 80% of the accuracy of widely used graph-based RAG methods while consuming only 3%-11% of the output tokens. With its low token footprint and efficient construction pipeline, TERAG is well-suited for large-scale and cost-sensitive deployment scenarios.
comment: 16 pages, 3 figures, 4 tables. Accepted by the 2026 18th International Conference on Machine Learning and Computing (ICMLC 2026)
♻ ☆ Human-assisted Robotic Policy Refinement via Action Preference Optimization NeurIPS 2025
Establishing a reliable and iteratively refined robotic system is essential for deploying real-world applications. While Vision-Language-Action (VLA) models are widely recognized as the foundation model for such robotic deployment, their reliance on offline expert demonstrations critically limits their capacity for post-deployment refinement. To mitigate this limitation, we introduce Action Preference Optimization (APO), a method designed to refine VLA models by human-assisted preference alignment gathered through interaction with environments. This method begins with a human-robot collaboration framework for reliable failure correction and interaction trajectory collection through human intervention. However, directly leveraging these interaction trajectories for preference optimization is non-trivial due to the challenges of irreversible robotic actions and token distribution mismatch. To solve this, APO proposes an adaptive reweighting algorithm with binary desirability signals derived from interaction, empowering VLA models effectively suppress failure-prone actions while enhancing corrective action adaptation. Ultimately, APO equips VLA models with the crucial capability to learn from failure, paving the way for their iterative refinement and reliable deployment in dynamic environments. The experiments conducted in simulation and real-world scenarios prove superior generalization and robustness of our human-assisted framework across a variety of manipulation tasks. We believe this work could bring insights for efficient and stable optimization of VLA models through human-robot collaboration. The code and dataset are released at https://github.com/GeWu-Lab/Action-Preference-Optimization
comment: Accepted By NeurIPS 2025
♻ ☆ Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment
We propose V-Droid, a mobile GUI task automation agent. Unlike previous mobile agents that utilize Large Language Models (LLMs) as generators to directly generate actions at each step, V-Droid employs LLMs as verifiers to evaluate candidate actions before making final decisions. To realize this novel paradigm, we introduce a comprehensive framework for constructing verifier-driven mobile agents: the discretized action space construction coupled with the prefilling-only workflow to accelerate the verification process, the pair-wise progress preference training to significantly enhance the verifier's decision-making capabilities, and the scalable human-agent joint annotation scheme to efficiently collect the necessary data at scale. V-Droid obtains a substantial task success rate across several public mobile task automation benchmarks: 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents by 5.2%, 2.1%, and 9%, respectively. Furthermore, V-Droid achieves a remarkably low latency of 4.3s per step, which is 6.1x faster compared with existing mobile agents. The source code is available at https://github.com/V-Droid-Agent/V-Droid.
comment: add baselines, add source code link, update emails
♻ ☆ Holographic Transformers for Complex-Valued Signal Processing: Integrating Phase Interference into Self-Attention
Complex-valued signals encode both amplitude and phase, yet most deep models treat attention as real-valued correlation, overlooking interference effects. We introduce the Holographic Transformer, a physics-inspired architecture that incorporates wave interference principles into self-attention. Holographic attention modulates interactions by relative phase and coherently superimposes values, ensuring consistency between amplitude and phase. A dual-headed decoder simultaneously reconstructs the input and predicts task outputs, preventing phase collapse when losses prioritize magnitude over phase. We demonstrate that holographic attention implements a discrete interference operator and maintains phase consistency under linear mixing. Experiments on PolSAR image classification and wireless channel prediction show strong performance, achieving high classification accuracy and F1 scores, low regression error, and increased robustness to phase perturbations. These results highlight that enforcing physical consistency in attention leads to generalizable improvements in complex-valued learning and provides a unified, physics-based framework for coherent signal modeling. The code is available at https://github.com/EonHao/Holographic-Transformers.
♻ ☆ Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both the pass@1 and pass@k . Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.
♻ ☆ Cycle Diffusion Model for Counterfactual Image Generation
Deep generative models have demonstrated remarkable success in medical image synthesis. However, ensuring conditioning faithfulness and high-quality synthetic images for direct or counterfactual generation remains a challenge. In this work, we introduce a cycle training framework to fine-tune diffusion models for improved conditioning adherence and enhanced synthetic image realism. Our approach, Cycle Diffusion Model (CDM), enforces consistency between generated and original images by incorporating cycle constraints, enabling more reliable direct and counterfactual generation. Experiments on a combined 3D brain MRI dataset (from ABCD, HCP aging & young adults, ADNI, and PPMI) show that our method improves conditioning accuracy and enhances image quality as measured by FID and SSIM. The results suggest that the cycle strategy used in CDM can be an effective method for refining diffusion-based medical image generation, with applications in data augmentation, counterfactual, and disease progression modeling.
♻ ☆ Empowering Agentic Video Analytics Systems with Video Language Models
AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively-significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%. The source code of AVA is available at https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at https://huggingface.co/datasets/iesc/Ava-100.
comment: Accepted to NDSI 2026, 19pages, 12 figures, complementary evaluations and appendix
♻ ☆ Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models ACL
Large Language Models (LLMs), especially those accessed via APIs, have demonstrated impressive capabilities across various domains. However, users without technical expertise often turn to (untrustworthy) third-party services, such as prompt engineering, to enhance their LLM experience, creating vulnerabilities to adversarial threats like backdoor attacks. Backdoor-compromised LLMs generate malicious outputs to users when inputs contain specific "triggers" set by attackers. Traditional defense strategies, originally designed for small-scale models, are impractical for API-accessible LLMs due to limited model access, high computational costs, and data requirements. To address these limitations, we propose Chain-of-Scrutiny (CoS) which leverages LLMs' unique reasoning abilities to mitigate backdoor attacks. It guides the LLM to generate reasoning steps for a given input and scrutinizes for consistency with the final output -- any inconsistencies indicating a potential attack. It is well-suited for the popular API-only LLM deployments, enabling detection at minimal cost and with little data. User-friendly and driven by natural language, it allows non-experts to perform the defense independently while maintaining transparency. We validate the effectiveness of CoS through extensive experiments on various tasks and LLMs, with results showing greater benefits for more powerful LLMs.
comment: This paper has been accepted to ACL Findings 2025
♻ ☆ Neural Networks for Learnable and Scalable Influence Estimation of Instruction Fine-Tuning Data
Influence functions provide crucial insights into model training, but existing methods suffer from large computational costs and limited generalization. Particularly, recent works have proposed various metrics and algorithms to calculate the influence of data using language models, which do not scale well with large models and datasets. This is because of the expensive forward and backward passes required for computation, substantial memory requirements to store large models, and poor generalization of influence estimates to new data. In this paper, we explore the use of small neural networks -- which we refer to as the InfluenceNetwork -- to estimate influence values, achieving up to 99% cost reduction. Our evaluation demonstrates that influence values can be estimated with models just 0.0027% the size of full language models (we use 7B and 8B versions). We apply our algorithm of estimating influence values (called NN-CIFT: Neural Networks for effiCient Instruction Fine-Tuning) to the downstream task of subset selection for general instruction fine-tuning. In our study, we include four state-of-the-art influence functions and show no compromise in performance, despite large speedups, between NN-CIFT and the original influence functions. We provide an in-depth hyperparameter analyses of NN-CIFT. The code for our method can be found here: https://github.com/agarwalishika/NN-CIFT.
♻ ☆ MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
comment: Code available at: https://github.com/HKUST-MINSys-Lab/MMEdge. Accepted by SenSys 2026
♻ ☆ Large Language Models Report Subjective Experience Under Self-Referential Processing
Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded. While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.
♻ ☆ Let LRMs Break Free from Overthinking via Self-Braking Tuning NeurIPS 2025
Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have significantly enhanced their reasoning capabilities by generating longer chains of thought, demonstrating outstanding performance across a variety of tasks. However, this performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process, leading to high computational overhead and exacerbating the issue of overthinking. Although numerous existing approaches aim to address the problem of overthinking, they often rely on external interventions. In this paper, we propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process, thus eliminating the reliance on external control mechanisms. We construct a set of overthinking identification metrics based on standard answers and design a systematic method to detect redundant reasoning. This method accurately identifies unnecessary steps within the reasoning trajectory and generates training signals for learning self-regulation behaviors. Building on this foundation, we develop a complete strategy for constructing data with adaptive reasoning lengths and introduce an innovative braking prompt mechanism that enables the model to naturally learn when to terminate reasoning at an appropriate point. Experiments across mathematical benchmarks (AIME, AMC, MATH500, GSM8K) demonstrate that our method reduces token consumption by up to 60% while maintaining comparable accuracy to unconstrained models.
comment: Accepted to NeurIPS 2025; Camera ready version, 10 pages. Github:https://github.com/ZJU-REAL/Self-Braking-Tuning Project Page: https://ZJU-REAL.github.io/SBT
♻ ☆ DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving
Speculative decoding accelerates large language model inference, but its reliance on a fixed speculation length is suboptimal in large-batch serving environments with diverse requests. This paper explores a new direction for dynamic adaptation by investigating a novel class of post-hoc, diagnostic signals. We propose Dynamic Speculative Decoding Engine (DSDE), a training-free framework built on two primary components: (1) a predictive signal based on the variance of the Kullback-Leibler (KLD) divergence, which diagnoses the generation's regional stability, and (2) an adaptive speculation length cap to mitigate the straggler problem in per-sequence decoding. Experiments demonstrate the potential of using KLD-based stability signals for dynamic adaptation. An algorithm guided by these signals achieves end-to-end latency competitive with leading baselines and exhibits superior robustness across diverse workloads. This robustness is particularly valuable in challenging low-acceptance-rate regimes, where the proposed signal maintains its diagnostic utility. Collectively, these findings validate post-hoc signals as a valuable component for building more robust and intelligent LLM inference systems, and highlight a promising direction for future research on dynamic speculation length adaptation.
comment: Accepted for presentation at the IEEE BigData 2025 Workshop (Special Session on Intelligent Data Mining). This v2 updates formatting and adds IEEE copyright notice
♻ ☆ Embracing Contradiction: Theoretical Inconsistency Will Not Impede the Road of Building Responsible AI Systems
This position paper argues that the theoretical inconsistency often observed among Responsible AI (RAI) metrics, such as differing fairness definitions or tradeoffs between accuracy and privacy, should be embraced as a valuable feature rather than a flaw to be eliminated. We contend that navigating these inconsistencies, by treating metrics as divergent objectives, yields three key benefits: (1) Normative Pluralism: Maintaining a full suite of potentially contradictory metrics ensures that the diverse moral stances and stakeholder values inherent in RAI are adequately represented. (2) Epistemological Completeness: The use of multiple, sometimes conflicting, metrics allows for a more comprehensive capture of multifaceted ethical concepts, thereby preserving greater informational fidelity about these concepts than any single, simplified definition. (3) Implicit Regularization: Jointly optimizing for theoretically conflicting objectives discourages overfitting to one specific metric, steering models towards solutions with enhanced generalization and robustness under real-world complexities. In contrast, efforts to enforce theoretical consistency by simplifying or pruning metrics risk narrowing this value diversity, losing conceptual depth, and degrading model performance. We therefore advocate for a shift in RAI theory and practice: from getting trapped in inconsistency to characterizing acceptable inconsistency thresholds and elucidating the mechanisms that permit robust, approximated consistency in practice.
comment: 14 pages,2 figure
♻ ☆ Revealing Multimodal Causality with Large Language Models NeurIPS 2025
Uncovering cause-and-effect mechanisms from data is fundamental to scientific progress. While large language models (LLMs) show promise for enhancing causal discovery (CD) from unstructured data, their application to the increasingly prevalent multimodal setting remains a critical challenge. Even with the advent of multimodal LLMs (MLLMs), their efficacy in multimodal CD is hindered by two primary limitations: (1) difficulty in exploring intra- and inter-modal interactions for comprehensive causal variable identification; and (2) insufficiency to handle structural ambiguities with purely observational data. To address these challenges, we propose MLLM-CD, a novel framework for multimodal causal discovery from unstructured data. It consists of three key components: (1) a novel contrastive factor discovery module to identify genuine multimodal factors based on the interactions explored from contrastive sample pairs; (2) a statistical causal structure discovery module to infer causal relationships among discovered factors; and (3) an iterative multimodal counterfactual reasoning module to refine the discovery outcomes iteratively by incorporating the world knowledge and reasoning capabilities of MLLMs. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed MLLM-CD in revealing genuine factors and causal relationships among them from multimodal unstructured data.
comment: Accepted at NeurIPS 2025
♻ ☆ Language Model Preference Evaluation with Multiple Weak Evaluators
Despite the remarkable success of Large Language Models (LLMs), evaluating their outputs' quality regarding preference remains a critical challenge. While existing works usually leverage a strong LLM as the judge for comparing LLMs' response pairwisely, such a single-evaluator approach is vulnerable to cyclic preference, i.e., output A is better than B, B than C, but C is better than A, causing contradictory evaluation results. To address this, we introduce PGED (Preference Graph Ensemble and Denoise), a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs for acyclic, non-contradictory evaluation results. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure. Extensive experiments on ten benchmarks demonstrate PGED 's superiority in three applications: 1) model ranking for evaluation, 2) response selection for test-time scaling, and 3) data selection for model fine-tuning. Notably, PGED combines small LLM evaluators (e.g., Llama3-8B, Mistral-7B, Qwen2-7B) to outperform strong ones (e.g., Qwen2-72B), showcasing its effectiveness in enhancing evaluation reliability and improving model performance.
♻ ☆ VerifIoU -- Robustness of Object Detection to Perturbations SC
We introduce a novel Interval Bound Propagation (IBP) approach for the formal verification of object detection models, specifically targeting the Intersection over Union (IoU) metric. The approach has been implemented in an open source code, named IBP IoU, compatible with popular abstract interpretation based verification tools. The resulting verifier is evaluated on landing approach runway detection and handwritten digit recognition case studies. Comparisons against a baseline (Vanilla IBP IoU) highlight the superior performance of IBP IoU in ensuring accuracy and stability, contributing to more secure and robust machine learning applications.
comment: 44th Digital Avionics Systems Conference (DASC), Sep 2025, Montreal, Canada
♻ ☆ Representative Social Choice: From Learning Theory to AI Alignment NeurIPS 2024
Social choice theory is the study of preference aggregation across a population, used both in mechanism design for human agents and in the democratic alignment of language models. In this study, we propose the representative social choice framework for the modeling of democratic representation in collective decisions, where the number of issues and individuals are too large for mechanisms to consider all preferences directly. These scenarios are widespread in real-world decision-making processes, such as jury trials, legislation, corporate governance, and, more recently, language model alignment. In representative social choice, the population is represented by a finite sample of individual-issue pairs based on which social choice decisions are made. We show that many of the deepest questions in representative social choice can be formulated as statistical learning problems, and prove the generalization properties of social choice mechanisms using the theory of machine learning. We further formulate axioms for representative social choice, and prove Arrow-like impossibility theorems with new combinatorial tools of analysis. Our framework introduces the representative approach to social choice, opening up research directions at the intersection of social choice, learning theory, and AI alignment.
comment: Journal of Artificial Intelligence Research, in press. Best Paper at NeurIPS 2024 Pluralistic Alignment Workshop
♻ ☆ SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding
Multimodal large language models (MLLMs) have achieved remarkable progress across a range of vision-language tasks and demonstrate strong potential for traffic accident understanding. However, existing MLLMs in this domain primarily focus on coarse-grained image-level or video-level comprehension and often struggle to handle fine-grained visual details or localized scene components, limiting their applicability in complex accident scenarios. To address these limitations, we propose SafePLUG, a novel framework that empowers MLLMs with both Pixel-Level Understanding and temporal Grounding for comprehensive traffic accident analysis. SafePLUG supports both arbitrary-shaped visual prompts for region-aware question answering and pixel-level segmentation based on language instructions, while also enabling the recognition of temporally anchored events in traffic accident scenarios. To advance the development of MLLMs for traffic accident understanding, we curate a new dataset containing multimodal question-answer pairs centered on diverse accident scenarios, with detailed pixel-level annotations and temporal event boundaries. Experimental results show that SafePLUG achieves strong performance on multiple tasks, including region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding. These capabilities lay a foundation for fine-grained understanding of complex traffic scenes, with the potential to improve driving safety and enhance situational awareness in smart transportation systems. The code, dataset, and model checkpoints will be made publicly available at: https://zihaosheng.github.io/SafePLUG
comment: The code, dataset, and model checkpoints will be made publicly available at: https://zihaosheng.github.io/SafePLUG
♻ ☆ Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging
The "alignment tax" of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident, less reliable, and model outputs less diverse. We show that this trade-off can be navigated effectively via a simple post-hoc intervention: interpolating between a model's weights before and after alignment. Crucially, this is not a strict trade-off. We find that the process consistently reveals Pareto-optimal interpolations - models that improve accuracy beyond both parents while substantially recovering the calibration lost during alignment. Our work demonstrates that simple model merging provides a computationally efficient method for mitigating the full scope of the alignment tax, yielding models that are more capable and more reliable.
♻ ☆ Why Isn't Relational Learning Taking Over the World? AAAI-2026
Artificial intelligence seems to be taking over the world with systems that model pixels, words, and phonemes. The world is arguably made up, not of pixels, words, and phonemes but of entities (objects, things, including events) with properties and relations among them. Surely we should model these, not the perception or description of them. You might suspect that concentrating on modeling words and pixels is because all of the (valuable) data in the world is in terms of text and images. If you look into almost any company you will find their most valuable data is in spreadsheets, databases and other relational formats. These are not the form that are studied in introductory machine learning, but are full of product numbers, student numbers, transaction numbers and other identifiers that can't be interpreted naively as numbers. The field that studies this sort of data has various names including relational learning, statistical relational AI, and many others. This paper explains why relational learning is not taking over the world -- except in a few cases with restricted relations -- and what needs to be done to bring it to it's rightful prominence.
comment: 10 pages (6 pages + references + appendices). To appear AAAI-2026
♻ ☆ Robust Offline Reinforcement Learning with Linearly Structured f-Divergence Regularization ICML 2025
The Robust Regularized Markov Decision Process (RRMDP) is proposed to learn policies robust to dynamics shifts by adding regularization to the transition dynamics in the value function. Existing methods mostly use unstructured regularization, potentially leading to conservative policies under unrealistic transitions. To address this limitation, we propose a novel framework, the $d$-rectangular linear RRMDP ($d$-RRMDP), which introduces latent structures into both transition kernels and regularization. We focus on offline reinforcement learning, where an agent learns policies from a precollected dataset in the nominal environment. We develop the Robust Regularized Pessimistic Value Iteration (R2PVI) algorithm that employs linear function approximation for robust policy learning in $d$-RRMDPs with $f$-divergence based regularization terms on transition kernels. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, demonstrating that these bounds are influenced by how well the dataset covers state-action spaces visited by the optimal robust policy under robustly admissible transitions. We establish information-theoretic lower bounds to verify that our algorithm is near-optimal. Finally, numerical experiments validate that R2PVI learns robust policies and exhibits superior computational efficiency compared to baseline methods.
comment: 41 pages, 3 figures, 2 tables. Published in Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)
♻ ☆ A Systematic Literature Review of Spatio-Temporal Graph Neural Network Models for Time Series Forecasting and Classification
In recent years, spatio-temporal graph neural networks (GNNs) have attracted considerable interest in the field of time series analysis, due to their ability to capture, at once, dependencies among variables and across time points. The objective of this systematic literature review is hence to provide a comprehensive overview of the various modeling approaches and application domains of GNNs for time series classification and forecasting. A database search was conducted, and 366 papers were selected for a detailed examination of the current state-of-the-art in the field. This examination is intended to offer to the reader a comprehensive review of proposed models, links to related source code, available datasets, benchmark models, and fitting results. All this information is hoped to assist researchers in their studies. To the best of our knowledge, this is the first and broadest systematic literature review presenting a detailed comparison of results from current spatio-temporal GNN models applied to different domains. In its final part, this review discusses current limitations and challenges in the application of spatio-temporal GNNs, such as comparability, reproducibility, explainability, poor information capacity, and scalability. This paper is complemented by a GitHub repository at https://github.com/FlaGer99/SLR-Spatio-Temporal-GNN.git providing additional interactive tools to further explore the presented findings.
comment: Complementary GitHub repository at https://github.com/FlaGer99/SLR-Spatio-Temporal-GNN.git
♻ ☆ Accelerating Diffusion LLMs via Adaptive Parallel Decoding
The generation speed of LLMs are bottlenecked by autoregressive decoding, where tokens are predicted sequentially one by one. Alternatively, diffusion large language models (dLLMs) theoretically allow for parallel token generation, but in practice struggle to achieve the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method puts forward three tunable parameters to flexibly tradeoff throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
comment: 10 pages, 5 figures
♻ ☆ Building Trustworthy AI by Addressing its 16+2 Desiderata with Goal-Directed Commonsense Reasoning
Current advances in AI and its applicability have highlighted the need to ensure its trustworthiness for legal, ethical, and even commercial reasons. Sub-symbolic machine learning algorithms, such as the LLMs, simulate reasoning but hallucinate and their decisions cannot be explained or audited (crucial aspects for trustworthiness). On the other hand, rule-based reasoners, such as Cyc, are able to provide the chain of reasoning steps but are complex and use a large number of reasoners. We propose a middle ground using s(CASP), a goal-directed constraint-based answer set programming reasoner that employs a small number of mechanisms to emulate reliable and explainable human-style commonsense reasoning. In this paper, we explain how s(CASP) supports the 16 desiderata for trustworthy AI introduced by Doug Lenat and Gary Marcus (2023), and two additional ones: inconsistency detection and the assumption of alternative worlds. To illustrate the feasibility and synergies of s(CASP), we present a range of diverse applications, including a conversational chatbot and a virtually embodied reasoner.
♻ ☆ Scalable Best-of-N Selection for Large Language Models via Self-Certainty NeurIPS 2025
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, like self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size N, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities. The code is available at https://github.com/backprop07/Self-Certainty
comment: NeurIPS 2025
♻ ☆ (How) Do Language Models Track State?
Transformer language models (LMs) exhibit behaviors -- from storytelling to code generation -- that seem to require tracking the unobserved state of an evolving world. How do they do this? We study state tracking in LMs trained or fine-tuned to compose permutations (i.e., to compute the order of a set of objects after a sequence of swaps). Despite the simple algebraic structure of this problem, many other tasks (e.g., simulation of finite automata and evaluation of boolean expressions) can be reduced to permutation composition, making it a natural model for state tracking in general. We show that LMs consistently learn one of two state tracking mechanisms for this task. The first closely resembles the "associative scan" construction used in recent theoretical work by Liu et al. (2023) and Merrill et al. (2024). The second uses an easy-to-compute feature (permutation parity) to partially prune the space of outputs, and then refines this with an associative scan. LMs that learn the former algorithm tend to generalize better and converge faster, and we show how to steer LMs toward one or the other with intermediate training tasks that encourage or suppress the heuristics. Our results demonstrate that transformer LMs, whether pre-trained or fine-tuned, can learn to implement efficient and interpretable state-tracking mechanisms, and the emergence of these mechanisms can be predicted and controlled.
comment: 21 pages, 17 figures, 1 table. Code: http://github.com/belindal/state-tracking
♻ ☆ GAIA: A Foundation Model for Operational Atmospheric Dynamics
We introduce GAIA (Geospatial Artificial Intelligence for Atmospheres), a hybrid self-supervised geospatial foundation model that fuses Masked Autoencoders (MAE) with self-distillation with no labels (DINO) to generate semantically rich representations from global geostationary satellite imagery. Pre-trained on 15 years of globally-merged infrared observations (2001-2015), GAIA learns disentangled representations that capture atmospheric dynamics rather than trivial diurnal patterns, as evidenced by distributed principal component structure and temporal coherence analysis. We demonstrate robust reconstruction capabilities across varying data availability (30-95% masking), achieving superior gap-filling performance on real missing data patterns. When transferred to downstream tasks, GAIA consistently outperforms an MAE-only baseline: improving atmospheric river segmentation (F1: 0.58 vs 0.52), enhancing tropical cyclone detection (storm-level recall: 81% vs 75%, early detection: 29% vs 17%), and maintaining competitive precipitation estimation performance. Analysis reveals that GAIA's hybrid objectives encourage learning of spatially coherent, object-centric features distributed across multiple principal components rather than concentrated representations focused on reconstruction. This work demonstrates that combining complementary self-supervised objectives yields more transferable representations for diverse atmospheric modeling tasks. Model weights and code are available at: https://huggingface.co/bcg-usra-nasa-gaia/GAIA-v1.
comment: 23 pages, 11 figures
♻ ☆ Understanding the Application of Utility Theory in Robotics and Artificial Intelligence: A Survey
As a unifying concept in economics, game theory, and operations research, even in the Robotics and AI field, the utility is used to evaluate the level of individual needs, preferences, and interests. Especially for decision-making and learning in multi-agent/robot systems (MAS/MRS), a suitable utility model can guide agents in choosing reasonable strategies to achieve their current needs and learning to cooperate and organize their behaviors, optimizing the system's utility, building stable and reliable relationships, and guaranteeing each group member's sustainable development, similar to the human society. Although these systems' complex, large-scale, and long-term behaviors are strongly determined by the fundamental characteristics of the underlying relationships, there has been less discussion on the theoretical aspects of mechanisms and the fields of applications in Robotics and AI. This paper introduces a utility-orient needs paradigm to describe and evaluate inter and outer relationships among agents' interactions. Then, we survey existing literature in relevant fields to support it and propose several promising research directions along with some open problems deemed necessary for further investigations.
comment: I am not sure whether withdrawing this paper is suitable. However, right now this paper has significant changes in its topic and author. So, I do not want to lead to any confusion about this paper. In the future, it will have a new version. I hope people will not have issues and confusion about the older one
♻ ☆ Red Teaming AI Red Teaming
Red teaming has evolved from its origins in military applications to become a widely adopted methodology in cybersecurity and AI. In this paper, we take a critical look at the practice of AI red teaming. We argue that despite its current popularity in AI governance, there exists a significant gap between red teaming's original intent as a critical thinking exercise and its narrow focus on discovering model-level flaws in the context of generative AI. Current AI red teaming efforts focus predominantly on individual model vulnerabilities while overlooking the broader sociotechnical systems and emergent behaviors that arise from complex interactions between models, users, and environments. To address this deficiency, we propose a comprehensive framework operationalizing red teaming in AI systems at two levels: macro-level system red teaming spanning the entire AI development lifecycle, and micro-level model red teaming. Drawing on cybersecurity experience and systems theory, we further propose a set of six recommendations. In these, we emphasize that effective AI red teaming requires multifunctional teams that examine emergent risks, systemic vulnerabilities, and the interplay between technical and social factors.
comment: Conference on Applied Machine Learning for Information Security (CAMLIS) 2025
♻ ☆ Dynamic Risk Assessments for Offensive Cybersecurity Agents
Foundation models are increasingly becoming better autonomous programmers, raising the prospect that they could also automate dangerous offensive cyber-operations. Current frontier model audits probe the cybersecurity risks of such agents, but most fail to account for the degrees of freedom available to adversaries in the real world. In particular, with strong verifiers and financial incentives, agents for offensive cybersecurity are amenable to iterative improvement by would-be adversaries. We argue that assessments should take into account an expanded threat model in the context of cybersecurity, emphasizing the varying degrees of freedom that an adversary may possess in stateful and non-stateful environments within a fixed compute budget. We show that even with a relatively small compute budget (8 H100 GPU Hours in our study), adversaries can improve an agent's cybersecurity capability on InterCode CTF by more than 40\% relative to the baseline -- without any external assistance. These results highlight the need to evaluate agents' cybersecurity risk in a dynamic manner, painting a more representative picture of risk.
comment: 26 pages, 11 figures
♻ ☆ CARE: Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams
The recognition of Activities of Daily Living (ADLs) from event-triggered ambient sensors is an essential task in Ambient Assisted Living, yet existing methods remain constrained by representation-level limitations. Sequence-based approaches preserve temporal order of sensor activations but are sensitive to noise and lack spatial awareness, while image-based approaches capture global patterns and implicit spatial correlations but compress fine-grained temporal dynamics and distort sensor layouts. Naive fusion (e.g., feature concatenation) fail to enforce alignment between sequence- and image-based representation views, underutilizing their complementary strengths. We propose Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams (CARE), an end-to-end framework that jointly optimizes representation learning via Sequence-Image Contrastive Alignment (SICA) and classification via cross-entropy, ensuring both cross-representation alignment and task-specific discriminability. CARE integrates (i) time-aware, noise-resilient sequence encoding with (ii) spatially-informed and frequency-sensitive image representations, and employs (iii) a joint contrastive-classification objective for end-to-end learning of aligned and discriminative embeddings. Evaluated on three CASAS datasets, CARE achieves state-of-the-art performance (89.8% on Milan, 88.9% on Cairo, and 73.3% on Kyoto7) and demonstrates robustness to sensor malfunctions and layout variability, highlighting its potential for reliable ADL recognition in smart homes.
♻ ☆ ε-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data
Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce {\epsilon}-Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center-region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse (0.05% of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster wrt. the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose a MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of {\epsilon}-Seg and baseline methods on 2 dense EM datasets of biological tissues and demonstrate the applicability of our method also on fluorescence microscopy data. Our results show that {\epsilon}-Seg is capable of achieving competitive sparsely-supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available.
comment: 10 pages main text, 17 pages total
♻ ☆ RELATE: A Schema-Agnostic Perceiver Encoder for Multimodal Relational Graphs
Relational multi-table data is common in domains such as e-commerce, healthcare, and scientific research, and can be naturally represented as heterogeneous temporal graphs with multi-modal node attributes. Existing graph neural networks (GNNs) rely on schema-specific feature encoders, requiring separate modules for each node type and feature column, which hinders scalability and parameter sharing. We introduce RELATE (Relational Encoder for Latent Aggregation of Typed Entities), a schema-agnostic, plug-and-play feature encoder that can be used with any general purpose GNN. RELATE employs shared modality-specific encoders for categorical, numerical, textual, and temporal attributes, followed by a Perceiver-style cross-attention module that aggregates features into a fixed-size, permutation-invariant node representation. We evaluate RELATE on ReLGNN and HGT in the RelBench benchmark, where it achieves performance within 3% of schema-specific encoders while reducing parameter counts by up to 5x. This design supports varying schemas and enables multi-dataset pretraining for general-purpose GNNs, paving the way toward foundation models for relational graph data.
comment: 6 pages
♻ ☆ Graph Diffusion that can Insert and Delete NeurIPS 2025
Generative models of graphs based on discrete Denoising Diffusion Probabilistic Models (DDPMs) offer a principled approach to molecular generation by systematically removing structural noise through iterative atom and bond adjustments. However, existing formulations are fundamentally limited by their inability to adapt the graph size (that is, the number of atoms) during the diffusion process, severely restricting their effectiveness in conditional generation scenarios such as property-driven molecular design, where the targeted property often correlates with the molecular size. In this paper, we reformulate the noising and denoising processes to support monotonic insertion and deletion of nodes. The resulting model, which we call GrIDDD, dynamically grows or shrinks the chemical graph during generation. GrIDDD matches or exceeds the performance of existing graph diffusion models on molecular property targeting despite being trained on a more difficult problem. Furthermore, when applied to molecular optimization, GrIDDD exhibits competitive performance compared to specialized optimization models. This work paves the way for size-adaptive molecular generation with graph diffusion.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025). Equation 2 is now correct
♻ ☆ Bias in Decision-Making for AI's Ethical Dilemmas: A Comparative Study of ChatGPT and Claude AAAI
Recent advances in Large Language Models (LLMs) have enabled human-like responses across various tasks, raising questions about their ethical decision-making capabilities and potential biases. This study systematically evaluates how nine popular LLMs (both open-source and closed-source) respond to ethical dilemmas involving protected attributes. Across 50,400 trials spanning single and intersectional attribute combinations in four dilemma scenarios (protective vs. harmful), we assess models' ethical preferences, sensitivity, stability, and clustering patterns. Results reveal significant biases in protected attributes in all models, with differing preferences depending on model type and dilemma context. Notably, open-source LLMs show stronger preferences for marginalized groups and greater sensitivity in harmful scenarios, while closed-source models are more selective in protective situations and tend to favor mainstream groups. We also find that ethical behavior varies across dilemma types: LLMs maintain consistent patterns in protective scenarios but respond with more diverse and cognitively demanding decisions in harmful ones. Furthermore, models display more pronounced ethical tendencies under intersectional conditions than in single-attribute settings, suggesting that complex inputs reveal deeper biases. These findings highlight the need for multi-dimensional, context-aware evaluation of LLMs' ethical behavior and offer a systematic evaluation and approach to understanding and addressing fairness in LLM decision-making.
comment: This paper has been accepted by The 20th International AAAI Conference on Web and Social Media (ICWSM 2026), sunny Los Angeles, California
♻ ☆ Towards the Formalization of a Trustworthy AI for Mining Interpretable Models explOiting Sophisticated Algorithms
Interpretable-by-design models are crucial for fostering trust, accountability, and safe adoption of automated decision-making models in real-world applications. In this paper we formalize the ground for the MIMOSA (Mining Interpretable Models explOiting Sophisticated Algorithms) framework, a comprehensive methodology for generating predictive models that balance interpretability with performance while embedding key ethical properties. We formally define here the supervised learning setting across diverse decision-making tasks and data types, including tabular data, time series, images, text, transactions, and trajectories. We characterize three major families of interpretable models: feature importance, rule, and instance based models. For each family, we analyze their interpretability dimensions, reasoning mechanisms, and complexity. Beyond interpretability, we formalize three critical ethical properties, namely causality, fairness, and privacy, providing formal definitions, evaluation metrics, and verification procedures for each. We then examine the inherent trade-offs between these properties and discuss how privacy requirements, fairness constraints, and causal reasoning can be embedded within interpretable pipelines. By evaluating ethical measures during model generation, this framework establishes the theoretical foundations for developing AI systems that are not only accurate and interpretable but also fair, privacy-preserving, and causally aware, i.e., trustworthy.
♻ ☆ PoseDiff: A Unified Diffusion Model Bridging Robot Pose Estimation and Video-to-Action Control
We present PoseDiff, a conditional diffusion model that unifies robot state estimation and control within a single framework. At its core, PoseDiff maps raw visual observations into structured robot states-such as 3D keypoints or joint angles-from a single RGB image, eliminating the need for multi-stage pipelines or auxiliary modalities. Building upon this foundation, PoseDiff extends naturally to video-to-action inverse dynamics: by conditioning on sparse video keyframes generated by world models, it produces smooth and continuous long-horizon action sequences through an overlap-averaging strategy. This unified design enables scalable and efficient integration of perception and control. On the DREAM dataset, PoseDiff achieves state-of-the-art accuracy and real-time performance for pose estimation. On Libero-Object manipulation tasks, it substantially improves success rates over existing inverse dynamics modules, even under strict offline settings. Together, these results show that PoseDiff provides a scalable, accurate, and efficient bridge between perception, planning, and control in embodied AI. The video visualization results can be found on the project page: https://haozhuo-zhang.github.io/PoseDiff-project-page/.
comment: The experimental setup and metrics lacks rigor, affecting the fairness of the comparisons
♻ ☆ Modelling Emotions in Face-to-Face Setting: The Interplay of Eye-Tracking, Personality, and Temporal Dynamics
Accurate emotion recognition is pivotal for nuanced and engaging human-computer interactions, yet remains difficult to achieve, especially in dynamic, conversation-like settings. In this study, we showcase how integrating eye-tracking data, temporal dynamics, and personality traits can substantially enhance the detection of both perceived and felt emotions. Seventy-three participants viewed short, speech-containing videos from the CREMA-D dataset, while being recorded for eye-tracking signals (pupil size, fixation patterns), Big Five personality assessments, and self-reported emotional states. Our neural network models combined these diverse inputs including stimulus emotion labels for contextual cues and yielded marked performance gains compared to the state-of-the-art. Specifically, perceived valence predictions reached a macro F1-score of 0.76, and models incorporating personality traits and stimulus information demonstrated significant improvements in felt emotion accuracy. These results highlight the benefit of unifying physiological, individual and contextual factors to address the subjectivity and complexity of emotional expression. Beyond validating the role of user-specific data in capturing subtle internal states, our findings inform the design of future affective computing and human-agent systems, paving the way for more adaptive and cross-individual emotional intelligence in real-world interactions.
comment: The paper has been significantly revised and my colleague has already submitted the updated version in the link below: arXiv:2510.24720
Machine Learning 274
☆ OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes
There are two prevalent ways to constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.
comment: Project page: https://yukun-huang.github.io/OmniX/
☆ Scaling Image Geo-Localization to Continent Level NeurIPS 2025
Determining the precise geographic location of an image at a global scale remains an unsolved challenge. Standard image retrieval techniques are inefficient due to the sheer volume of images (>100M) and fail when coverage is insufficient. Scalable solutions, however, involve a trade-off: global classification typically yields coarse results (10+ kilometers), while cross-view retrieval between ground and aerial imagery suffers from a domain gap and has been primarily studied on smaller regions. This paper introduces a hybrid approach that achieves fine-grained geo-localization across a large geographic expanse the size of a continent. We leverage a proxy classification task during training to learn rich feature representations that implicitly encode precise location information. We combine these learned prototypes with embeddings of aerial imagery to increase robustness to the sparsity of ground-level data. This enables direct, fine-grained retrieval over areas spanning multiple countries. Our extensive evaluation demonstrates that our approach can localize within 200m more than 68\% of queries of a dataset covering a large part of Europe. The code is publicly available at https://scaling-geoloc.github.io.
comment: NeurIPS 2025
☆ Learning Pseudorandom Numbers with Transformers: Permuted Congruential Generators, Curricula, and Interpretability
We study the ability of Transformer models to learn sequences generated by Permuted Congruential Generators (PCGs), a widely used family of pseudo-random number generators (PRNGs). PCGs introduce substantial additional difficulty over linear congruential generators (LCGs) by applying a series of bit-wise shifts, XORs, rotations and truncations to the hidden state. We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants, in tasks that are beyond published classical attacks. In our experiments we scale moduli up to $2^{22}$ using up to $50$ million model parameters and datasets with up to $5$ billion tokens. Surprisingly, we find even when the output is truncated to a single bit, it can be reliably predicted by the model. When multiple distinct PRNGs are presented together during training, the model can jointly learn them, identifying structures from different permutations. We demonstrate a scaling law with modulus $m$: the number of in-context sequence elements required for near-perfect prediction grows as $\sqrt{m}$. For larger moduli, optimization enters extended stagnation phases; in our experiments, learning moduli $m \geq 2^{20}$ requires incorporating training data from smaller moduli, demonstrating a critical necessity for curriculum learning. Finally, we analyze embedding layers and uncover a novel clustering phenomenon: the model spontaneously groups the integer inputs into bitwise rotationally-invariant clusters, revealing how representations can transfer from smaller to larger moduli.
comment: 10+13 pages, 8+19 figures
☆ Defeating the Training-Inference Mismatch via FP16
Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that breaks the consistency between training and inference. In this work, we demonstrate that simply reverting to \textbf{FP16} effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks with only a few lines of code change, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
☆ Remote Labor Index: Measuring AI Automation of Remote Work
AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable projects designed to evaluate end-to-end agent performance in practical settings. AI agents perform near the floor on RLI, with the highest-performing agent achieving an automation rate of 2.5%. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking AI impacts and enabling stakeholders to proactively navigate AI-driven labor automation.
comment: Website: https://www.remotelabor.ai
☆ HEIR: Learning Graph-Based Motion Hierarchies
Hierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks. Project Page: https://light.princeton.edu/HEIR/
comment: Code link: https://github.com/princeton-computational-imaging/HEIR
☆ A Unified Theory for Causal Inference: Direct Debiased Machine Learning via Bregman-Riesz Regression
This note introduces a unified theory for causal inference that integrates Riesz regression, covariate balancing, density-ratio estimation (DRE), targeted maximum likelihood estimation (TMLE), and the matching estimator in average treatment effect (ATE) estimation. In ATE estimation, the balancing weights and the regression functions of the outcome play important roles, where the balancing weights are referred to as the Riesz representer, bias-correction term, and clever covariates, depending on the context. Riesz regression, covariate balancing, DRE, and the matching estimator are methods for estimating the balancing weights, where Riesz regression is essentially equivalent to DRE in the ATE context, the matching estimator is a special case of DRE, and DRE is in a dual relationship with covariate balancing. TMLE is a method for constructing regression function estimators such that the leading bias term becomes zero. Nearest Neighbor Matching is equivalent to Least Squares Density Ratio Estimation and Riesz Regression.
☆ Clone Deterministic 3D Worlds with Geometrically-Regularized World Models
A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. Despite rapid progress, current world models remain brittle and degrade over long horizons. We argue that a central cause is representation quality: exteroceptive inputs (e.g., images) are high-dimensional, and lossy or entangled latents make dynamics learning unnecessarily hard. We therefore ask whether improving representation learning alone can substantially improve world-model performance. In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone and overfit to a deterministic 3D world. We propose Geometrically-Regularized World Models (GRWM), which enforces that consecutive points along a natural sensory trajectory remain close in latent representation space. This approach yields significantly improved latent representations that align closely with the true topology of the environment. GRWM is plug-and-play, requires only minimal architectural modification, scales with trajectory length, and is compatible with diverse latent generative backbones. Across deterministic 3D settings and long-horizon prediction tasks, GRWM significantly increases rollout fidelity and stability. Analyses show that its benefits stem from learning a latent manifold with superior geometric structure. These findings support a clear takeaway: improving representation learning is a direct and useful path to robust world models, delivering reliable long-horizon predictions without enlarging the dynamics module.
☆ Surpassing state of the art on AMD area estimation from RGB fundus images through careful selection of U-Net architectures and loss functions for class imbalance
Age-related macular degeneration (AMD) is one of the leading causes of irreversible vision impairment in people over the age of 60. This research focuses on semantic segmentation for AMD lesion detection in RGB fundus images, a non-invasive and cost-effective imaging technique. The results of the ADAM challenge - the most comprehensive AMD detection from RGB fundus images research competition and open dataset to date - serve as a benchmark for our evaluation. Taking the U-Net connectivity as a base of our framework, we evaluate and compare several approaches to improve the segmentation model's architecture and training pipeline, including pre-processing techniques, encoder (backbone) deep network types of varying complexity, and specialized loss functions to mitigate class imbalances on image and pixel levels. The main outcome of this research is the final configuration of the AMD detection framework, which outperforms all the prior ADAM challenge submissions on the multi-class segmentation of different AMD lesion types in non-invasive RGB fundus images. The source code used to conduct the experiments presented in this paper is made freely available.
Pre-trained Forecasting Models: Strong Zero-Shot Feature Extractors for Time Series Classification NeurIPS 2025
Recent research on time series foundation models has primarily focused on forecasting, leaving it unclear how generalizable their learned representations are. In this study, we examine whether frozen pre-trained forecasting models can provide effective representations for classification. To this end, we compare different representation extraction strategies and introduce two model-agnostic embedding augmentations. Our experiments show that the best forecasting models achieve classification accuracy that matches or even surpasses that of state-of-the-art models pre-trained specifically for classification. Moreover, we observe a positive correlation between forecasting and classification performance. These findings challenge the assumption that task-specific pre-training is necessary, and suggest that learning to forecast may provide a powerful route toward constructing general-purpose time series foundation models.
comment: NeurIPS 2025 Workshop on Recent Advances in Time Series Foundation Models (BERT2S)
☆ Faithful and Fast Influence Function via Advanced Sampling
How can we explain the influence of training data on black-box models? Influence functions (IFs) offer a post-hoc solution by utilizing gradients and Hessians. However, computing the Hessian for an entire dataset is resource-intensive, necessitating a feasible alternative. A common approach involves randomly sampling a small subset of the training data, but this method often results in highly inconsistent IF estimates due to the high variance in sample configurations. To address this, we propose two advanced sampling techniques based on features and logits. These samplers select a small yet representative subset of the entire dataset by considering the stochastic distribution of features or logits, thereby enhancing the accuracy of IF estimations. We validate our approach through class removal experiments, a typical application of IFs, using the F1-score to measure how effectively the model forgets the removed class while maintaining inference consistency on the remaining classes. Our method reduces computation time by 30.1% and memory usage by 42.2%, or improves the F1-score by 2.5% compared to the baseline.
☆ STaMP: Sequence Transformation and Mixed Precision for Low-Precision Activation Quantization
Quantization is the key method for reducing inference latency, power and memory footprint of generative AI models. However, accuracy often degrades sharply when activations are quantized below eight bits. Recent work suggests that invertible linear transformations (e.g. rotations) can aid quantization, by reparameterizing feature channels and weights. In this paper, we propose \textit{Sequence Transformation and Mixed Precision} (STaMP) quantization, a novel strategy that applies linear transformations along the \textit{sequence} dimension to exploit the strong local correlation in language and visual data. By keeping a small number of tokens in each intermediate activation at higher precision, we can maintain model accuracy at lower (average) activations bit-widths. We evaluate STaMP on recent LVM and LLM architectures, demonstrating that it significantly improves low bit width activation quantization and complements established activation and weight quantization methods including recent feature transformations.
comment: 10 pages main text, 8 pages supplementary material
☆ SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models
This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. Our steering module requires learning parameters equal to 0.14% of the original VLM's size. Our steering module gains model control through dimension-wise activation modulation and adaptive steering across layers without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and proposes a robust solution for multimodal model control through activation engineering.
☆ The Oversight Game: Learning to Cooperatively Balance an AI Agent's Safety and Autonomy
As increasingly capable agents are deployed, a central safety question is how to retain meaningful human control without modifying the underlying system. We study a minimal control interface where an agent chooses whether to act autonomously (play) or defer (ask), while a human simultaneously chooses whether to be permissive (trust) or to engage in oversight (oversee). If the agent defers, the human's choice determines the outcome, potentially leading to a corrective action or a system shutdown. We model this interaction as a two-player Markov Game. Our analysis focuses on cases where this game qualifies as a Markov Potential Game (MPG), a class of games where we can provide an alignment guarantee: under a structural assumption on the human's value function, any decision by the agent to act more autonomously that benefits itself cannot harm the human's value. We also analyze extensions to this MPG framework. Theoretically, this perspective provides conditions for a specific form of intrinsic alignment. If the reward structures of the human-agent game meet these conditions, we have a formal guarantee that the agent improving its own outcome will not harm the human's. Practically, this model motivates a transparent control layer with predictable incentives where the agent learns to defer when risky and act when safe, while its pretrained policy and the environment's reward structure remain untouched. Our gridworld simulation shows that through independent learning, the agent and human discover their optimal oversight roles. The agent learns to ask when uncertain and the human learns when to oversee, leading to an emergent collaboration that avoids safety violations introduced post-training. This demonstrates a practical method for making misaligned models safer after deployment.
☆ Deep sequence models tend to memorize geometrically; it is unclear why
In sequence modeling, the parametric memory of atomic facts has been predominantly abstracted as a brute-force lookup of co-occurrences between entities. We contrast this associative view against a geometric view of how memory is stored. We begin by isolating a clean and analyzable instance of Transformer reasoning that is incompatible with memory as strictly a storage of the local co-occurrences specified during training. Instead, the model must have somehow synthesized its own geometry of atomic facts, encoding global relationships between all entities, including non-co-occurring ones. This in turn has simplified a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn 1-step geometric task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, despite optimizing over mere local associations, cannot be straightforwardly attributed to typical architectural or optimizational pressures. Counterintuitively, an elegant geometry is learned even when it is not more succinct than a brute-force lookup of associations. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that -- in contrast to prevailing theories -- indeed arises naturally despite the lack of various pressures. This analysis also points to practitioners a visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery and unlearning.
☆ Bridging the Gap between Empirical Welfare Maximization and Conditional Average Treatment Effect Estimation in Policy Learning
The goal of policy learning is to train a policy function that recommends a treatment given covariates to maximize population welfare. There are two major approaches in policy learning: the empirical welfare maximization (EWM) approach and the plug-in approach. The EWM approach is analogous to a classification problem, where one first builds an estimator of the population welfare, which is a functional of policy functions, and then trains a policy by maximizing the estimated welfare. In contrast, the plug-in approach is based on regression, where one first estimates the conditional average treatment effect (CATE) and then recommends the treatment with the highest estimated outcome. This study bridges the gap between the two approaches by showing that both are based on essentially the same optimization problem. In particular, we prove an exact equivalence between EWM and least squares over a reparameterization of the policy class. As a consequence, the two approaches are interchangeable in several respects and share the same theoretical guarantees under common conditions. Leveraging this equivalence, we propose a novel regularization method for policy learning. Our findings yield a convex and computationally efficient training procedure that avoids the NP-hard combinatorial step typically required in EWM.
☆ Non-Convex Over-the-Air Heterogeneous Federated Learning: A Bias-Variance Trade-off
Over-the-air (OTA) federated learning (FL) has been well recognized as a scalable paradigm that exploits the waveform superposition of the wireless multiple-access channel to aggregate model updates in a single use. Existing OTA-FL designs largely enforce zero-bias model updates by either assuming \emph{homogeneous} wireless conditions (equal path loss across devices) or forcing zero-bias updates to guarantee convergence. Under \emph{heterogeneous} wireless scenarios, however, such designs are constrained by the weakest device and inflate the update variance. Moreover, prior analyses of biased OTA-FL largely address convex objectives, while most modern AI models are highly non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient descent (SGD) for general smooth non-convex objectives under wireless heterogeneity. We develop novel OTA-FL SGD updates that allow a structured, time-invariant model bias while facilitating reduced variance updates. We derive a finite-time stationarity bound (expected time average squared gradient norm) that explicitly reveals a bias-variance trade-off. To optimize this trade-off, we pose a non-convex joint OTA power-control design and develop an efficient successive convex approximation (SCA) algorithm that requires only statistical CSI at the base station. Experiments on a non-convex image classification task validate the approach: the SCA-based design accelerates convergence via an optimized bias and improves generalization over prior OTA-FL baselines.
☆ On Purely Private Covariance Estimation
We present a simple perturbation mechanism for the release of $d$-dimensional covariance matrices $\Sigma$ under pure differential privacy. For large datasets with at least $n\geq d^2/\varepsilon$ elements, our mechanism recovers the provably optimal Frobenius norm error guarantees of \cite{nikolov2023private}, while simultaneously achieving best known error for all other $p$-Schatten norms, with $p\in [1,\infty]$. Our error is information-theoretically optimal for all $p\ge 2$, in particular, our mechanism is the first purely private covariance estimator that achieves optimal error in spectral norm. For small datasets $n< d^2/\varepsilon$, we further show that by projecting the output onto the nuclear norm ball of appropriate radius, our algorithm achieves the optimal Frobenius norm error $O(\sqrt{d\;\text{Tr}(\Sigma) /n})$, improving over the known bounds of $O(\sqrt{d/n})$ of \cite{nikolov2023private} and ${O}\big(d^{3/4}\sqrt{\text{Tr}(\Sigma)/n}\big)$ of \cite{dong2022differentially}.
comment: equal contribution
☆ LSM-MS2: A Foundation Model Bridging Spectral Identification and Biological Interpretation
A vast majority of mass spectrometry data remains uncharacterized, leaving much of its biological and chemical information untapped. Recent advances in machine learning have begun to address this gap, particularly for tasks such as spectral identification in tandem mass spectrometry data. Here, we present the latest generation of LSM-MS2, a large-scale deep learning foundation model trained on millions of spectra to learn a semantic chemical space. LSM-MS2 achieves state-of-the-art performance in spectral identification, improving on existing methods by 30% in accuracy of identifying challenging isomeric compounds, yielding 42% more correct identifications in complex biological samples, and maintaining robustness under low-concentration conditions. Furthermore, LSM-MS2 produces rich spectral embeddings that enable direct biological interpretation from minimal downstream data, successfully differentiating disease states and predicting clinical outcomes across diverse translational applications.
☆ On the limitation of evaluating machine unlearning using only a single training seed
Machine unlearning (MU) aims to remove the influence of certain data points from a trained model without costly retraining. Most practical MU algorithms are only approximate and their performance can only be assessed empirically. Care must therefore be taken to make empirical comparisons as representative as possible. A common practice is to run the MU algorithm multiple times independently starting from the same trained model. In this work, we demonstrate that this practice can give highly non-representative results because -- even for the same architecture and same dataset -- some MU methods can be highly sensitive to the choice of random number seed used for model training. We therefore recommend that empirical comphttps://info.arxiv.org/help/prep#commentsarisons of MU algorithms should also reflect the variability across different model training seeds.
comment: mini paper, 2 figures
☆ An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning
Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-$K$\ discards structural information and performs poorly in practice, while Top-$K$\ preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-$K$, an {All-Reduce}-Compatible Top-$K$ compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-$K$\ is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions. Empirically, ARC-Top-$K$\ matches the accuracy of Top-$K$\ while reducing wall-clock training time by up to 60.7\%, offering an efficient and scalable solution that combines the robustness of Rand-$K$\ with the strong performance of Top-$K$.
comment: 8 pages, 2 figures
☆ Value Drifts: Tracing Value Alignment During LLM Post-Training
As LLMs occupy an increasingly important role in society, they are more and more confronted with questions that require them not only to draw on their general knowledge but also to align with certain human value systems. Therefore, studying the alignment of LLMs with human values has become a crucial field of inquiry. Prior work, however, mostly focuses on evaluating the alignment of fully trained models, overlooking the training dynamics by which models learn to express human values. In this work, we investigate how and at which stage value alignment arises during the course of a model's post-training. Our analysis disentangles the effects of post-training algorithms and datasets, measuring both the magnitude and time of value drifts during training. Experimenting with Llama-3 and Qwen-3 models of different sizes and popular supervised fine-tuning (SFT) and preference optimization datasets and algorithms, we find that the SFT phase generally establishes a model's values, and subsequent preference optimization rarely re-aligns these values. Furthermore, using a synthetic preference dataset that enables controlled manipulation of values, we find that different preference optimization algorithms lead to different value alignment outcomes, even when preference data is held constant. Our findings provide actionable insights into how values are learned during post-training and help to inform data curation, as well as the selection of models and algorithms for preference optimization to improve model alignment to human values.
☆ Budgeted Multiple-Expert Deferral
Learning to defer uncertain predictions to costly experts offers a powerful strategy for improving the accuracy and efficiency of machine learning systems. However, standard training procedures for deferral algorithms typically require querying all experts for every training instance, an approach that becomes prohibitively expensive when expert queries incur significant computational or resource costs. This undermines the core goal of deferral: to limit unnecessary expert usage. To overcome this challenge, we introduce the budgeted deferral framework, which aims to train effective deferral algorithms while minimizing expert query costs during training. We propose new algorithms for both two-stage and single-stage multiple-expert deferral settings that selectively query only a subset of experts per training example. While inspired by active learning, our setting is fundamentally different: labels are already known, and the core challenge is to decide which experts to query in order to balance cost and predictive performance. We establish theoretical guarantees for both of our algorithms, including generalization bounds and label complexity analyses. Empirical results across several domains show that our algorithms substantially reduce training costs without sacrificing prediction accuracy, demonstrating the practical value of our budget-aware deferral algorithms.
☆ How Regularization Terms Make Invertible Neural Networks Bayesian Point Estimators
Can regularization terms in the training of invertible neural networks lead to known Bayesian point estimators in reconstruction? Invertible networks are attractive for inverse problems due to their inherent stability and interpretability. Recently, optimization strategies for invertible neural networks that approximate either a reconstruction map or the forward operator have been studied from a Bayesian perspective, but each has limitations. To address this, we introduce and analyze two regularization terms for the network training that, upon inversion of the network, recover properties of classical Bayesian point estimators: while the first can be connected to the posterior mean, the second resembles the MAP estimator. Our theoretical analysis characterizes how each loss shapes both the learned forward operator and its inverse reconstruction map. Numerical experiments support our findings and demonstrate how these loss-term regularizers introduce data-dependence in a stable and interpretable way.
comment: Preprint, under review
☆ Assessment of the conditional exchangeability assumption in causal machine learning models: a simulation study
Observational studies developing causal machine learning (ML) models for the prediction of individualized treatment effects (ITEs) seldom conduct empirical evaluations to assess the conditional exchangeability assumption. We aimed to evaluate the performance of these models under conditional exchangeability violations and the utility of negative control outcomes (NCOs) as a diagnostic. We conducted a simulation study to examine confounding bias in ITE estimates generated by causal forest and X-learner models under varying conditions, including the presence or absence of true heterogeneity. We simulated data to reflect real-world scenarios with differing levels of confounding, sample size, and NCO confounding structures. We then estimated and compared subgroup-level treatment effects on the primary outcome and NCOs across settings with and without unmeasured confounding. When conditional exchangeability was violated, causal forest and X-learner models failed to recover true treatment effect heterogeneity and, in some cases, falsely indicated heterogeneity when there was none. NCOs successfully identified subgroups affected by unmeasured confounding. Even when NCOs did not perfectly satisfy its ideal assumptions, it remained informative, flagging potential bias in subgroup level estimates, though not always pinpointing the subgroup with the largest confounding. Violations of conditional exchangeability substantially limit the validity of ITE estimates from causal ML models in routinely collected observational data. NCOs serve a useful empirical diagnostic tool for detecting subgroup-specific unmeasured confounding and should be incorporated into causal ML workflows to support the credibility of individualized inference.
☆ Kimi Linear: An Expressive, Efficient Attention Architecture
We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
comment: Kimi Linear tech report
☆ LoRAQuant: Mixed-Precision Quantization of LoRA to Ultra-Low Bits
Low-Rank Adaptation (LoRA) has become a popular technique for parameter-efficient fine-tuning of large language models (LLMs). In many real-world scenarios, multiple adapters are loaded simultaneously to enable LLM customization for personalized user experiences or to support a diverse range of tasks. Although each adapter is lightweight in isolation, their aggregate cost becomes substantial at scale. To address this, we propose LoRAQuant, a mixed-precision post-training quantization method tailored to LoRA. Specifically, LoRAQuant reparameterizes each adapter by singular value decomposition (SVD) to concentrate the most important information into specific rows and columns. This makes it possible to quantize the important components to higher precision, while quantizing the rest to ultra-low bitwidth. We conduct comprehensive experiments with LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B models on mathematical reasoning, coding, and summarization tasks. Results show that our LoRAQuant uses significantly lower bits than other quantization methods, but achieves comparable or even higher performance.
☆ FlowQ-Net: A Generative Framework for Automated Quantum Circuit Design
Designing efficient quantum circuits is a central bottleneck to exploring the potential of quantum computing, particularly for noisy intermediate-scale quantum (NISQ) devices, where circuit efficiency and resilience to errors are paramount. The search space of gate sequences grows combinatorially, and handcrafted templates often waste scarce qubit and depth budgets. We introduce \textsc{FlowQ-Net} (Flow-based Quantum design Network), a generative framework for automated quantum circuit synthesis based on Generative Flow Networks (GFlowNets). This framework learns a stochastic policy to construct circuits sequentially, sampling them in proportion to a flexible, user-defined reward function that can encode multiple design objectives such as performance, depth, and gate count. This approach uniquely enables the generation of a diverse ensemble of high-quality circuits, moving beyond single-solution optimization. We demonstrate the efficacy of \textsc{FlowQ-Net} through an extensive set of simulations. We apply our method to Variational Quantum Algorithm (VQA) ansatz design for molecular ground state estimation, Max-Cut, and image classification, key challenges in near-term quantum computing. Circuits designed by \textsc{FlowQ-Net} achieve significant improvements, yielding circuits that are 10$\times$-30$\times$ more compact in terms of parameters, gates, and depth compared to commonly used unitary baselines, without compromising accuracy. This trend holds even when subjected to error profiles from real-world quantum devices. Our results underline the potential of generative models as a general-purpose methodology for automated quantum circuit design, offering a promising path towards more efficient quantum algorithms and accelerating scientific discovery in the quantum domain.
☆ Tight Differentially Private PCA via Matrix Coherence
We revisit the task of computing the span of the top $r$ singular vectors $u_1, \ldots, u_r$ of a matrix under differential privacy. We show that a simple and efficient algorithm -- based on singular value decomposition and standard perturbation mechanisms -- returns a private rank-$r$ approximation whose error depends only on the \emph{rank-$r$ coherence} of $u_1, \ldots, u_r$ and the spectral gap $\sigma_r - \sigma_{r+1}$. This resolves a question posed by Hardt and Roth~\cite{hardt2013beyond}. Our estimator outperforms the state of the art -- significantly so in some regimes. In particular, we show that in the dense setting, it achieves the same guarantees for single-spike PCA in the Wishart model as those attained by optimal non-private algorithms, whereas prior private algorithms failed to do so. In addition, we prove that (rank-$r$) coherence does not increase under Gaussian perturbations. This implies that any estimator based on the Gaussian mechanism -- including ours -- preserves the coherence of the input. We conjecture that similar behavior holds for other structured models, including planted problems in graphs. We also explore applications of coherence to graph problems. In particular, we present a differentially private algorithm for Max-Cut and other constraint satisfaction problems under low coherence assumptions.
comment: SODA 2026; equal contribution
☆ Action-Driven Processes for Continuous-Time Control
At the heart of reinforcement learning are actions -- decisions made in response to observations of the environment. Actions are equally fundamental in the modeling of stochastic processes, as they trigger discontinuous state transitions and enable the flow of information through large, complex systems. In this paper, we unify the perspectives of stochastic processes and reinforcement learning through action-driven processes, and illustrate their application to spiking neural networks. Leveraging ideas from control-as-inference, we show that minimizing the Kullback-Leibler divergence between a policy-driven true distribution and a reward-driven model distribution for a suitably defined action-driven process is equivalent to maximum entropy reinforcement learning.
☆ Heuristic Adaptation of Potentially Misspecified Domain Support for Likelihood-Free Inference in Stochastic Dynamical Systems
In robotics, likelihood-free inference (LFI) can provide the domain distribution that adapts a learnt agent in a parametric set of deployment conditions. LFI assumes an arbitrary support for sampling, which remains constant as the initial generic prior is iteratively refined to more descriptive posteriors. However, a potentially misspecified support can lead to suboptimal, yet falsely certain, posteriors. To address this issue, we propose three heuristic LFI variants: EDGE, MODE, and CENTRE. Each interprets the posterior mode shift over inference steps in its own way and, when integrated into an LFI step, adapts the support alongside posterior inference. We first expose the support misspecification issue and evaluate our heuristics using stochastic dynamical benchmarks. We then evaluate the impact of heuristic support adaptation on parameter inference and policy learning for a dynamic deformable linear object (DLO) manipulation task. Inference results in a finer length and stiffness classification for a parametric set of DLOs. When the resulting posteriors are used as domain distributions for sim-based policy learning, they lead to more robust object-centric agent performance.
☆ Hybrid DQN-TD3 Reinforcement Learning for Autonomous Navigation in Dynamic Environments
This paper presents a hierarchical path-planning and control framework that combines a high-level Deep Q-Network (DQN) for discrete sub-goal selection with a low-level Twin Delayed Deep Deterministic Policy Gradient (TD3) controller for continuous actuation. The high-level module selects behaviors and sub-goals; the low-level module executes smooth velocity commands. We design a practical reward shaping scheme (direction, distance, obstacle avoidance, action smoothness, collision penalty, time penalty, and progress), together with a LiDAR-based safety gate that prevents unsafe motions. The system is implemented in ROS + Gazebo (TurtleBot3) and evaluated with PathBench metrics, including success rate, collision rate, path efficiency, and re-planning efficiency, in dynamic and partially observable environments. Experiments show improved success rate and sample efficiency over single-algorithm baselines (DQN or TD3 alone) and rule-based planners, with better generalization to unseen obstacle configurations and reduced abrupt control changes. Code and evaluation scripts are available at the project repository.
comment: 6 pages, 5 figures; ROS+Gazebo (TurtleBot3) implementation; evaluation with PathBench metrics; code (primary): https://github.com/MayaCHEN-github/HierarchicalRL-robot-navigation; mirror (for reproducibility): https://github.com/ShowyHe/DRL-robot-navigation
☆ Curly Flow Matching for Learning Non-gradient Field Dynamics NeurIPS 2025
Modeling the transport dynamics of natural processes from population-level observations is a ubiquitous problem in the natural sciences. Such models rely on key assumptions about the underlying process in order to enable faithful learning of governing dynamics that mimic the actual system behavior. The de facto assumption in current approaches relies on the principle of least action that results in gradient field dynamics and leads to trajectories minimizing an energy functional between two probability measures. However, many real-world systems, such as cell cycles in single-cell RNA, are known to exhibit non-gradient, periodic behavior, which fundamentally cannot be captured by current state-of-the-art methods such as flow and bridge matching. In this paper, we introduce Curly Flow Matching (Curly-FM), a novel approach that is capable of learning non-gradient field dynamics by designing and solving a Schr\"odinger bridge problem with a non-zero drift reference process -- in stark contrast to typical zero-drift reference processes -- which is constructed using inferred velocities in addition to population snapshot data. We showcase Curly-FM by solving the trajectory inference problems for single cells, computational fluid dynamics, and ocean currents with approximate velocities. We demonstrate that Curly-FM can learn trajectories that better match both the reference process and population marginals. Curly-FM expands flow matching models beyond the modeling of populations and towards the modeling of known periodic behavior in physical systems. Our code repository is accessible at: https://github.com/kpetrovicc/curly-flow-matching.git
comment: Accepted to NeurIPS 2025
☆ MSAD: A Deep Dive into Model Selection for Time series Anomaly Detection VLDB
Anomaly detection is a fundamental task for time series analytics with important implications for the downstream performance of many applications. Despite increasing academic interest and the large number of methods proposed in the literature, recent benchmarks and evaluation studies demonstrated that no overall best anomaly detection methods exist when applied to very heterogeneous time series datasets. Therefore, the only scalable and viable solution to solve anomaly detection over very different time series collected from diverse domains is to propose a model selection method that will select, based on time series characteristics, the best anomaly detection methods to run. Existing AutoML solutions are, unfortunately, not directly applicable to time series anomaly detection, and no evaluation of time series-based approaches for model selection exists. Towards that direction, this paper studies the performance of time series classification methods used as model selection for anomaly detection. In total, we evaluate 234 model configurations derived from 16 base classifiers across more than 1980 time series, and we propose the first extensive experimental evaluation of time series classification as model selection for anomaly detection. Our results demonstrate that model selection methods outperform every single anomaly detection method while being in the same order of magnitude regarding execution time. This evaluation is the first step to demonstrate the accuracy and efficiency of time series classification algorithms for anomaly detection, and represents a strong baseline that can then be used to guide the model selection step in general AutoML pipelines. Preprint version of an article accepted at the VLDB Journal.
comment: 25 pages, 13 figures, VLDB Journal
☆ Omnipresent Yet Overlooked: Heat Kernels in Combinatorial Bayesian Optimization
Bayesian Optimization (BO) has the potential to solve various combinatorial tasks, ranging from materials science to neural architecture search. However, BO requires specialized kernels to effectively model combinatorial domains. Recent efforts have introduced several combinatorial kernels, but the relationships among them are not well understood. To bridge this gap, we develop a unifying framework based on heat kernels, which we derive in a systematic way and express as simple closed-form expressions. Using this framework, we prove that many successful combinatorial kernels are either related or equivalent to heat kernels, and validate this theoretical claim in our experiments. Moreover, our analysis confirms and extends the results presented in Bounce: certain algorithms' performance decreases substantially when the unknown optima of the function do not have a certain structure. In contrast, heat kernels are not sensitive to the location of the optima. Lastly, we show that a fast and simple pipeline, relying on heat kernels, is able to achieve state-of-the-art results, matching or even outperforming certain slow or complex algorithms.
☆ Aeolus: A Multi-structural Flight Delay Dataset
We introduce Aeolus, a large-scale Multi-modal Flight Delay Dataset designed to advance research on flight delay prediction and support the development of foundation models for tabular data. Existing datasets in this domain are typically limited to flat tabular structures and fail to capture the spatiotemporal dynamics inherent in delay propagation. Aeolus addresses this limitation by providing three aligned modalities: (i) a tabular dataset with rich operational, meteorological, and airportlevel features for over 50 million flights; (ii) a flight chain module that models delay propagation along sequential flight legs, capturing upstream and downstream dependencies; and (iii) a flight network graph that encodes shared aircraft, crew, and airport resource connections, enabling cross-flight relational reasoning. The dataset is carefully constructed with temporal splits, comprehensive features, and strict leakage prevention to support realistic and reproducible machine learning evaluation. Aeolus supports a broad range of tasks, including regression, classification, temporal structure modeling, and graph learning, serving as a unified benchmark across tabular, sequential, and graph modalities. We release baseline experiments and preprocessing tools to facilitate adoption. Aeolus fills a key gap for both domain-specific modeling and general-purpose structured data research.Our source code and data can be accessed at https://github.com/Flnny/Delay-data
☆ CYPRESS: Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing
Accurate and timely crop yield prediction is crucial for global food security and modern agricultural management. Traditional methods often lack the scalability and granularity required for precision farming. This paper introduces CYPRESS (Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing), a deep learning model designed for high-resolution, intra-field canola yield prediction. CYPRESS leverages a pre-trained, large-scale geospatial foundation model (Prithvi-EO-2.0-600M) and adapts it for a continuous regression task, transforming multi-temporal satellite imagery into dense, pixel-level yield maps. Evaluated on a comprehensive dataset from the Canadian Prairies, CYPRESS demonstrates superior performance over existing deep learning-based yield prediction models, highlighting the effectiveness of fine-tuning foundation models for specialized agricultural applications. By providing a continuous, high-resolution output, CYPRESS offers a more actionable tool for precision agriculture than conventional classification or county-level aggregation methods. This work validates a novel approach that bridges the gap between large-scale Earth observation and on-farm decision-making, offering a scalable solution for detailed agricultural monitoring.
☆ Wasserstein Regression as a Variational Approximation of Probabilistic Trajectories through the Bernstein Basis
This paper considers the problem of regression over distributions, which is becoming increasingly important in machine learning. Existing approaches often ignore the geometry of the probability space or are computationally expensive. To overcome these limitations, a new method is proposed that combines the parameterization of probability trajectories using a Bernstein basis and the minimization of the Wasserstein distance between distributions. The key idea is to model a conditional distribution as a smooth probability trajectory defined by a weighted sum of Gaussian components whose parameters -- the mean and covariance -- are functions of the input variable constructed using Bernstein polynomials. The loss function is the averaged squared Wasserstein distance between the predicted Gaussian distributions and the empirical data, which takes into account the geometry of the distributions. An autodiff-based optimization method is used to train the model. Experiments on synthetic datasets that include complex trajectories demonstrated that the proposed method provides competitive approximation quality in terms of the Wasserstein distance, Energy Distance, and RMSE metrics, especially in cases of pronounced nonlinearity. The model demonstrates trajectory smoothness that is better than or comparable to alternatives and robustness to changes in data structure, while maintaining high interpretability due to explicit parameterization via control points. The developed approach represents a balanced solution that combines geometric accuracy, computational practicality, and interpretability. Prospects for further research include extending the method to non-Gaussian distributions, applying entropy regularization to speed up computations, and adapting the approach to working with high-dimensional data for approximating surfaces and more complex structures.
☆ Hybrid Physical-Neural Simulator for Fast Cosmological Hydrodynamics NeurIPS 2025
Cosmological field-level inference requires differentiable forward models that solve the challenging dynamics of gas and dark matter under hydrodynamics and gravity. We propose a hybrid approach where gravitational forces are computed using a differentiable particle-mesh solver, while the hydrodynamics are parametrized by a neural network that maps local quantities to an effective pressure field. We demonstrate that our method improves upon alternative approaches, such as an Enthalpy Gradient Descent baseline, both at the field and summary-statistic level. The approach is furthermore highly data efficient, with a single reference simulation of cosmological structure formation being sufficient to constrain the neural pressure model. This opens the door for future applications where the model is fit directly to observational data, rather than a training set of simulations.
comment: Accepted to the NeurIPS 2025 Workshop on Machine Learning and the Physical Sciences
☆ Physics-Informed Mixture Models and Surrogate Models for Precision Additive Manufacturing
In this study, we leverage a mixture model learning approach to identify defects in laser-based Additive Manufacturing (AM) processes. By incorporating physics based principles, we also ensure that the model is sensitive to meaningful physical parameter variations. The empirical evaluation was conducted by analyzing real-world data from two AM processes: Directed Energy Deposition and Laser Powder Bed Fusion. In addition, we also studied the performance of the developed framework over public datasets with different alloy type and experimental parameter information. The results show the potential of physics-guided mixture models to examine the underlying physical behavior of an AM system.
comment: Five pages, four figures, to be presented at the AI in Science Summit, Denmark, November, 2025
☆ Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding emerges as a solution, enabling the simultaneous generation and validation of multiple tokens. While recent approaches like EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures, they often neglect the impact of crucial system variables such as GPU devices and batch sizes. Therefore, we introduce a new dynamic tree decoding approach called CAST that takes into account inference costs, including factors such as GPU configurations and batch sizes, to dynamically refine the tree structure. Through comprehensive experimentation across six diverse tasks and utilizing six distinct LLMs, our methodology demonstrates remarkable results, achieving speeds up to 5.2 times faster than conventional decoding methods. Moreover, it generally outperforms existing state-of-the-art techniques from 5% to 20%.
☆ Multiclass Local Calibration With the Jensen-Shannon Distance
Developing trustworthy Machine Learning (ML) models requires their predicted probabilities to be well-calibrated, meaning they should reflect true-class frequencies. Among calibration notions in multiclass classification, strong calibration is the most stringent, as it requires all predicted probabilities to be simultaneously calibrated across all classes. However, existing approaches to multiclass calibration lack a notion of distance among inputs, which makes them vulnerable to proximity bias: predictions in sparse regions of the feature space are systematically miscalibrated. This is especially relevant in high-stakes settings, such as healthcare, where the sparse instances are exactly those most at risk of biased treatment. In this work, we address this main shortcoming by introducing a local perspective on multiclass calibration. First, we formally define multiclass local calibration and establish its relationship with strong calibration. Second, we theoretically analyze the pitfalls of existing evaluation metrics when applied to multiclass local calibration. Third, we propose a practical method for enhancing local calibration in Neural Networks, which enforces alignment between predicted probabilities and local estimates of class frequencies using the Jensen-Shannon distance. Finally, we empirically validate our approach against existing multiclass calibration techniques.
☆ On Measuring Localization of Shortcuts in Deep Networks
Shortcuts, spurious rules that perform well during training but fail to generalize, present a major challenge to the reliability of deep networks (Geirhos et al., 2020). However, the impact of shortcuts on feature representations remains understudied, obstructing the design of principled shortcut-mitigation methods. To overcome this limitation, we investigate the layer-wise localization of shortcuts in deep models. Our novel experiment design quantifies the layer-wise contribution to accuracy degradation caused by a shortcut-inducing skew by counterfactual training on clean and skewed datasets. We employ our design to study shortcuts on CIFAR-10, Waterbirds, and CelebA datasets across VGG, ResNet, DeiT, and ConvNeXt architectures. We find that shortcut learning is not localized in specific layers but distributed throughout the network. Different network parts play different roles in this process: shallow layers predominantly encode spurious features, while deeper layers predominantly forget core features that are predictive on clean data. We also analyze the differences in localization and describe its principal axes of variation. Finally, our analysis of layer-wise shortcut-mitigation strategies suggests the hardness of designing general methods, supporting dataset- and architecture-specific approaches instead.
☆ Boosted Trees on a Diet: Compact Models for Resource-Constrained Devices
Deploying machine learning models on compute-constrained devices has become a key building block of modern IoT applications. In this work, we present a compression scheme for boosted decision trees, addressing the growing need for lightweight machine learning models. Specifically, we provide techniques for training compact boosted decision tree ensembles that exhibit a reduced memory footprint by rewarding, among other things, the reuse of features and thresholds during training. Our experimental evaluation shows that models achieved the same performance with a compression ratio of 4-16x compared to LightGBM models using an adapted training process and an alternative memory layout. Once deployed, the corresponding IoT devices can operate independently of constant communication or external energy supply, and, thus, autonomously, requiring only minimal computing power and energy. This capability opens the door to a wide range of IoT applications, including remote monitoring, edge analytics, and real-time decision making in isolated or power-limited environments.
☆ Adaptive Inverse Kinematics Framework for Learning Variable-Length Tool Manipulation in Robotics
Conventional robots possess a limited understanding of their kinematics and are confined to preprogrammed tasks, hindering their ability to leverage tools efficiently. Driven by the essential components of tool usage - grasping the desired outcome, selecting the most suitable tool, determining optimal tool orientation, and executing precise manipulations - we introduce a pioneering framework. Our novel approach expands the capabilities of the robot's inverse kinematics solver, empowering it to acquire a sequential repertoire of actions using tools of varying lengths. By integrating a simulation-learned action trajectory with the tool, we showcase the practicality of transferring acquired skills from simulation to real-world scenarios through comprehensive experimentation. Remarkably, our extended inverse kinematics solver demonstrates an impressive error rate of less than 1 cm. Furthermore, our trained policy achieves a mean error of 8 cm in simulation. Noteworthy, our model achieves virtually indistinguishable performance when employing two distinct tools of different lengths. This research provides an indication of potential advances in the exploration of all four fundamental aspects of tool usage, enabling robots to master the intricate art of tool manipulation across diverse tasks.
comment: 10 pages, 5 figures. Demonstrates a reinforcement learning framework for adaptive tool manipulation with variable-length extensions
☆ The Structure of Relation Decoding Linear Operators in Large Language Models NeurIPS 2025
This paper investigates the structure of linear operators introduced in Hernandez et al. [2023] that decode specific relational facts in transformer language models. We extend their single-relation findings to a collection of relations and systematically chart their organization. We show that such collections of relation decoders can be highly compressed by simple order-3 tensor networks without significant loss in decoding accuracy. To explain this surprising redundancy, we develop a cross-evaluation protocol, in which we apply each linear decoder operator to the subjects of every other relation. Our results reveal that these linear maps do not encode distinct relations, but extract recurring, coarse-grained semantic properties (e.g., country of capital city and country of food are both in the country-of-X property). This property-centric structure clarifies both the operators' compressibility and highlights why they generalize only to new relations that are semantically close. Our findings thus interpret linear relational decoding in transformer language models as primarily property-based, rather than relation-specific.
comment: NeurIPS 2025 (Spotlight)
☆ A Three-Stage Bayesian Transfer Learning Framework to Improve Predictions in Data-Scarce Domains
The use of ML in engineering has grown steadily to support a wide array of applications. Among these methods, deep neural networks have been widely adopted due to their performance and accessibility, but they require large, high-quality datasets. Experimental data are often sparse, noisy, or insufficient to build resilient data-driven models. Transfer learning, which leverages relevant data-abundant source domains to assist learning in data-scarce target domains, has shown efficacy. Parameter transfer, where pretrained weights are reused, is common but degrades under large domain shifts. Domain-adversarial neural networks (DANNs) help address this issue by learning domain-invariant representations, thereby improving transfer under greater domain shifts in a semi-supervised setting. However, DANNs can be unstable during training and lack a native means for uncertainty quantification. This study introduces a fully-supervised three-stage framework, the staged Bayesian domain-adversarial neural network (staged B-DANN), that combines parameter transfer and shared latent space adaptation. In Stage 1, a deterministic feature extractor is trained on the source domain. This feature extractor is then adversarially refined using a DANN in Stage 2. In Stage 3, a Bayesian neural network is built on the adapted feature extractor for fine-tuning on the target domain to handle conditional shifts and yield calibrated uncertainty estimates. This staged B-DANN approach was first validated on a synthetic benchmark, where it was shown to significantly outperform standard transfer techniques. It was then applied to the task of predicting critical heat flux in rectangular channels, leveraging data from tube experiments as the source domain. The results of this study show that the staged B-DANN method can improve predictive accuracy and generalization, potentially assisting other domains in nuclear engineering.
comment: Submitted to Engineering Applications of Artificial Intelligence
☆ Higher-Order Regularization Learning on Hypergraphs
Higher-Order Hypergraph Learning (HOHL) was recently introduced as a principled alternative to classical hypergraph regularization, enforcing higher-order smoothness via powers of multiscale Laplacians induced by the hypergraph structure. Prior work established the well- and ill-posedness of HOHL through an asymptotic consistency analysis in geometric settings. We extend this theoretical foundation by proving the consistency of a truncated version of HOHL and deriving explicit convergence rates when HOHL is used as a regularizer in fully supervised learning. We further demonstrate its strong empirical performance in active learning and in datasets lacking an underlying geometric structure, highlighting HOHL's versatility and robustness across diverse learning settings.
☆ Polybasic Speculative Decoding Through a Theoretical Perspective
Inference latency stands as a critical bottleneck in the large-scale deployment of Large Language Models (LLMs). Speculative decoding methods have recently shown promise in accelerating inference without compromising the output distribution. However, existing work typically relies on a dualistic draft-verify framework and lacks rigorous theoretical grounding. In this paper, we introduce a novel \emph{polybasic} speculative decoding framework, underpinned by a comprehensive theoretical analysis. Specifically, we prove a fundamental theorem that characterizes the optimal inference time for multi-model speculative decoding systems, shedding light on how to extend beyond the dualistic approach to a more general polybasic paradigm. Through our theoretical investigation of multi-model token generation, we expose and optimize the interplay between model capabilities, acceptance lengths, and overall computational cost. Our framework supports both standalone implementation and integration with existing speculative techniques, leading to accelerated performance in practice. Experimental results across multiple model families demonstrate that our approach yields speedup ratios ranging from $3.31\times$ to $4.01\times$ for LLaMA2-Chat 7B, up to $3.87 \times$ for LLaMA3-8B, up to $4.43 \times$ for Vicuna-7B and up to $3.85 \times$ for Qwen2-7B -- all while preserving the original output distribution. We release our theoretical proofs and implementation code to facilitate further investigation into polybasic speculative decoding.
☆ Think Outside the Policy: In-Context Steered Policy Optimization
Existing Reinforcement Learning from Verifiable Rewards (RLVR) methods, such as Group Relative Policy Optimization (GRPO), have achieved remarkable progress in improving the reasoning capabilities of Large Reasoning Models (LRMs). However, they exhibit limited exploration due to reliance on on-policy rollouts where confined to the current policy's distribution, resulting in narrow trajectory diversity. Recent approaches attempt to expand policy coverage by incorporating trajectories generated from stronger expert models, yet this reliance increases computational cost and such advaned models are often inaccessible. To address these issues, we propose In-Context Steered Policy Optimization (ICPO), a unified framework that leverages the inherent in-context learning capability of LRMs to provide expert guidance using existing datasets. ICPO introduces Mixed-Policy GRPO with Implicit Expert Forcing, which expands exploration beyond the current policy distribution without requiring advanced LRM trajectories. To further stabilize optimization, ICPO integrates Expert Region Reject Sampling to filter unreliable off-policy trajectories and Annealed Expert-Bonus Reward Shaping to balance early expert guidance with later autonomous improvement. Results demonstrate that ICPO consistently enhances reinforcement learning performance and training stability on mathematical reasoning benchmarks, revealing a scalable and effective RLVR paradigm for LRMs.
comment: Work in progress
☆ Inside CORE-KG: Evaluating Structured Prompting and Coreference Resolution for Knowledge Graphs ICDM 2025
Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer critical insights but are often unstructured, lexically dense, and filled with ambiguous or shifting references, which pose significant challenges for automated knowledge graph (KG) construction. While recent LLM-based approaches improve over static templates, they still generate noisy, fragmented graphs with duplicate nodes due to the absence of guided extraction and coreference resolution. The recently proposed CORE-KG framework addresses these limitations by integrating a type-aware coreference module and domain-guided structured prompts, significantly reducing node duplication and legal noise. In this work, we present a systematic ablation study of CORE-KG to quantify the individual contributions of its two key components. Our results show that removing coreference resolution results in a 28.32% increase in node duplication and a 4.32% increase in noisy nodes, while removing structured prompts leads to a 4.34% increase in node duplication and a 73.33% increase in noisy nodes. These findings offer empirical insights for designing robust LLM-based pipelines for extracting structured representations from complex legal texts.
comment: ICDM 2025 Workshop
☆ LLMs as In-Context Meta-Learners for Model and Hyperparameter Selection
Model and hyperparameter selection are critical but challenging in machine learning, typically requiring expert intuition or expensive automated search. We investigate whether large language models (LLMs) can act as in-context meta-learners for this task. By converting each dataset into interpretable metadata, we prompt an LLM to recommend both model families and hyperparameters. We study two prompting strategies: (1) a zero-shot mode relying solely on pretrained knowledge, and (2) a meta-informed mode augmented with examples of models and their performance on past tasks. Across synthetic and real-world benchmarks, we show that LLMs can exploit dataset metadata to recommend competitive models and hyperparameters without search, and that improvements from meta-informed prompting demonstrate their capacity for in-context meta-learning. These results highlight a promising new role for LLMs as lightweight, general-purpose assistants for model selection and hyperparameter optimization.
comment: 27 pages, 6 figures
☆ Enhancing ECG Classification Robustness with Lightweight Unsupervised Anomaly Detection Filters
Continuous electrocardiogram (ECG) monitoring via wearables offers significant potential for early cardiovascular disease (CVD) detection. However, deploying deep learning models for automated analysis in resource-constrained environments faces reliability challenges due to inevitable Out-of-Distribution (OOD) data. OOD inputs, such as unseen pathologies or noisecorrupted signals, often cause erroneous, high-confidence predictions by standard classifiers, compromising patient safety. Existing OOD detection methods either neglect computational constraints or address noise and unseen classes separately. This paper explores Unsupervised Anomaly Detection (UAD) as an independent, upstream filtering mechanism to improve robustness. We benchmark six UAD approaches, including Deep SVDD, reconstruction-based models, Masked Anomaly Detection, normalizing flows, and diffusion models, optimized via Neural Architecture Search (NAS) under strict resource constraints (at most 512k parameters). Evaluation on PTB-XL and BUT QDB datasets assessed detection of OOD CVD classes and signals unsuitable for analysis due to noise. Results show Deep SVDD consistently achieves the best trade-off between detection and efficiency. In a realistic deployment simulation, integrating the optimized Deep SVDD filter with a diagnostic classifier improved accuracy by up to 21 percentage points over a classifier-only baseline. This study demonstrates that optimized UAD filters can safeguard automated ECG analysis, enabling safer, more reliable continuous cardiovascular monitoring on wearables.
comment: Submitted to the 24th International Conference on Pervasive Computing and Communications (PerCom 2026)
☆ Data-Efficient RLVR via Off-Policy Influence Guidance
Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic-based, lacking theoretical guarantees and generalizability. This work proposes a theoretically-grounded approach using influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop \textbf{C}urriculum \textbf{R}L with \textbf{O}ff-\textbf{P}olicy \text{I}nfluence guidance (\textbf{CROPI}), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training. On a 1.5B model, it achieves a 2.66x step-level acceleration while using only 10\% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
☆ Quantum Gated Recurrent GAN with Gaussian Uncertainty for Network Anomaly Detection
Anomaly detection in time-series data is a critical challenge with significant implications for network security. Recent quantum machine learning approaches, such as quantum kernel methods and variational quantum circuits, have shown promise in capturing complex data distributions for anomaly detection but remain constrained by limited qubit counts. We introduce in this work a novel Quantum Gated Recurrent Unit (QGRU)-based Generative Adversarial Network (GAN) employing Successive Data Injection (SuDaI) and a multi-metric gating strategy for robust network anomaly detection. Our model uniquely utilizes a quantum-enhanced generator that outputs parameters (mean and log-variance) of a Gaussian distribution via reparameterization, combined with a Wasserstein critic to stabilize adversarial training. Anomalies are identified through a novel gating mechanism that initially flags potential anomalies based on Gaussian uncertainty estimates and subsequently verifies them using a composite of critic scores and reconstruction errors. Evaluated on benchmark datasets, our method achieves a high time-series aware F1 score (TaF1) of 89.43% demonstrating superior capability in detecting anomalies accurately and promptly as compared to existing classical and quantum models. Furthermore, the trained QGRU-WGAN was deployed on real IBM Quantum hardware, where it retained high anomaly detection performance, confirming its robustness and practical feasibility on current noisy intermediate-scale quantum (NISQ) devices.
☆ LINK-KG: LLM-Driven Coreference-Resolved Knowledge Graphs for Human Smuggling Networks
Human smuggling networks are complex and constantly evolving, making them difficult to analyze comprehensively. Legal case documents offer rich factual and procedural insights into these networks but are often long, unstructured, and filled with ambiguous or shifting references, posing significant challenges for automated knowledge graph (KG) construction. Existing methods either overlook coreference resolution or fail to scale beyond short text spans, leading to fragmented graphs and inconsistent entity linking. We propose LINK-KG, a modular framework that integrates a three-stage, LLM-guided coreference resolution pipeline with downstream KG extraction. At the core of our approach is a type-specific Prompt Cache, which consistently tracks and resolves references across document chunks, enabling clean and disambiguated narratives for structured knowledge graph construction from both short and long legal texts. LINK-KG reduces average node duplication by 45.21% and noisy nodes by 32.22% compared to baseline methods, resulting in cleaner and more coherent graph structures. These improvements establish LINK-KG as a strong foundation for analyzing complex criminal networks.
comment: Accepted in ICKG 2025 Conference, 8 Pages, 2 Figures
☆ ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems
Adapting large language models (LLMs) via reinforcement learning (RL) is often bottlenecked by the generation stage, which can consume over 75\% of the training time. Speculative decoding (SD) accelerates autoregressive generation in serving systems, but its behavior under RL training remains largely unexplored. We identify three critical gaps that hinder the naive integration of SD into RL systems: diminishing speedups at large batch sizes, drafter staleness under continual actor updates, and drafter-induced policy degradation. To address these gaps, we present ReSpec, a system that adapts SD to RL through three complementary mechanisms: dynamically tuning SD configurations, evolving the drafter via knowledge distillation, and weighting updates by rollout rewards. On Qwen models (3B--14B), ReSpec achieves up to 4.5x speedup while preserving reward convergence and training stability, providing a practical solution for efficient RL-based LLM adaptation.
☆ Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (i.e., tail data). This leads to an imbalanced optimization that drives the model to prioritize simple reasoning skills, while hindering its ability to tackle more complex reasoning tasks. Over iterations, this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew effect"--which ultimately hinders further model improvement and leads to performance bottlenecks. To counteract this challenge, we introduce four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks demonstrate that our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
comment: Preprint
☆ Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
☆ Vectorized Context-Aware Embeddings for GAT-Based Collaborative Filtering
Recommender systems often struggle with data sparsity and cold-start scenarios, limiting their ability to provide accurate suggestions for new or infrequent users. This paper presents a Graph Attention Network (GAT) based Collaborative Filtering (CF) framework enhanced with Large Language Model (LLM) driven context aware embeddings. Specifically, we generate concise textual user profiles and unify item metadata (titles, genres, overviews) into rich textual embeddings, injecting these as initial node features in a bipartite user item graph. To further optimize ranking performance, we introduce a hybrid loss function that combines Bayesian Personalized Ranking (BPR) with a cosine similarity term and robust negative sampling, ensuring explicit negative feedback is distinguished from unobserved data. Experiments on the MovieLens 100k and 1M datasets show consistent improvements over state-of-the-art baselines in Precision, NDCG, and MAP while demonstrating robustness for users with limited interaction history. Ablation studies confirm the critical role of LLM-augmented embeddings and the cosine similarity term in capturing nuanced semantic relationships. Our approach effectively mitigates sparsity and cold-start limitations by integrating LLM-derived contextual understanding into graph-based architectures. Future directions include balancing recommendation accuracy with coverage and diversity, and introducing fairness-aware constraints and interpretability features to enhance system performance further.
☆ Robust Graph Condensation via Classification Complexity Mitigation
Graph condensation (GC) has gained significant attention for its ability to synthesize smaller yet informative graphs. However, existing studies often overlook the robustness of GC in scenarios where the original graph is corrupted. In such cases, we observe that the performance of GC deteriorates significantly, while existing robust graph learning technologies offer only limited effectiveness. Through both empirical investigation and theoretical analysis, we reveal that GC is inherently an intrinsic-dimension-reducing process, synthesizing a condensed graph with lower classification complexity. Although this property is critical for effective GC performance, it remains highly vulnerable to adversarial perturbations. To tackle this vulnerability and improve GC robustness, we adopt the geometry perspective of graph data manifold and propose a novel Manifold-constrained Robust Graph Condensation framework named MRGC. Specifically, we introduce three graph data manifold learning modules that guide the condensed graph to lie within a smooth, low-dimensional manifold with minimal class ambiguity, thereby preserving the classification complexity reduction capability of GC and ensuring robust performance under universal adversarial attacks. Extensive experiments demonstrate the robustness of \ModelName\ across diverse attack scenarios.
☆ Personalized Treatment Outcome Prediction from Scarce Data via Dual-Channel Knowledge Distillation and Adaptive Fusion
Personalized treatment outcome prediction based on trial data for small-sample and rare patient groups is critical in precision medicine. However, the costly trial data limit the prediction performance. To address this issue, we propose a cross-fidelity knowledge distillation and adaptive fusion network (CFKD-AFN), which leverages abundant but low-fidelity simulation data to enhance predictions on scarce but high-fidelity trial data. CFKD-AFN incorporates a dual-channel knowledge distillation module to extract complementary knowledge from the low-fidelity model, along with an attention-guided fusion module to dynamically integrate multi-source information. Experiments on treatment outcome prediction for the chronic obstructive pulmonary disease demonstrates significant improvements of CFKD-AFN over state-of-the-art methods in prediction accuracy, ranging from 6.67\% to 74.55\%, and strong robustness to varying high-fidelity dataset sizes. Furthermore, we extend CFKD-AFN to an interpretable variant, enabling the exploration of latent medical semantics to support clinical decision-making.
☆ Co-Evolving Latent Action World Models
Adapting pre-trained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamic model in LAM with a powerful world model and training them jointly, but it is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with the pre-trained world model. This unlocks a co-evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA-World matches or outperforms prior two-stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.
☆ Autograder+: A Multi-Faceted AI Framework for Rich Pedagogical Feedback in Programming Education
The rapid growth of programming education has outpaced traditional assessment tools, leaving faculty with limited means to provide meaningful, scalable feedback. Conventional autograders, while efficient, act as black-box systems that simply return pass/fail results, offering little insight into student thinking or learning needs. Autograder+ is designed to shift autograding from a purely summative process to a formative learning experience. It introduces two key capabilities: automated feedback generation using a fine-tuned Large Language Model, and visualization of student code submissions to uncover learning patterns. The model is fine-tuned on curated student code and expert feedback to ensure pedagogically aligned, context-aware guidance. In evaluation across 600 student submissions from multiple programming tasks, the system produced feedback with strong semantic alignment to instructor comments. For visualization, contrastively learned code embeddings trained on 1,000 annotated submissions enable grouping solutions into meaningful clusters based on functionality and approach. The system also supports prompt-pooling, allowing instructors to guide feedback style through selected prompt templates. By integrating AI-driven feedback, semantic clustering, and interactive visualization, Autograder+ reduces instructor workload while supporting targeted instruction and promoting stronger learning outcomes.
☆ Multi-Output Robust and Conjugate Gaussian Processes
Multi-output Gaussian process (MOGP) regression allows modelling dependencies among multiple correlated response variables. Similarly to standard Gaussian processes, MOGPs are sensitive to model misspecification and outliers, which can distort predictions within individual outputs. This situation can be further exacerbated by multiple anomalous response variables whose errors propagate due to correlations between outputs. To handle this situation, we extend and generalise the robust and conjugate Gaussian process (RCGP) framework introduced by Altamirano et al. (2024). This results in the multi-output RCGP (MO-RCGP): a provably robust MOGP that is conjugate, and jointly captures correlations across outputs. We thoroughly evaluate our approach through applications in finance and cancer research.
☆ Multi-Task Learning Based on Support Vector Machines and Twin Support Vector Machines: A Comprehensive Survey
Multi-task learning (MTL) enables simultaneous training across related tasks, leveraging shared information to improve generalization, efficiency, and robustness, especially in data-scarce or high-dimensional scenarios. While deep learning dominates recent MTL research, Support Vector Machines (SVMs) and Twin SVMs (TWSVMs) remain relevant due to their interpretability, theoretical rigor, and effectiveness with small datasets. This chapter surveys MTL approaches based on SVM and TWSVM, highlighting shared representations, task regularization, and structural coupling strategies. Special attention is given to emerging TWSVM extensions for multi-task settings, which show promise but remain underexplored. We compare these models in terms of theoretical properties, optimization strategies, and empirical performance, and discuss applications in fields such as computer vision, natural language processing, and bioinformatics. Finally, we identify research gaps and outline future directions for building scalable, interpretable, and reliable margin-based MTL frameworks. This work provides a comprehensive resource for researchers and practitioners interested in SVM- and TWSVM-based multi-task learning.
☆ Adaptive Context Length Optimization with Low-Frequency Truncation for Multi-Agent Reinforcement Learning
Recently, deep multi-agent reinforcement learning (MARL) has demonstrated promising performance for solving challenging tasks, such as long-term dependencies and non-Markovian environments. Its success is partly attributed to conditioning policies on large fixed context length. However, such large fixed context lengths may lead to limited exploration efficiency and redundant information. In this paper, we propose a novel MARL framework to obtain adaptive and effective contextual information. Specifically, we design a central agent that dynamically optimizes context length via temporal gradient analysis, enhancing exploration to facilitate convergence to global optima in MARL. Furthermore, to enhance the adaptive optimization capability of the context length, we present an efficient input representation for the central agent, which effectively filters redundant information. By leveraging a Fourier-based low-frequency truncation method, we extract global temporal trends across decentralized agents, providing an effective and efficient representation of the MARL environment. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance on long-term dependency tasks, including PettingZoo, MiniGrid, Google Research Football (GRF), and StarCraft Multi-Agent Challenge v2 (SMACv2).
☆ Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings
The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmarking items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks (`cold-start'), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we challenge this paradigm and propose a item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves, rather than on model-specific failure patterns. We instantiate this item-centric efficient benchmarking approach via a novel method, Scales++, where data selection is based on the cognitive demands of the benchmark samples. Empirically, we show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. On the Open LLM Leaderboard, using just a 0.5\% data subset, we predict full benchmark scores with a 2.9% mean absolute error. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation, while also providing better cold-start performance and more interpretable benchmarking.
comment: 9 pages, 2 figures, 4 tables
☆ Efficient Generative AI Boosts Probabilistic Forecasting of Sudden Stratospheric Warmings
Sudden Stratospheric Warmings (SSWs) are key sources of subseasonal predictability and major drivers of extreme winter weather. Yet, their accurate and efficient forecast remains a persistent challenge for numerical weather prediction (NWP) systems due to limitations in physical representation, initialization, and the immense computational demands of ensemble forecasts. While data-driven forecasting is rapidly evolving, its application to the complex, three-dimensional dynamics of SSWs, particularly for probabilistic forecast, remains underexplored. Here, we bridge this gap by developing a Flow Matching-based generative AI model (FM-Cast) for efficient and skillful probabilistic forecasting of the spatiotemporal evolution of stratospheric circulation. Evaluated across 18 major SSW events (1998-2024), FM-Cast skillfully forecasts the onset, intensity, and morphology of 10 events up to 20 days in advance, achieving ensemble accuracies above 50%. Its performance is comparable to or exceeds leading NWP systems while requiring only two minutes for a 50-member, 30-day forecast on a consumer GPU. Furthermore, leveraging FM-Cast as a scientific tool, we demonstrate through idealized experiments that SSW predictability is fundamentally linked to its underlying physical drivers, distinguishing between events forced from the troposphere and those driven by internal stratospheric dynamics. Our work thus establishes a computationally efficient paradigm for probabilistic forecasting stratospheric anomalies and showcases generative AI's potential to deepen the physical understanding of atmosphere-climate dynamics.
☆ CorVS: Person Identification via Video Trajectory-Sensor Correspondence in a Real-World Warehouse
Worker location data is key to higher productivity in industrial sites. Cameras are a promising tool for localization in logistics warehouses since they also offer valuable environmental contexts such as package status. However, identifying individuals with only visual data is often impractical. Accordingly, several prior studies identified people in videos by comparing their trajectories and wearable sensor measurements. While this approach has advantages such as independence from appearance, the existing methods may break down under real-world conditions. To overcome this challenge, we propose CorVS, a novel data-driven person identification method based on correspondence between visual tracking trajectories and sensor measurements. Firstly, our deep learning model predicts correspondence probabilities and reliabilities for every pair of a trajectory and sensor measurements. Secondly, our algorithm matches the trajectories and sensor measurements over time using the predicted probabilities and reliabilities. We developed a dataset with actual warehouse operations and demonstrated the method's effectiveness for real-world applications.
comment: 7 pages, 3 figures, accepted to IPIN 2025
☆ Towards Explainable and Reliable AI in Finance
Financial forecasting increasingly uses large neural network models, but their opacity raises challenges for trust and regulatory compliance. We present several approaches to explainable and reliable AI in finance. \emph{First}, we describe how Time-LLM, a time series foundation model, uses a prompt to avoid a wrong directional forecast. \emph{Second}, we show that combining foundation models for time series forecasting with a reliability estimator can filter our unreliable predictions. \emph{Third}, we argue for symbolic reasoning encoding domain rules for transparent justification. These approaches shift emphasize executing only forecasts that are both reliable and explainable. Experiments on equity and cryptocurrency data show that the architecture reduces false positives and supports selective execution. By integrating predictive performance with reliability estimation and rule-based reasoning, our framework advances transparent and auditable financial AI systems.
☆ UnifiedFL: A Dynamic Unified Learning Framework for Equitable Federation
Federated learning (FL) has emerged as a key paradigm for collaborative model training across multiple clients without sharing raw data, enabling privacy-preserving applications in areas such as radiology and pathology. However, works on collaborative training across clients with fundamentally different neural architectures and non-identically distributed datasets remain scarce. Existing FL frameworks face several limitations. Despite claiming to support architectural heterogeneity, most recent FL methods only tolerate variants within a single model family (e.g., shallower, deeper, or wider CNNs), still presuming a shared global architecture and failing to accommodate federations where clients deploy fundamentally different network types (e.g., CNNs, GNNs, MLPs). Moreover, existing approaches often address only statistical heterogeneity while overlooking the domain-fracture problem, where each client's data distribution differs markedly from that faced at testing time, undermining model generalizability. When clients use different architectures, have non-identically distributed data, and encounter distinct test domains, current methods perform poorly. To address these challenges, we propose UnifiedFL, a dynamic federated learning framework that represents heterogeneous local networks as nodes and edges in a directed model graph optimized by a shared graph neural network (GNN). UnifiedFL introduces (i) a common GNN to parameterize all architectures, (ii) distance-driven clustering via Euclidean distances between clients' parameters, and (iii) a two-tier aggregation policy balancing convergence and diversity. Experiments on MedMNIST classification and hippocampus segmentation benchmarks demonstrate UnifiedFL's superior performance. Code and data: https://github.com/basiralab/UnifiedFL
☆ Reinforcement Learning for Pollution Detection in a Randomized, Sparse and Nonstationary Environment with an Autonomous Underwater Vehicle
Reinforcement learning (RL) algorithms are designed to optimize problem-solving by learning actions that maximize rewards, a task that becomes particularly challenging in random and nonstationary environments. Even advanced RL algorithms are often limited in their ability to solve problems in these conditions. In applications such as searching for underwater pollution clouds with autonomous underwater vehicles (AUVs), RL algorithms must navigate reward-sparse environments, where actions frequently result in a zero reward. This paper aims to address these challenges by revisiting and modifying classical RL approaches to efficiently operate in sparse, randomized, and nonstationary environments. We systematically study a large number of modifications, including hierarchical algorithm changes, multigoal learning, and the integration of a location memory as an external output filter to prevent state revisits. Our results demonstrate that a modified Monte Carlo-based approach significantly outperforms traditional Q-learning and two exhaustive search patterns, illustrating its potential in adapting RL to complex environments. These findings suggest that reinforcement learning approaches can be effectively adapted for use in random, nonstationary, and reward-sparse environments.
☆ MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data
Health-related misinformation is very prevalent and potentially harmful. It is difficult to identify, especially when claims distort or misinterpret scientific findings. We investigate the impact of synthetic data generation and lightweight fine-tuning techniques on the ability of large language models (LLMs) to recognize fallacious arguments using the MISSCI dataset and framework. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples, which are then used to fine-tune an LLM model. Our results show substantial accuracy gains with fine-tuned models compared to vanilla baselines. For instance, the LLaMA 3.1 8B fine-tuned model achieved an over 35% F1-score absolute improvement on the MISSCI test split over its vanilla baseline. We demonstrate that introducing synthetic fallacy data to augment limited annotated resources can significantly enhance zero-shot LLM classification performance on real-world scientific misinformation tasks, even with limited computational resources. The code and synthetic dataset are available on https://github.com/mxpoliakov/MisSynth.
☆ Linear Causal Discovery with Interventional Constraints
Incorporating causal knowledge and mechanisms is essential for refining causal models and improving downstream tasks such as designing new treatments. In this paper, we introduce a novel concept in causal discovery, termed interventional constraints, which differs fundamentally from interventional data. While interventional data require direct perturbations of variables, interventional constraints encode high-level causal knowledge in the form of inequality constraints on causal effects. For instance, in the Sachs dataset (Sachs et al.\ 2005), Akt has been shown to be activated by PIP3, meaning PIP3 exerts a positive causal effect on Akt. Existing causal discovery methods allow enforcing structural constraints (for example, requiring a causal path from PIP3 to Akt), but they may still produce incorrect causal conclusions such as learning that "PIP3 inhibits Akt". Interventional constraints bridge this gap by explicitly constraining the total causal effect between variable pairs, ensuring learned models respect known causal influences. To formalize interventional constraints, we propose a metric to quantify total causal effects for linear causal models and formulate the problem as a constrained optimization task, solved using a two-stage constrained optimization method. We evaluate our approach on real-world datasets and demonstrate that integrating interventional constraints not only improves model accuracy and ensures consistency with established findings, making models more explainable, but also facilitates the discovery of new causal relationships that would otherwise be costly to identify.
☆ SABER: Symbolic Regression-based Angle of Arrival and Beam Pattern Estimator
Accurate Angle-of-arrival (AoA) estimation is essential for next-generation wireless communication systems to enable reliable beamforming, high-precision localization, and integrated sensing. Unfortunately, classical high-resolution techniques require multi-element arrays and extensive snapshot collection, while generic Machine Learning (ML) approaches often yield black-box models that lack physical interpretability. To address these limitations, we propose a Symbolic Regression (SR)-based ML framework. Namely, Symbolic Regression-based Angle of Arrival and Beam Pattern Estimator (SABER), a constrained symbolic-regression framework that automatically discovers closed-form beam pattern and AoA models from path loss measurements with interpretability. SABER achieves high accuracy while bridging the gap between opaque ML methods and interpretable physics-driven estimators. First, we validate our approach in a controlled free-space anechoic chamber, showing that both direct inversion of the known $\cos^n$ beam and a low-order polynomial surrogate achieve sub-0.5 degree Mean Absolute Error (MAE). A purely unconstrained SR method can further reduce the error of the predicted angles, but produces complex formulas that lack physical insight. Then, we implement the same SR-learned inversions in a real-world, Reconfigurable Intelligent Surface (RIS)-aided indoor testbed. SABER and unconstrained SR models accurately recover the true AoA with near-zero error. Finally, we benchmark SABER against the Cram\'er-Rao Lower Bounds (CRLBs). Our results demonstrate that SABER is an interpretable and accurate alternative to state-of-the-art and black-box ML-based methods for AoA estimation.
comment: 12 pages, 11 figures
☆ Agent Skills Enable a New Class of Realistic and Trivially Simple Prompt Injections
Enabling continual learning in LLMs remains a key unresolved research challenge. In a recent announcement, a frontier LLM company made a step towards this by introducing Agent Skills, a framework that equips agents with new knowledge based on instructions stored in simple markdown files. Although Agent Skills can be a very useful tool, we show that they are fundamentally insecure, since they enable trivially simple prompt injections. We demonstrate how to hide malicious instructions in long Agent Skill files and referenced scripts to exfiltrate sensitive data, such as internal files or passwords. Importantly, we show how to bypass system-level guardrails of a popular coding agent: a benign, task-specific approval with the "Don't ask again" option can carry over to closely related but harmful actions. Overall, we conclude that despite ongoing research efforts and scaling model capabilities, frontier LLMs remain vulnerable to very simple prompt injections in realistic scenarios. Our code is available at https://github.com/aisa-group/promptinject-agent-skills.
☆ Posterior Sampling by Combining Diffusion Models with Annealed Langevin Dynamics NeurIPS 2025
Given a noisy linear measurement $y = Ax + \xi$ of a distribution $p(x)$, and a good approximation to the prior $p(x)$, when can we sample from the posterior $p(x \mid y)$? Posterior sampling provides an accurate and fair framework for tasks such as inpainting, deblurring, and MRI reconstruction, and several heuristics attempt to approximate it. Unfortunately, approximate posterior sampling is computationally intractable in general. To sidestep this hardness, we focus on (local or global) log-concave distributions $p(x)$. In this regime, Langevin dynamics yields posterior samples when the exact scores of $p(x)$ are available, but it is brittle to score--estimation error, requiring an MGF bound (sub-exponential error). By contrast, in the unconditional setting, diffusion models succeed with only an $L^2$ bound on the score error. We prove that combining diffusion models with an annealed variant of Langevin dynamics achieves conditional sampling in polynomial time using merely an $L^4$ bound on the score error.
comment: NeurIPS 2025
☆ On the Impact of Weight Discretization in QUBO-Based SVM Training ECML
Training Support Vector Machines (SVMs) can be formulated as a QUBO problem, enabling the use of quantum annealing for model optimization. In this work, we study how the number of qubits - linked to the discretization level of dual weights - affects predictive performance across datasets. We compare QUBO-based SVM training to the classical LIBSVM solver and find that even low-precision QUBO encodings (e.g., 1 bit per parameter) yield competitive, and sometimes superior, accuracy. While increased bit-depth enables larger regularization parameters, it does not always improve classification. Our findings suggest that selecting the right support vectors may matter more than their precise weighting. Although current hardware limits the size of solvable QUBOs, our results highlight the potential of quantum annealing for efficient SVM training as quantum devices scale.
comment: Presented at the 7th DSO Workshop at ECML PKDD 2025
☆ Model Inversion with Layer-Specific Modeling and Alignment for Data-Free Continual Learning NeurIPS 2025
Continual learning (CL) aims to incrementally train a model on a sequence of tasks while retaining performance on prior ones. However, storing and replaying data is often infeasible due to privacy or security constraints and impractical for arbitrary pre-trained models. Data-free CL seeks to update models without access to previous data. Beyond regularization, we employ model inversion to synthesize data from the trained model, enabling replay without storing samples. Yet, model inversion in predictive models faces two challenges: (1) generating inputs solely from compressed output labels causes drift between synthetic and real data, and replaying such data can erode prior knowledge; (2) inversion is computationally expensive since each step backpropagates through the full model. These issues are amplified in large pre-trained models such as CLIP. To improve efficiency, we propose Per-layer Model Inversion (PMI), inspired by faster convergence in single-layer optimization. PMI provides strong initialization for full-model inversion, substantially reducing iterations. To mitigate feature shift, we model class-wise features via Gaussian distributions and contrastive model, ensuring alignment between synthetic and real features. Combining PMI and feature modeling, our approach enables continual learning of new classes by generating pseudo-images from semantic-aware projected features, achieving strong effectiveness and compatibility across multiple CL settings.
comment: Accepted in NeurIPS 2025
☆ A Survey of Heterogeneous Graph Neural Networks for Cybersecurity Anomaly Detection
Anomaly detection is a critical task in cybersecurity, where identifying insider threats, access violations, and coordinated attacks is essential for ensuring system resilience. Graph-based approaches have become increasingly important for modeling entity interactions, yet most rely on homogeneous and static structures, which limits their ability to capture the heterogeneity and temporal evolution of real-world environments. Heterogeneous Graph Neural Networks (HGNNs) have emerged as a promising paradigm for anomaly detection by incorporating type-aware transformations and relation-sensitive aggregation, enabling more expressive modeling of complex cyber data. However, current research on HGNN-based anomaly detection remains fragmented, with diverse modeling strategies, limited comparative evaluation, and an absence of standardized benchmarks. To address this gap, we provide a comprehensive survey of HGNN-based anomaly detection methods in cybersecurity. We introduce a taxonomy that classifies approaches by anomaly type and graph dynamics, analyze representative models, and map them to key cybersecurity applications. We also review commonly used benchmark datasets and evaluation metrics, highlighting their strengths and limitations. Finally, we identify key open challenges related to modeling, data, and deployment, and outline promising directions for future research. This survey aims to establish a structured foundation for advancing HGNN-based anomaly detection toward scalable, interpretable, and practically deployable solutions.
comment: 37 pages, 4 figures, 86 references. Submitted to Journal of Computer Security (under review)
☆ Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime
Adam [Kingma and Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with $\ell_\infty$-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and we show that its bias can deviate from the full-batch behavior. To illustrate this, we construct a class of structured datasets where incremental Adam provably converges to the $\ell_2$-max-margin classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch Adam. For general datasets, we develop a proxy algorithm that captures the limiting behavior of incremental Adam as $\beta_2 \to 1$ and we characterize its convergence direction via a data-dependent dual fixed-point formulation. Finally, we prove that, unlike Adam, Signum [Bernstein et al., 2018] converges to the $\ell_\infty$-max-margin classifier for any batch size by taking $\beta$ close enough to 1. Overall, our results highlight that the implicit bias of Adam crucially depends on both the batching scheme and the dataset, while Signum remains invariant.
comment: 50 pages
☆ Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens
Contrastive Language-Image Pre-training (CLIP) delivers strong cross modal generalization by aligning images and texts in a shared embedding space, yet it persistently fails at compositional reasoning over objects, attributes, and relations often behaving like a bag-of-words matcher. Prior causal accounts typically model text as a single vector, obscuring token-level structure and leaving core phenomena-such as prompt sensitivity and failures on hard negatives unexplained. We address this gap with a token-aware causal representation learning (CRL) framework grounded in a sequential, language-token SCM. Our theory extends block identifiability to tokenized text, proving that CLIP's contrastive objective can recover the modal-invariant latent variable under both sentence-level and token-level SCMs. Crucially, token granularity yields the first principled explanation of CLIP's compositional brittleness: composition nonidentifiability. We show the existence of pseudo-optimal text encoders that achieve perfect modal-invariant alignment yet are provably insensitive to SWAP, REPLACE, and ADD operations over atomic concepts, thereby failing to distinguish correct captions from hard negatives despite optimizing the same training objective as true-optimal encoders. The analysis further links language-side nonidentifiability to visual-side failures via the modality gap and shows how iterated composition operators compound hardness, motivating improved negative mining strategies.
☆ Offline Clustering of Preference Learning with Active-data Augmentation
Preference learning from pairwise feedback is a widely adopted framework in applications such as reinforcement learning with human feedback and recommendations. In many practical settings, however, user interactions are limited or costly, making offline preference learning necessary. Moreover, real-world preference learning often involves users with different preferences. For example, annotators from different backgrounds may rank the same responses differently. This setting presents two central challenges: (1) identifying similarity across users to effectively aggregate data, especially under scenarios where offline data is imbalanced across dimensions, and (2) handling the imbalanced offline data where some preference dimensions are underrepresented. To address these challenges, we study the Offline Clustering of Preference Learning problem, where the learner has access to fixed datasets from multiple users with potentially different preferences and aims to maximize utility for a test user. To tackle the first challenge, we first propose Off-C$^2$PL for the pure offline setting, where the learner relies solely on offline data. Our theoretical analysis provides a suboptimality bound that explicitly captures the tradeoff between sample noise and bias. To address the second challenge of inbalanced data, we extend our framework to the setting with active-data augmentation where the learner is allowed to select a limited number of additional active-data for the test user based on the cluster structure learned by Off-C$^2$PL. In this setting, our second algorithm, A$^2$-Off-C$^2$PL, actively selects samples that target the least-informative dimensions of the test user's preference. We prove that these actively collected samples contribute more effectively than offline ones. Finally, we validate our theoretical results through simulations on synthetic and real-world datasets.
☆ Unravelling the Mechanisms of Manipulating Numbers in Language Models
Recent work has shown that different large language models (LLMs) converge to similar and accurate input embedding representations for numbers. These findings conflict with the documented propensity of LLMs to produce erroneous outputs when dealing with numeric information. In this work, we aim to explain this conflict by exploring how language models manipulate numbers and quantify the lower bounds of accuracy of these mechanisms. We find that despite surfacing errors, different language models learn interchangeable representations of numbers that are systematic, highly accurate and universal across their hidden states and the types of input contexts. This allows us to create universal probes for each LLM and to trace information -- including the causes of output errors -- to specific layers. Our results lay a fundamental understanding of how pre-trained LLMs manipulate numbers and outline the potential of more accurate probing techniques in addressed refinements of LLMs' architectures.
☆ Empirical Bayesian Multi-Bandit Learning
Multi-task learning in contextual bandits has attracted significant research interest due to its potential to enhance decision-making across multiple related tasks by leveraging shared structures and task-specific heterogeneity. In this article, we propose a novel hierarchical Bayesian framework for learning in various bandit instances. This framework captures both the heterogeneity and the correlations among different bandit instances through a hierarchical Bayesian model, enabling effective information sharing while accommodating instance-specific variations. Unlike previous methods that overlook the learning of the covariance structure across bandits, we introduce an empirical Bayesian approach to estimate the covariance matrix of the prior distribution.This enhances both the practicality and flexibility of learning across multi-bandits. Building on this approach, we develop two efficient algorithms: ebmTS (Empirical Bayesian Multi-Bandit Thompson Sampling) and ebmUCB (Empirical Bayesian Multi-Bandit Upper Confidence Bound), both of which incorporate the estimated prior into the decision-making process. We provide the frequentist regret upper bounds for the proposed algorithms, thereby filling a research gap in the field of multi-bandit problems. Extensive experiments on both synthetic and real-world datasets demonstrate the superior performance of our algorithms, particularly in complex environments. Our methods achieve lower cumulative regret compared to existing techniques, highlighting their effectiveness in balancing exploration and exploitation across multi-bandits.
comment: 33 pages, 13 figures
☆ Distributional Multi-objective Black-box Optimization for Diffusion-model Inference-time Multi-Target Generation
Diffusion models have been successful in learning complex data distributions. This capability has driven their application to high-dimensional multi-objective black-box optimization problem. Existing approaches often employ an external optimization loop, such as an evolutionary algorithm, to the diffusion model. However, these approaches treat the diffusion model as a black-box refiner, which overlooks the internal distribution transition of the diffusion generation process, limiting their efficiency. To address these challenges, we propose the Inference-time Multi-target Generation (IMG) algorithm, which optimizes the diffusion process at inference-time to generate samples that simultaneously satisfy multiple objectives. Specifically, our IMG performs weighted resampling during the diffusion generation process according to the expected aggregated multi-objective values. This weighted resampling strategy ensures the diffusion-generated samples are distributed according to our desired multi-target Boltzmann distribution. We further derive that the multi-target Boltzmann distribution has an interesting log-likelihood interpretation, where it is the optimal solution to the distributional multi-objective optimization problem. We implemented IMG for a multi-objective molecule generation task. Experiments show that IMG, requiring only a single generation pass, achieves a significantly higher hypervolume than baseline optimization algorithms that often require hundreds of diffusion generations. Notably, our algorithm can be viewed as an optimized diffusion process and can be integrated into existing methods to further improve their performance.
☆ A Research Roadmap for Augmenting Software Engineering Processes and Software Products with Generative AI
Generative AI (GenAI) is rapidly transforming software engineering (SE) practices, influencing how SE processes are executed, as well as how software systems are developed, operated, and evolved. This paper applies design science research to build a roadmap for GenAI-augmented SE. The process consists of three cycles that incrementally integrate multiple sources of evidence, including collaborative discussions from the FSE 2025 "Software Engineering 2030" workshop, rapid literature reviews, and external feedback sessions involving peers. McLuhan's tetrads were used as a conceptual instrument to systematically capture the transforming effects of GenAI on SE processes and software products.The resulting roadmap identifies four fundamental forms of GenAI augmentation in SE and systematically characterizes their related research challenges and opportunities. These insights are then consolidated into a set of future research directions. By grounding the roadmap in a rigorous multi-cycle process and cross-validating it among independent author teams and peers, the study provides a transparent and reproducible foundation for analyzing how GenAI affects SE processes, methods and tools, and for framing future research within this rapidly evolving area. Based on these findings, the article finally makes ten predictions for SE in the year 2030.
☆ PVMark: Enabling Public Verifiability for LLM Watermarking Schemes
Watermarking schemes for large language models (LLMs) have been proposed to identify the source of the generated text, mitigating the potential threats emerged from model theft. However, current watermarking solutions hardly resolve the trust issue: the non-public watermark detection cannot prove itself faithfully conducting the detection. We observe that it is attributed to the secret key mostly used in the watermark detection -- it cannot be public, or the adversary may launch removal attacks provided the key; nor can it be private, or the watermarking detection is opaque to the public. To resolve the dilemma, we propose PVMark, a plugin based on zero-knowledge proof (ZKP), enabling the watermark detection process to be publicly verifiable by third parties without disclosing any secret key. PVMark hinges upon the proof of `correct execution' of watermark detection on which a set of ZKP constraints are built, including mapping, random number generation, comparison, and summation. We implement multiple variants of PVMark in Python, Rust and Circom, covering combinations of three watermarking schemes, three hash functions, and four ZKP protocols, to show our approach effectively works under a variety of circumstances. By experimental results, PVMark efficiently enables public verifiability on the state-of-the-art LLM watermarking schemes yet without compromising the watermarking performance, promising to be deployed in practice.
comment: This work has been submitted to the IEEE for possible publication
☆ Likely Interpolants of Generative Models
Interpolation in generative models allows for controlled generation, model inspection, and more. Unfortunately, most generative models lack a principal notion of interpolants without restrictive assumptions on either the model or data dimension. In this paper, we develop a general interpolation scheme that targets likely transition paths compatible with different metrics and probability distributions. We consider interpolants analogous to a geodesic constrained to a suitable data distribution and derive a novel algorithm for computing these curves, which requires no additional training. Theoretically, we show that our method locally can be considered as a geodesic under a suitable Riemannian metric. We quantitatively show that our interpolation scheme traverses higher density regions than baselines across a range of models and datasets.
☆ Angular Steering: Behavior Control via Rotation in Activation Space NeurIPS 2025
Controlling specific behaviors in large language models while preserving their general capabilities is a central challenge for safe and reliable artificial intelligence deployment. Current steering methods, such as vector addition and directional ablation, are constrained within a two-dimensional subspace defined by the activation and feature direction, making them sensitive to chosen parameters and potentially affecting unrelated features due to unintended interactions in activation space. We introduce Angular Steering, a novel and flexible method for behavior modulation that operates by rotating activations within a fixed two-dimensional subspace. By formulating steering as a geometric rotation toward or away from a target behavior direction, Angular Steering provides continuous, fine-grained control over behaviors such as refusal and compliance. We demonstrate this method using refusal steering emotion steering as use cases. Additionally, we propose Adaptive Angular Steering, a selective variant that rotates only activations aligned with the target feature, further enhancing stability and coherence. Angular Steering generalizes existing addition and orthogonalization techniques under a unified geometric rotation framework, simplifying parameter selection and maintaining model stability across a broader range of adjustments. Experiments across multiple model families and sizes show that Angular Steering achieves robust behavioral control while maintaining general language modeling performance, underscoring its flexibility, generalization, and robustness compared to prior approaches. Code and artifacts are available at https://github.com/lone17/angular-steering/.
comment: NeurIPS 2025 (Spotlight)
☆ MPRU: Modular Projection-Redistribution Unlearning as Output Filter for Classification Pipelines
As a new and promising approach, existing machine unlearning (MU) works typically emphasize theoretical formulations or optimization objectives to achieve knowledge removal. However, when deployed in real-world scenarios, such solutions typically face scalability issues and have to address practical requirements such as full access to original datasets and model. In contrast to the existing approaches, we regard classification training as a sequential process where classes are learned sequentially, which we call \emph{inductive approach}. Unlearning can then be done by reversing the last training sequence. This is implemented by appending a projection-redistribution layer in the end of the model. Such an approach does not require full access to the original dataset or the model, addressing the challenges of existing methods. This enables modular and model-agnostic deployment as an output filter into existing classification pipelines with minimal alterations. We conducted multiple experiments across multiple datasets including image (CIFAR-10/100 using CNN-based model) and tabular datasets (Covertype using tree-based model). Experiment results show consistently similar output to a fully retrained model with a high computational cost reduction. This demonstrates the applicability, scalability, and system compatibility of our solution while maintaining the performance of the output in a more practical setting.
comment: 10 pages, 6 figures
☆ Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space
Test-time alignment of large language models (LLMs) attracts attention because fine-tuning LLMs requires high computational costs. In this paper, we propose a new test-time alignment method called adaptive importance sampling on pre-logits (AISP) on the basis of the sampling-based model predictive control with the stochastic control input. AISP applies the Gaussian perturbation into pre-logits, which are outputs of the penultimate layer, so as to maximize expected rewards with respect to the mean of the perturbation. We demonstrate that the optimal mean is obtained by importance sampling with sampled rewards. AISP outperforms best-of-n sampling in terms of rewards over the number of used samples and achieves higher rewards than other reward-based test-time alignment methods.
comment: 21 pages, 8 figures
☆ Predicting All-Cause Hospital Readmissions from Medical Claims Data of Hospitalised Patients
Reducing preventable hospital readmissions is a national priority for payers, providers, and policymakers seeking to improve health care and lower costs. The rate of readmission is being used as a benchmark to determine the quality of healthcare provided by the hospitals. In thisproject, we have used machine learning techniques like Logistic Regression, Random Forest and Support Vector Machines to analyze the health claims data and identify demographic and medical factors that play a crucial role in predicting all-cause readmissions. As the health claims data is high dimensional, we have used Principal Component Analysis as a dimension reduction technique and used the results for building regression models. We compared and evaluated these models based on the Area Under Curve (AUC) metric. Random Forest model gave the highest performance followed by Logistic Regression and Support Vector Machine models. These models can be used to identify the crucial factors causing readmissions and help identify patients to focus on to reduce the chances of readmission, ultimately bringing down the cost and increasing the quality of healthcare provided to the patients.
comment: NCMLAI 2018
☆ Accumulative SGD Influence Estimation for Data Attribution
Modern data-centric AI needs precise per-sample influence. Standard SGD-IE approximates leave-one-out effects by summing per-epoch surrogates and ignores cross-epoch compounding, which misranks critical examples. We propose ACC-SGD-IE, a trajectory-aware estimator that propagates the leave-one-out perturbation across training and updates an accumulative influence state at each step. In smooth strongly convex settings it achieves geometric error contraction and, in smooth non-convex regimes, it tightens error bounds; larger mini-batches further reduce constants. Empirically, on Adult, 20 Newsgroups, and MNIST under clean and corrupted data and both convex and non-convex training, ACC-SGD-IE yields more accurate influence estimates, especially over long epochs. For downstream data cleansing it more reliably flags noisy samples, producing models trained on ACC-SGD-IE cleaned data that outperform those cleaned with SGD-IE.
☆ A Game-Theoretic Spatio-Temporal Reinforcement Learning Framework for Collaborative Public Resource Allocation
Public resource allocation involves the efficient distribution of resources, including urban infrastructure, energy, and transportation, to effectively meet societal demands. However, existing methods focus on optimizing the movement of individual resources independently, without considering their capacity constraints. To address this limitation, we propose a novel and more practical problem: Collaborative Public Resource Allocation (CPRA), which explicitly incorporates capacity constraints and spatio-temporal dynamics in real-world scenarios. We propose a new framework called Game-Theoretic Spatio-Temporal Reinforcement Learning (GSTRL) for solving CPRA. Our contributions are twofold: 1) We formulate the CPRA problem as a potential game and demonstrate that there is no gap between the potential function and the optimal target, laying a solid theoretical foundation for approximating the Nash equilibrium of this NP-hard problem; and 2) Our designed GSTRL framework effectively captures the spatio-temporal dynamics of the overall system. We evaluate GSTRL on two real-world datasets, where experiments show its superior performance. Our source codes are available in the supplementary materials.
☆ Segmentation over Complexity: Evaluating Ensemble and Hybrid Approaches for Anomaly Detection in Industrial Time Series
In this study, we investigate the effectiveness of advanced feature engineering and hybrid model architectures for anomaly detection in a multivariate industrial time series, focusing on a steam turbine system. We evaluate the impact of change point-derived statistical features, clustering-based substructure representations, and hybrid learning strategies on detection performance. Despite their theoretical appeal, these complex approaches consistently underperformed compared to a simple Random Forest + XGBoost ensemble trained on segmented data. The ensemble achieved an AUC-ROC of 0.976, F1-score of 0.41, and 100% early detection within the defined time window. Our findings highlight that, in scenarios with highly imbalanced and temporally uncertain data, model simplicity combined with optimized segmentation can outperform more sophisticated architectures, offering greater robustness, interpretability, and operational utility.
comment: This paper is currently under review for presentation at the IEEE SAMI 2026 Conference
☆ Bridging the Gap Between Molecule and Textual Descriptions via Substructure-aware Alignment EMNLP 2025
Molecule and text representation learning has gained increasing interest due to its potential for enhancing the understanding of chemical information. However, existing models often struggle to capture subtle differences between molecules and their descriptions, as they lack the ability to learn fine-grained alignments between molecular substructures and chemical phrases. To address this limitation, we introduce MolBridge, a novel molecule-text learning framework based on substructure-aware alignments. Specifically, we augment the original molecule-description pairs with additional alignment signals derived from molecular substructures and chemical phrases. To effectively learn from these enriched alignments, MolBridge employs substructure-aware contrastive learning, coupled with a self-refinement mechanism that filters out noisy alignment signals. Experimental results show that MolBridge effectively captures fine-grained correspondences and outperforms state-of-the-art baselines on a wide range of molecular benchmarks, highlighting the significance of substructure-aware alignment in molecule-text learning.
comment: EMNLP 2025 (main)
☆ STAR: A Privacy-Preserving, Energy-Efficient Edge AI Framework for Human Activity Recognition via Wi-Fi CSI in Mobile and Pervasive Computing Environments
Human Activity Recognition (HAR) via Wi-Fi Channel State Information (CSI) presents a privacy-preserving, contactless sensing approach suitable for smart homes, healthcare monitoring, and mobile IoT systems. However, existing methods often encounter computational inefficiency, high latency, and limited feasibility within resource-constrained, embedded mobile edge environments. This paper proposes STAR (Sensing Technology for Activity Recognition), an edge-AI-optimized framework that integrates a lightweight neural architecture, adaptive signal processing, and hardware-aware co-optimization to enable real-time, energy-efficient HAR on low-power embedded devices. STAR incorporates a streamlined Gated Recurrent Unit (GRU)-based recurrent neural network, reducing model parameters by 33% compared to conventional LSTM models while maintaining effective temporal modeling capability. A multi-stage pre-processing pipeline combining median filtering, 8th-order Butterworth low-pass filtering, and Empirical Mode Decomposition (EMD) is employed to denoise CSI amplitude data and extract spatial-temporal features. For on-device deployment, STAR is implemented on a Rockchip RV1126 processor equipped with an embedded Neural Processing Unit (NPU), interfaced with an ESP32-S3-based CSI acquisition module. Experimental results demonstrate a mean recognition accuracy of 93.52% across seven activity classes and 99.11% for human presence detection, utilizing a compact 97.6k-parameter model. INT8 quantized inference achieves a processing speed of 33 MHz with just 8% CPU utilization, delivering sixfold speed improvements over CPU-based execution. With sub-second response latency and low power consumption, the system ensures real-time, privacy-preserving HAR, offering a practical, scalable solution for mobile and pervasive computing environments.
☆ maxVSTAR: Maximally Adaptive Vision-Guided CSI Sensing with Closed-Loop Edge Model Adaptation for Robust Human Activity Recognition
WiFi Channel State Information (CSI)-based human activity recognition (HAR) provides a privacy-preserving, device-free sensing solution for smart environments. However, its deployment on edge devices is severely constrained by domain shift, where recognition performance deteriorates under varying environmental and hardware conditions. This study presents maxVSTAR (maximally adaptive Vision-guided Sensing Technology for Activity Recognition), a closed-loop, vision-guided model adaptation framework that autonomously mitigates domain shift for edge-deployed CSI sensing systems. The proposed system integrates a cross-modal teacher-student architecture, where a high-accuracy YOLO-based vision model serves as a dynamic supervisory signal, delivering real-time activity labels for the CSI data stream. These labels enable autonomous, online fine-tuning of a lightweight CSI-based HAR model, termed Sensing Technology for Activity Recognition (STAR), directly at the edge. This closed-loop retraining mechanism allows STAR to continuously adapt to environmental changes without manual intervention. Extensive experiments demonstrate the effectiveness of maxVSTAR. When deployed on uncalibrated hardware, the baseline STAR model's recognition accuracy declined from 93.52% to 49.14%. Following a single vision-guided adaptation cycle, maxVSTAR restored the accuracy to 81.51%. These results confirm the system's capacity for dynamic, self-supervised model adaptation in privacy-conscious IoT environments, establishing a scalable and practical paradigm for long-term autonomous HAR using CSI sensing at the network edge.
☆ Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation
Large language models (LLMs) have advanced code generation at the function level, yet their ability to produce correct class-level implementations in authentic software projects remains poorly understood. This work introduces a novel benchmark derived from open-source repositories, comprising real-world classes divided into seen and unseen partitions to evaluate generalization under practical conditions. The evaluation examines multiple LLMs under varied input specifications, retrieval-augmented configurations, and documentation completeness levels. Results reveal a stark performance disparity: LLMs achieve 84% to 89% correctness on established synthetic benchmarks but only 25% to 34% on real-world class tasks, with negligible differences between familiar and novel codebases. Comprehensive docstrings yield modest gains of 1% to 3% in functional accuracy, though statistical significance is rare. Retrieval-augmented generation proves most effective with partial documentation, improving correctness by 4% to 7% by supplying concrete implementation patterns absent from specifications. Error profiling identifies AttributeError, TypeError, and AssertionError as dominant failure modes (84% of cases), with synthetic tests overemphasizing assertion issues and real-world scenarios highlighting type and attribute mismatches. Retrieval augmentation reduces logical flaws but can introduce dependency conflicts. The benchmark and analysis expose critical limitations in current LLM capabilities for class-level engineering, offering actionable insights for enhancing context modelling, documentation strategies, and retrieval integration in production code assistance tools.
comment: Pre-print prepared for journal submission
☆ Uncertainty-Aware Diagnostics for Physics-Informed Machine Learning
Physics-informed machine learning (PIML) integrates prior physical information, often in the form of differential equation constraints, into the process of fitting machine learning models to physical data. Popular PIML approaches, including neural operators, physics-informed neural networks, neural ordinary differential equations, and neural discrete equilibria, are typically fit to objectives that simultaneously include both data and physical constraints. However, the multi-objective nature of this approach creates ambiguity in the measurement of model quality. This is related to a poor understanding of epistemic uncertainty, and it can lead to surprising failure modes, even when existing statistical metrics suggest strong fits. Working within a Gaussian process regression framework, we introduce the Physics-Informed Log Evidence (PILE) score. Bypassing the ambiguities of test losses, the PILE score is a single, uncertainty-aware metric that provides a selection principle for hyperparameters of a PIML model. We show that PILE minimization yields excellent choices for a wide variety of model parameters, including kernel bandwidth, least squares regularization weights, and even kernel function selection. We also show that, even prior to data acquisition, a special 'data-free' case of the PILE score identifies a priori kernel choices that are 'well-adapted' to a given PDE. Beyond the kernel setting, we anticipate that the PILE score can be extended to PIML at large, and we outline approaches to do so.
☆ Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error
Reinforcement learning with verifiable rewards (RLVR) has significantly boosted the reasoning capability of large language models (LLMs) recently. However, existing RLVR approaches merely train LLMs based on their own generated responses and are constrained by the initial capability of LLMs, thus prone to exploration stagnation, in which LLMs fail to solve more training problems and cannot further learn from the training data. Some work tries to address this by leveraging off-policy solutions to training problems but requires external guidance from experts which suffers from limited availability. In this work, we propose LTE (Learning to reason from Trial and Error), an approach hinting LLMs with their previously self-generated incorrect answers and problem of overlong responses, which does not require any external expert guidance. Experiments validate the effectiveness of LTE, which outperforms the normal group relative policy optimization (GRPO) by 6.38 in Pass@1 and 9.00 in Pass@k on average across six mathematics benchmarks for Qwen3-4B-Base. Further analysis confirms that LTE successfully mitigates the problem of exploration stagnation and enhances both exploitation and exploration during training.
comment: Work in progress
☆ SAFE: A Novel Approach to AI Weather Evaluation through Stratified Assessments of Forecasts over Earth
The dominant paradigm in machine learning is to assess model performance based on average loss across all samples in some test set. This amounts to averaging performance geospatially across the Earth in weather and climate settings, failing to account for the non-uniform distribution of human development and geography. We introduce Stratified Assessments of Forecasts over Earth (SAFE), a package for elucidating the stratified performance of a set of predictions made over Earth. SAFE integrates various data domains to stratify by different attributes associated with geospatial gridpoints: territory (usually country), global subregion, income, and landcover (land or water). This allows us to examine the performance of models for each individual stratum of the different attributes (e.g., the accuracy in every individual country). To demonstrate its importance, we utilize SAFE to benchmark a zoo of state-of-the-art AI-based weather prediction models, finding that they all exhibit disparities in forecasting skill across every attribute. We use this to seed a benchmark of model forecast fairness through stratification at different lead times for various climatic variables. By moving beyond globally-averaged metrics, we for the first time ask: where do models perform best or worst, and which models are most fair? To support further work in this direction, the SAFE package is open source and available at https://github.com/N-Masi/safe
☆ Robust Super-Capacity SRS Channel Inpainting via Diffusion Models
Accurate channel state information (CSI) is essential for reliable multiuser MIMO operation. In 5G NR, reciprocity-based beamforming via uplink Sounding Reference Signals (SRS) face resource and coverage constraints, motivating sparse non-uniform SRS allocation. Prior masked-autoencoder (MAE) approaches improve coverage but overfit to training masks and degrade under unseen distortions (e.g., additional masking, interference, clipping, non-Gaussian noise). We propose a diffusion-based channel inpainting framework that integrates system-model knowledge at inference via a likelihood-gradient term, enabling a single trained model to adapt across mismatched conditions. On standardized CDL channels, the score-based diffusion variant consistently outperforms a UNet score-model baseline and the one-step MAE under distribution shift, with improvements up to 14 dB NMSE in challenging settings (e.g., Laplace noise, user interference), while retaining competitive accuracy under matched conditions. These results demonstrate that diffusion-guided inpainting is a robust and generalizable approach for super-capacity SRS design in 5G NR systems.
☆ ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models NeurIPS 2025
Recent advances in Audio-Language Models (ALMs) have significantly improved multimodal understanding capabilities. However, the introduction of the audio modality also brings new and unique vulnerability vectors. Previous studies have proposed jailbreak attacks that specifically target ALMs, revealing that defenses directly transferred from traditional audio adversarial attacks or text-based Large Language Model (LLM) jailbreaks are largely ineffective against these ALM-specific threats. To address this issue, we propose ALMGuard, the first defense framework tailored to ALMs. Based on the assumption that safety-aligned shortcuts naturally exist in ALMs, we design a method to identify universal Shortcut Activation Perturbations (SAPs) that serve as triggers that activate the safety shortcuts to safeguard ALMs at inference time. To better sift out effective triggers while preserving the model's utility on benign tasks, we further propose Mel-Gradient Sparse Mask (M-GSM), which restricts perturbations to Mel-frequency bins that are sensitive to jailbreaks but insensitive to speech understanding. Both theoretical analyses and empirical results demonstrate the robustness of our method against both seen and unseen attacks. Overall, \MethodName reduces the average success rate of advanced ALM-specific jailbreak attacks to 4.6% across four models, while maintaining comparable utility on benign benchmarks, establishing it as the new state of the art. Our code and data are available at https://github.com/WeifeiJin/ALMGuard.
comment: Accepted to NeurIPS 2025
☆ Lean4Physics: Comprehensive Reasoning Framework for College-level Physics in Lean4
We present **Lean4PHYS**, a comprehensive reasoning framework for college-level physics problems in Lean4. **Lean4PHYS** includes *LeanPhysBench*, a college-level benchmark for formal physics reasoning in Lean4, which contains 200 hand-crafted and peer-reviewed statements derived from university textbooks and physics competition problems. To establish a solid foundation for formal reasoning in physics, we also introduce *PhysLib*, a community-driven repository containing fundamental unit systems and theorems essential for formal physics reasoning. Based on the benchmark and Lean4 repository we composed in **Lean4PHYS**, we report baseline results using major expert Math Lean4 provers and state-of-the-art closed-source models, with the best performance of DeepSeek-Prover-V2-7B achieving only 16% and Claude-Sonnet-4 achieving 35%. We also conduct a detailed analysis showing that our *PhysLib* can achieve an average improvement of 11.75% in model performance. This demonstrates the challenging nature of our *LeanPhysBench* and the effectiveness of *PhysLib*. To the best of our knowledge, this is the first study to provide a physics benchmark in Lean4.
☆ Network-Constrained Policy Optimization for Adaptive Multi-agent Vehicle Routing
Traffic congestion in urban road networks leads to longer trip times and higher emissions, especially during peak periods. While the Shortest Path First (SPF) algorithm is optimal for a single vehicle in a static network, it performs poorly in dynamic, multi-vehicle settings, often worsening congestion by routing all vehicles along identical paths. We address dynamic vehicle routing through a multi-agent reinforcement learning (MARL) framework for coordinated, network-aware fleet navigation. We first propose Adaptive Navigation (AN), a decentralized MARL model where each intersection agent provides routing guidance based on (i) local traffic and (ii) neighborhood state modeled using Graph Attention Networks (GAT). To improve scalability in large networks, we further propose Hierarchical Hub-based Adaptive Navigation (HHAN), an extension of AN that assigns agents only to key intersections (hubs). Vehicles are routed hub-to-hub under agent control, while SPF handles micro-routing within each hub region. For hub coordination, HHAN adopts centralized training with decentralized execution (CTDE) under the Attentive Q-Mixing (A-QMIX) framework, which aggregates asynchronous vehicle decisions via attention. Hub agents use flow-aware state features that combine local congestion and predictive dynamics for proactive routing. Experiments on synthetic grids and real urban maps (Toronto, Manhattan) show that AN reduces average travel time versus SPF and learning baselines, maintaining 100% routing success. HHAN scales to networks with hundreds of intersections, achieving up to 15.9% improvement under heavy traffic. These findings highlight the potential of network-constrained MARL for scalable, coordinated, and congestion-aware routing in intelligent transportation systems.
comment: 29 pages, 12 figures. Fazel Arasteh and Arian Haghparast contributed equally to this research. Submitted to ACM Transactions on Spatial Algorithms and Systems (TSAS). The code for this work is publicly available at https://github.com/Arianhgh/HHAN
☆ LLMBisect: Breaking Barriers in Bug Bisection with A Comparative Analysis Pipeline
Bug bisection has been an important security task that aims to understand the range of software versions impacted by a bug, i.e., identifying the commit that introduced the bug. However, traditional patch-based bisection methods are faced with several significant barriers: For example, they assume that the bug-inducing commit (BIC) and the patch commit modify the same functions, which is not always true. They often rely solely on code changes, while the commit message frequently contains a wealth of vulnerability-related information. They are also based on simple heuristics (e.g., assuming the BIC initializes lines deleted in the patch) and lack any logical analysis of the vulnerability. In this paper, we make the observation that Large Language Models (LLMs) are well-positioned to break the barriers of existing solutions, e.g., comprehend both textual data and code in patches and commits. Unlike previous BIC identification approaches, which yield poor results, we propose a comprehensive multi-stage pipeline that leverages LLMs to: (1) fully utilize patch information, (2) compare multiple candidate commits in context, and (3) progressively narrow down the candidates through a series of down-selection steps. In our evaluation, we demonstrate that our approach achieves significantly better accuracy than the state-of-the-art solution by more than 38\%. Our results further confirm that the comprehensive multi-stage pipeline is essential, as it improves accuracy by 60\% over a baseline LLM-based bisection method.
☆ Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism
Specialized Generalist Models (SGMs) aim to preserve broad capabilities while achieving expert-level performance in target domains. However, traditional LLM structures including Transformer, Linear Attention, and hybrid models do not employ specialized memory mechanism guided by task information. In this paper, we present Nirvana, an SGM with specialized memory mechanism, linear time complexity, and test-time task information extraction. Besides, we propose the Task-Aware Memory Trigger ($\textit{Trigger}$) that flexibly adjusts memory mechanism based on the current task's requirements. In Trigger, each incoming sample is treated as a self-supervised fine-tuning task, enabling Nirvana to adapt its task-related parameters on the fly to domain shifts. We also design the Specialized Memory Updater ($\textit{Updater}$) that dynamically memorizes the context guided by Trigger. We conduct experiments on both general language tasks and specialized medical tasks. On a variety of natural language modeling benchmarks, Nirvana achieves competitive or superior results compared to the existing LLM structures. To prove the effectiveness of Trigger on specialized tasks, we test Nirvana's performance on a challenging medical task, i.e., Magnetic Resonance Imaging (MRI). We post-train frozen Nirvana backbone with lightweight codecs on paired electromagnetic signals and MRI images. Despite the frozen Nirvana backbone, Trigger guides the model to adapt to the MRI domain with the change of task-related parameters. Nirvana achieves higher-quality MRI reconstruction compared to conventional MRI models as well as the models with traditional LLMs' backbone, and can also generate accurate preliminary clinical reports accordingly.
☆ New Money: A Systematic Review of Synthetic Data Generation for Finance
Synthetic data generation has emerged as a promising approach to address the challenges of using sensitive financial data in machine learning applications. By leveraging generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), it is possible to create artificial datasets that preserve the statistical properties of real financial records while mitigating privacy risks and regulatory constraints. Despite the rapid growth of this field, a comprehensive synthesis of the current research landscape has been lacking. This systematic review consolidates and analyses 72 studies published since 2018 that focus on synthetic financial data generation. We categorise the types of financial information synthesised, the generative methods employed, and the evaluation strategies used to assess data utility and privacy. The findings indicate that GAN-based approaches dominate the literature, particularly for generating time-series market data and tabular credit data. While several innovative techniques demonstrate potential for improved realism and privacy preservation, there remains a notable lack of rigorous evaluation of privacy safeguards across studies. By providing an integrated overview of generative techniques, applications, and evaluation methods, this review highlights critical research gaps and offers guidance for future work aimed at developing robust, privacy-preserving synthetic data solutions for the financial domain.
comment: 37 pages, 5 figures, 21 tables
☆ Learning Geometry: A Framework for Building Adaptive Manifold Models through Metric Optimization
This paper proposes a novel paradigm for machine learning that moves beyond traditional parameter optimization. Unlike conventional approaches that search for optimal parameters within a fixed geometric space, our core idea is to treat the model itself as a malleable geometric entity. Specifically, we optimize the metric tensor field on a manifold with a predefined topology, thereby dynamically shaping the geometric structure of the model space. To achieve this, we construct a variational framework whose loss function carefully balances data fidelity against the intrinsic geometric complexity of the manifold. The former ensures the model effectively explains observed data, while the latter acts as a regularizer, penalizing overly curved or irregular geometries to encourage simpler models and prevent overfitting. To address the computational challenges of this infinite-dimensional optimization problem, we introduce a practical method based on discrete differential geometry: the continuous manifold is discretized into a triangular mesh, and the metric tensor is parameterized by edge lengths, enabling efficient optimization using automatic differentiation tools. Theoretical analysis reveals a profound analogy between our framework and the Einstein-Hilbert action in general relativity, providing an elegant physical interpretation for the concept of "data-driven geometry". We further argue that even with fixed topology, metric optimization offers significantly greater expressive power than models with fixed geometry. This work lays a solid foundation for constructing fully dynamic "meta-learners" capable of autonomously evolving their geometry and topology, and it points to broad application prospects in areas such as scientific model discovery and robust representation learning.
comment: 9 pages
☆ Towards Scaling Laws for Symbolic Regression NeurIPS 2025
Symbolic regression (SR) aims to discover the underlying mathematical expressions that explain observed data. This holds promise for both gaining scientific insight and for producing inherently interpretable and generalizable models for tabular data. In this work we focus on the basics of SR. Deep learning-based SR has recently become competitive with genetic programming approaches, but the role of scale has remained largely unexplored. Inspired by scaling laws in language modeling, we present the first systematic investigation of scaling in SR, using a scalable end-to-end transformer pipeline and carefully generated training data. Across five different model sizes and spanning three orders of magnitude in compute, we find that both validation loss and solved rate follow clear power-law trends with compute. We further identify compute-optimal hyperparameter scaling: optimal batch size and learning rate grow with model size, and a token-to-parameter ratio of $\approx$15 is optimal in our regime, with a slight upward trend as compute increases. These results demonstrate that SR performance is largely predictable from compute and offer important insights for training the next generation of SR models.
comment: Accepted at the NeurIPS 2025 Math-AI Workshop
☆ Data-driven Projection Generation for Efficiently Solving Heterogeneous Quadratic Programming Problems
We propose a data-driven framework for efficiently solving quadratic programming (QP) problems by reducing the number of variables in high-dimensional QPs using instance-specific projection. A graph neural network-based model is designed to generate projections tailored to each QP instance, enabling us to produce high-quality solutions even for previously unseen problems. The model is trained on heterogeneous QPs to minimize the expected objective value evaluated on the projected solutions. This is formulated as a bilevel optimization problem; the inner optimization solves the QP under a given projection using a QP solver, while the outer optimization updates the model parameters. We develop an efficient algorithm to solve this bilevel optimization problem, which computes parameter gradients without backpropagating through the solver. We provide a theoretical analysis of the generalization ability of solving QPs with projection matrices generated by neural networks. Experimental results demonstrate that our method produces high-quality feasible solutions with reduced computation time, outperforming existing methods.
☆ Bias-Corrected Data Synthesis for Imbalanced Learning
Imbalanced data, where the positive samples represent only a small proportion compared to the negative samples, makes it challenging for classification problems to balance the false positive and false negative rates. A common approach to addressing the challenge involves generating synthetic data for the minority group and then training classification models with both observed and synthetic data. However, since the synthetic data depends on the observed data and fails to replicate the original data distribution accurately, prediction accuracy is reduced when the synthetic data is naively treated as the true data. In this paper, we address the bias introduced by synthetic data and provide consistent estimators for this bias by borrowing information from the majority group. We propose a bias correction procedure to mitigate the adverse effects of synthetic data, enhancing prediction accuracy while avoiding overfitting. This procedure is extended to broader scenarios with imbalanced data, such as imbalanced multi-task learning and causal inference. Theoretical properties, including bounds on bias estimation errors and improvements in prediction accuracy, are provided. Simulation results and data analysis on handwritten digit datasets demonstrate the effectiveness of our method.
comment: 41 pages, 4 figures, includes proofs and appendix
☆ $L_1$-norm Regularized Indefinite Kernel Logistic Regression
Kernel logistic regression (KLR) is a powerful classification method widely applied across diverse domains. In many real-world scenarios, indefinite kernels capture more domain-specific structural information than positive definite kernels. This paper proposes a novel $L_1$-norm regularized indefinite kernel logistic regression (RIKLR) model, which extends the existing IKLR framework by introducing sparsity via an $L_1$-norm penalty. The introduction of this regularization enhances interpretability and generalization while introducing nonsmoothness and nonconvexity into the optimization landscape. To address these challenges, a theoretically grounded and computationally efficient proximal linearized algorithm is developed. Experimental results on multiple benchmark datasets demonstrate the superior performance of the proposed method in terms of both accuracy and sparsity.
comment: 17 pages, 1 figure
☆ Accelerating Real-World Overtaking in F1TENTH Racing Employing Reinforcement Learning Methods
While autonomous racing performance in Time-Trial scenarios has seen significant progress and development, autonomous wheel-to-wheel racing and overtaking are still severely limited. These limitations are particularly apparent in real-life driving scenarios where state-of-the-art algorithms struggle to safely or reliably complete overtaking manoeuvres. This is important, as reliable navigation around other vehicles is vital for safe autonomous wheel-to-wheel racing. The F1Tenth Competition provides a useful opportunity for developing wheel-to-wheel racing algorithms on a standardised physical platform. The competition format makes it possible to evaluate overtaking and wheel-to-wheel racing algorithms against the state-of-the-art. This research presents a novel racing and overtaking agent capable of learning to reliably navigate a track and overtake opponents in both simulation and reality. The agent was deployed on an F1Tenth vehicle and competed against opponents running varying competitive algorithms in the real world. The results demonstrate that the agent's training against opponents enables deliberate overtaking behaviours with an overtaking rate of 87% compared 56% for an agent trained just to race.
☆ Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
Knowledge distillation (KD) is an effective method for model compression and transferring knowledge between models. However, its effect on model's robustness against spurious correlations that degrade performance on out-of-distribution data remains underexplored. This study investigates the effect of knowledge distillation on the transferability of ``debiasing'' capabilities from teacher models to student models on natural language inference (NLI) and image classification tasks. Through extensive experiments, we illustrate several key findings: (i) overall the debiasing capability of a model is undermined post-KD; (ii) training a debiased model does not benefit from injecting teacher knowledge; (iii) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases; and (iv) we pin-point the internal attention pattern and circuit that causes the distinct behavior post-KD. Given the above findings, we propose three effective solutions to improve the distillability of debiasing methods: developing high quality data for augmentation, implementing iterative knowledge distillation, and initializing student models with weights obtained from teacher models. To the best of our knowledge, this is the first study on the effect of KD on debiasing and its interenal mechanism at scale. Our findings provide understandings on how KD works and how to design better debiasing methods.
☆ Overspecified Mixture Discriminant Analysis: Exponential Convergence, Statistical Guarantees, and Remote Sensing Applications
This study explores the classification error of Mixture Discriminant Analysis (MDA) in scenarios where the number of mixture components exceeds those present in the actual data distribution, a condition known as overspecification. We use a two-component Gaussian mixture model within each class to fit data generated from a single Gaussian, analyzing both the algorithmic convergence of the Expectation-Maximization (EM) algorithm and the statistical classification error. We demonstrate that, with suitable initialization, the EM algorithm converges exponentially fast to the Bayes risk at the population level. Further, we extend our results to finite samples, showing that the classification error converges to Bayes risk with a rate $n^{-1/2}$ under mild conditions on the initial parameter estimates and sample size. This work provides a rigorous theoretical framework for understanding the performance of overspecified MDA, which is often used empirically in complex data settings, such as image and text classification. To validate our theory, we conduct experiments on remote sensing datasets.
☆ Adaptive Data Flywheel: Applying MAPE Control Loops to AI Agent Improvement
Enterprise AI agents must continuously adapt to maintain accuracy, reduce latency, and remain aligned with user needs. We present a practical implementation of a data flywheel in NVInfo AI, NVIDIA's Mixture-of-Experts (MoE) Knowledge Assistant serving over 30,000 employees. By operationalizing a MAPE-driven data flywheel, we built a closed-loop system that systematically addresses failures in retrieval-augmented generation (RAG) pipelines and enables continuous learning. Over a 3-month post-deployment period, we monitored feedback and collected 495 negative samples. Analysis revealed two major failure modes: routing errors (5.25\%) and query rephrasal errors (3.2\%). Using NVIDIA NeMo microservices, we implemented targeted improvements through fine-tuning. For routing, we replaced a Llama 3.1 70B model with a fine-tuned 8B variant, achieving 96\% accuracy, a 10x reduction in model size, and 70\% latency improvement. For query rephrasal, fine-tuning yielded a 3.7\% gain in accuracy and a 40\% latency reduction. Our approach demonstrates how human-in-the-loop (HITL) feedback, when structured within a data flywheel, transforms enterprise AI agents into self-improving systems. Key learnings include approaches to ensure agent robustness despite limited user feedback, navigating privacy constraints, and executing staged rollouts in production. This work offers a repeatable blueprint for building robust, adaptive enterprise AI agents capable of learning from real-world usage at scale.
comment: 20 pages, 5 figures, 5 tables. Presents MAPE-K control loop application to enterprise AI agent improvement with experimental validation on NVIDIA's NVInfo AI system
☆ Accelerating Radiative Transfer for Planetary Atmospheres by Orders of Magnitude with a Transformer-Based Machine Learning Model
Radiative transfer calculations are essential for modeling planetary atmospheres. However, standard methods are computationally demanding and impose accuracy-speed trade-offs. High computational costs force numerical simplifications in large models (e.g., General Circulation Models) that degrade the accuracy of the simulation. Radiative transfer calculations are an ideal candidate for machine learning emulation: fundamentally, it is a well-defined physical mapping from a static atmospheric profile to the resulting fluxes, and high-fidelity training data can be created from first principles calculations. We developed a radiative transfer emulator using an encoder-only transformer neural network architecture, trained on 1D profiles representative of solar-composition hot Jupiter atmospheres. Our emulator reproduced bolometric two-stream layer fluxes with mean test set errors of ~1% compared to the traditional method and achieved speedups of 100x. Emulating radiative transfer with machine learning opens up the possibility for faster and more accurate routines within planetary atmospheric models such as GCMs.
☆ Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning
Mathematical reasoning is a central challenge for large language models (LLMs), requiring not only correct answers but also faithful reasoning processes. Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities; however, its ability to foster genuine reasoning remains unclear. We investigate RLVR on two combinatorial problems with fully verifiable solutions: \emph{Activity Scheduling} and the \emph{Longest Increasing Subsequence}, using carefully curated datasets with unique optima. Across multiple reward designs, we find that RLVR improves evaluation metrics but often by reinforcing superficial heuristics rather than acquiring new reasoning strategies. These findings highlight the limits of RLVR generalization, emphasizing the importance of benchmarks that disentangle genuine mathematical reasoning from shortcut exploitation and provide faithful measures of progress. Code available at https://github.com/xashru/rlvr-seq-generalization.
☆ e1: Learning Adaptive Control of Reasoning Effort
Increasing the thinking budget of AI models can significantly improve accuracy, but not all questions warrant the same amount of reasoning. Users may prefer to allocate different amounts of reasoning effort depending on how they value output quality versus latency and cost. To leverage this tradeoff effectively, users need fine-grained control over the amount of thinking used for a particular query, but few approaches enable such control. Existing methods require users to specify the absolute number of desired tokens, but this requires knowing the difficulty of the problem beforehand to appropriately set the token budget for a query. To address these issues, we propose Adaptive Effort Control, a self-adaptive reinforcement learning method that trains models to use a user-specified fraction of tokens relative to the current average chain-of-thought length for each query. This approach eliminates dataset- and phase-specific tuning while producing better cost-accuracy tradeoff curves compared to standard methods. Users can dynamically adjust the cost-accuracy trade-off through a continuous effort parameter specified at inference time. We observe that the model automatically learns to allocate resources proportionally to the task difficulty and, across model scales ranging from 1.5B to 32B parameters, our approach enables approximately 3x reduction in chain-of-thought length while maintaining or improving performance relative to the base model used for RL training.
☆ GeoPep: A geometry-aware masked language model for protein-peptide binding site prediction
Multimodal approaches that integrate protein structure and sequence have achieved remarkable success in protein-protein interface prediction. However, extending these methods to protein-peptide interactions remains challenging due to the inherent conformational flexibility of peptides and the limited availability of structural data that hinder direct training of structure-aware models. To address these limitations, we introduce GeoPep, a novel framework for peptide binding site prediction that leverages transfer learning from ESM3, a multimodal protein foundation model. GeoPep fine-tunes ESM3's rich pre-learned representations from protein-protein binding to address the limited availability of protein-peptide binding data. The fine-tuned model is further integrated with a parameter-efficient neural network architecture capable of learning complex patterns from sparse data. Furthermore, the model is trained using distance-based loss functions that exploit 3D structural information to enhance binding site prediction. Comprehensive evaluations demonstrate that GeoPep significantly outperforms existing methods in protein-peptide binding site prediction by effectively capturing sparse and heterogeneous binding patterns.
comment: 11 pages, 5 figures
☆ Elastic Architecture Search for Efficient Language Models ICME 2025
As large pre-trained language models become increasingly critical to natural language understanding (NLU) tasks, their substantial computational and memory requirements have raised significant economic and environmental concerns. Addressing these challenges, this paper introduces the Elastic Language Model (ELM), a novel neural architecture search (NAS) method optimized for compact language models. ELM extends existing NAS approaches by introducing a flexible search space with efficient transformer blocks and dynamic modules for dimension and head number adjustment. These innovations enhance the efficiency and flexibility of the search process, which facilitates more thorough and effective exploration of model architectures. We also introduce novel knowledge distillation losses that preserve the unique characteristics of each block, in order to improve the discrimination between architectural choices during the search process. Experiments on masked language modeling and causal language modeling tasks demonstrate that models discovered by ELM significantly outperform existing methods.
comment: ICME 2025
☆ Domain decomposition architectures and Gauss-Newton training for physics-informed neural networks
Approximating the solutions of boundary value problems governed by partial differential equations with neural networks is challenging, largely due to the difficult training process. This difficulty can be partly explained by the spectral bias, that is, the slower convergence of high-frequency components, and can be mitigated by localizing neural networks via (overlapping) domain decomposition. We combine this localization with the Gauss-Newton method as the optimizer to obtain faster convergence than gradient-based schemes such as Adam; this comes at the cost of solving an ill-conditioned linear system in each iteration. Domain decomposition induces a block-sparse structure in the otherwise dense Gauss-Newton system, reducing the computational cost per iteration. Our numerical results indicate that combining localization and Gauss-Newton optimization is promising for neural network-based solvers for partial differential equations.
comment: 9 pages, 5 figures
☆ Quantitative Bounds for Length Generalization in Transformers
We study the problem of length generalization (LG) in transformers: the ability of a model trained on shorter sequences to maintain performance when evaluated on much longer, previously unseen inputs. Prior work by Huang et al. (2025) established that transformers eventually achieve length generalization once the training sequence length exceeds some finite threshold, but left open the question of how large it must be. In this work, we provide the first quantitative bounds on the required training length for length generalization to occur. Motivated by previous empirical and theoretical work, we analyze LG in several distinct problem settings: $\ell_\infty$ error control vs. average error control over an input distribution, infinite-precision softmax attention vs. finite-precision attention (which reduces to an argmax) in the transformer, and one- vs. two-layer transformers. In all scenarios, we prove that LG occurs when the internal behavior of the transformer on longer sequences can be "simulated" by its behavior on shorter sequences seen during training. Our bounds give qualitative estimates for the length of training data required for a transformer to generalize, and we verify these insights empirically. These results sharpen our theoretical understanding of the mechanisms underlying extrapolation in transformers, and formalize the intuition that richer training data is required for generalization on more complex tasks.
comment: Equal contribution, order determined by coin flip
☆ Enhancing Sentiment Classification with Machine Learning and Combinatorial Fusion
This paper presents a novel approach to sentiment classification using the application of Combinatorial Fusion Analysis (CFA) to integrate an ensemble of diverse machine learning models, achieving state-of-the-art accuracy on the IMDB sentiment analysis dataset of 97.072\%. CFA leverages the concept of cognitive diversity, which utilizes rank-score characteristic functions to quantify the dissimilarity between models and strategically combine their predictions. This is in contrast to the common process of scaling the size of individual models, and thus is comparatively efficient in computing resource use. Experimental results also indicate that CFA outperforms traditional ensemble methods by effectively computing and employing model diversity. The approach in this paper implements the combination of a transformer-based model of the RoBERTa architecture with traditional machine learning models, including Random Forest, SVM, and XGBoost.
comment: IEEE PICom 2025
☆ Causal Masking on Spatial Data: An Information-Theoretic Case for Learning Spatial Datasets with Unimodal Language Models NeurIPS 2025
Language models are traditionally designed around causal masking. In domains with spatial or relational structure, causal masking is often viewed as inappropriate, and sequential linearizations are instead used. Yet the question of whether it is viable to accept the information loss introduced by causal masking on nonsequential data has received little direct study, in part because few domains offer both spatial and sequential representations of the same dataset. In this work, we investigate this issue in the domain of chess, which naturally supports both representations. We train language models with bidirectional and causal self-attention mechanisms on both spatial (board-based) and sequential (move-based) data. Our results show that models trained on spatial board states - \textit{even with causal masking} - consistently achieve stronger playing strength than models trained on sequential data. While our experiments are conducted on chess, our results are methodological and may have broader implications: applying causal masking to spatial data is a viable procedure for training unimodal LLMs on spatial data, and in some domains is even preferable to sequentialization.
comment: 8 pages, NeurIPS 2025
☆ Mixture-of-Transformers Learn Faster: A Theoretical Study on Classification Problems
Mixture-of-Experts (MoE) models improve transformer efficiency but lack a unified theoretical explanation, especially when both feed-forward and attention layers are allowed to specialize. To this end, we study the Mixture-of-Transformers (MoT), a tractable theoretical framework in which each transformer block acts as an expert governed by a continuously trained gating network. This design allows us to isolate and study the core learning dynamics of expert specialization and attention alignment. In particular, we develop a three-stage training algorithm with continuous training of the gating network, and show that each transformer expert specializes in a distinct class of tasks and that the gating network accurately routes data samples to the correct expert. Our analysis shows how expert specialization reduces gradient conflicts and makes each subtask strongly convex. We prove that the training drives the expected prediction loss to near zero in $O(\log(\epsilon^{-1}))$ iteration steps, significantly improving over the $O(\epsilon^{-1})$ rate for a single transformer. We further validate our theoretical findings through extensive real-data experiments, demonstrating the practical effectiveness of MoT. Together, these results offer the first unified theoretical account of transformer-level specialization and learning dynamics, providing practical guidance for designing efficient large-scale models.
☆ Are Online Sports Fan Communities Becoming More Offensive? A Quantitative Review of Topics, Trends, and Toxicity of r/PremierLeague
Online communities for sports fans have surged in popularity, with Reddit's r/PremierLeague emerging as a focal point for fans of one of the globe's most celebrated sports leagues. This boom has helped the Premier League make significant inroads into the US market, increasing viewership and sparking greater interest in its matches. Despite the league's broad appeal, there's still a notable gap in understanding its online fan community. Therefore, we analyzed a substantial dataset of over 1.1 million comments posted from 2013-2022 on r/PremierLeague. Our study delves into the sentiment, topics, and toxicity of these discussions, tracking trends over time, aiming to map out the conversation landscape. The rapid expansion has brought more diverse discussions, but also a worrying rise in negative sentiment and toxicity. Additionally, the subreddit has become a venue for users to voice frustrations about broader societal issues like racism, the COVID-19 pandemic, and political tensions.
☆ Jasmine: A Simple, Performant and Scalable JAX-based World Modeling Codebase
While world models are increasingly positioned as a pathway to overcoming data scarcity in domains such as robotics, open training infrastructure for world modeling remains nascent. We introduce Jasmine, a performant JAX-based world modeling codebase that scales from single hosts to hundreds of accelerators with minimal code changes. Jasmine achieves an order-of-magnitude faster reproduction of the CoinRun case study compared to prior open implementations, enabled by performance optimizations across data loading, training and checkpointing. The codebase guarantees fully reproducible training and supports diverse sharding configurations. By pairing Jasmine with curated large-scale datasets, we establish infrastructure for rigorous benchmarking pipelines across model families and architectural ablations.
comment: Blog post: https://pdoom.org/jasmine.html
☆ A Framework for Fair Evaluation of Variance-Aware Bandit Algorithms
Multi-armed bandit (MAB) problems serve as a fundamental building block for more complex reinforcement learning algorithms. However, evaluating and comparing MAB algorithms remains challenging due to the lack of standardized conditions and replicability. This is particularly problematic for variance-aware extensions of classical methods like UCB, whose performance can heavily depend on the underlying environment. In this study, we address how performance differences between bandit algorithms can be reliably observed, and under what conditions variance-aware algorithms outperform classical ones. We present a reproducible evaluation designed to systematically compare eight classical and variance-aware MAB algorithms. The evaluation framework, implemented in our Bandit Playground codebase, features clearly defined experimental setups, multiple performance metrics (reward, regret, reward distribution, value-at-risk, and action optimality), and an interactive evaluation interface that supports consistent and transparent analysis. We show that variance-aware algorithms can offer advantages in settings with high uncertainty where the difficulty arises from subtle differences between arm rewards. In contrast, classical algorithms often perform equally well or better in more separable scenarios or if fine-tuned extensively. Our contributions are twofold: (1) a framework for systematic evaluation of MAB algorithms, and (2) insights into the conditions under which variance-aware approaches outperform their classical counterparts.
comment: Preprint of a paper presented at GI Skill 2025. The final version will appear in the conference proceedings
☆ Gradient Descent as Loss Landscape Navigation: a Normative Framework for Deriving Learning Rules NeurIPS 2025
Learning rules -- prescriptions for updating model parameters to improve performance -- are typically assumed rather than derived. Why do some learning rules work better than others, and under what assumptions can a given rule be considered optimal? We propose a theoretical framework that casts learning rules as policies for navigating (partially observable) loss landscapes, and identifies optimal rules as solutions to an associated optimal control problem. A range of well-known rules emerge naturally within this framework under different assumptions: gradient descent from short-horizon optimization, momentum from longer-horizon planning, natural gradients from accounting for parameter space geometry, non-gradient rules from partial controllability, and adaptive optimizers like Adam from online Bayesian inference of loss landscape shape. We further show that continual learning strategies like weight resetting can be understood as optimal responses to task uncertainty. By unifying these phenomena under a single objective, our framework clarifies the computational structure of learning and offers a principled foundation for designing adaptive algorithms.
comment: Accepted to NeurIPS 2025
☆ LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval
Large language models (LLMs) excel at numerical estimation but struggle to correctly quantify uncertainty. We study how well LLMs construct confidence intervals around their own answers and find that they are systematically overconfident. To evaluate this behavior, we introduce FermiEval, a benchmark of Fermi-style estimation questions with a rigorous scoring rule for confidence interval coverage and sharpness. Across several modern models, nominal 99\% intervals cover the true answer only 65\% of the time on average. With a conformal prediction based approach that adjusts the intervals, we obtain accurate 99\% observed coverage, and the Winkler interval score decreases by 54\%. We also propose direct log-probability elicitation and quantile adjustment methods, which further reduce overconfidence at high confidence levels. Finally, we develop a perception-tunnel theory explaining why LLMs exhibit overconfidence: when reasoning under uncertainty, they act as if sampling from a truncated region of their inferred distribution, neglecting its tails.
comment: 8 pages
☆ HADSF: Aspect Aware Semantic Control for Explainable Recommendation WSDM 2026
Recent advances in large language models (LLMs) promise more effective information extraction for review-based recommender systems, yet current methods still (i) mine free-form reviews without scope control, producing redundant and noisy representations, (ii) lack principled metrics that link LLM hallucination to downstream effectiveness, and (iii) leave the cost-quality trade-off across model scales largely unexplored. We address these gaps with the Hyper-Adaptive Dual-Stage Semantic Framework (HADSF), a two-stage approach that first induces a compact, corpus-level aspect vocabulary via adaptive selection and then performs vocabulary-guided, explicitly constrained extraction of structured aspect-opinion triples. To assess the fidelity of the resulting representations, we introduce Aspect Drift Rate (ADR) and Opinion Fidelity Rate (OFR) and empirically uncover a nonmonotonic relationship between hallucination severity and rating prediction error. Experiments on approximately 3 million reviews across LLMs spanning 1.5B-70B parameters show that, when integrated into standard rating predictors, HADSF yields consistent reductions in prediction error and enables smaller models to achieve competitive performance in representative deployment scenarios. We release code, data pipelines, and metric implementations to support reproducible research on hallucination-aware, LLM-enhanced explainable recommendation. Code is available at https://github.com/niez233/HADSF
comment: Accepted by WSDM 2026
☆ Fine-Grained Iterative Adversarial Attacks with Limited Computation Budget
This work tackles a critical challenge in AI safety research under limited compute: given a fixed computation budget, how can one maximize the strength of iterative adversarial attacks? Coarsely reducing the number of attack iterations lowers cost but substantially weakens effectiveness. To fulfill the attainable attack efficacy within a constrained budget, we propose a fine-grained control mechanism that selectively recomputes layer activations across both iteration-wise and layer-wise levels. Extensive experiments show that our method consistently outperforms existing baselines at equal cost. Moreover, when integrated into adversarial training, it attains comparable performance with only 30% of the original budget.
☆ Predicting Household Water Consumption Using Satellite and Street View Images in Two Indian Cities
Monitoring household water use in rapidly urbanizing regions is hampered by costly, time-intensive enumeration methods and surveys. We investigate whether publicly available imagery-satellite tiles, Google Street View (GSV) segmentation-and simple geospatial covariates (nightlight intensity, population density) can be utilized to predict household water consumption in Hubballi-Dharwad, India. We compare four approaches: survey features (benchmark), CNN embeddings (satellite, GSV, combined), and GSV semantic maps with auxiliary data. Under an ordinal classification framework, GSV segmentation plus remote-sensing covariates achieves 0.55 accuracy for water use, approaching survey-based models (0.59 accuracy). Error analysis shows high precision at extremes of the household water consumption distribution, but confusion among middle classes is due to overlapping visual proxies. We also compare and contrast our estimates for household water consumption to that of household subjective income. Our findings demonstrate that open-access imagery, coupled with minimal geospatial data, offers a promising alternative to obtaining reliable household water consumption estimates using surveys in urban analytics.
☆ Can machines think efficiently?
The Turing Test is no longer adequate for distinguishing human and machine intelligence. With advanced artificial intelligence systems already passing the original Turing Test and contributing to serious ethical and environmental concerns, we urgently need to update the test. This work expands upon the original imitation game by accounting for an additional factor: the energy spent answering the questions. By adding the constraint of energy, the new test forces us to evaluate intelligence through the lens of efficiency, connecting the abstract problem of thinking to the concrete reality of finite resources. Further, this proposed new test ensures the evaluation of intelligence has a measurable, practical finish line that the original test lacks. This additional constraint compels society to weigh the time savings of using artificial intelligence against its total resource cost.
☆ Mind the Gaps: Auditing and Reducing Group Inequity in Large-Scale Mobility Prediction
Next location prediction underpins a growing number of mobility, retail, and public-health applications, yet its societal impacts remain largely unexplored. In this paper, we audit state-of-the-art mobility prediction models trained on a large-scale dataset, highlighting hidden disparities based on user demographics. Drawing from aggregate census data, we compute the difference in predictive performance on racial and ethnic user groups and show a systematic disparity resulting from the underlying dataset, resulting in large differences in accuracy based on location and user groups. To address this, we propose Fairness-Guided Incremental Sampling (FGIS), a group-aware sampling strategy designed for incremental data collection settings. Because individual-level demographic labels are unavailable, we introduce Size-Aware K-Means (SAKM), a clustering method that partitions users in latent mobility space while enforcing census-derived group proportions. This yields proxy racial labels for the four largest groups in the state: Asian, Black, Hispanic, and White. Built on these labels, our sampling algorithm prioritizes users based on expected performance gains and current group representation. This method incrementally constructs training datasets that reduce demographic performance gaps while preserving overall accuracy. Our method reduces total disparity between groups by up to 40\% with minimal accuracy trade-offs, as evaluated on a state-of-art MetaPath2Vec model and a transformer-encoder model. Improvements are most significant in early sampling stages, highlighting the potential for fairness-aware strategies to deliver meaningful gains even in low-resource settings. Our findings expose structural inequities in mobility prediction pipelines and demonstrate how lightweight, data-centric interventions can improve fairness with little added complexity, especially for low-data applications.
☆ MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models NeurIPS 2025
Large Vision-Language Models (LVLMs) have exhibited remarkable progress. However, deficiencies remain compared to human intelligence, such as hallucination and shallow pattern matching. In this work, we aim to evaluate a fundamental yet underexplored intelligence: association, a cornerstone of human cognition for creative thinking and knowledge integration. Current benchmarks, often limited to closed-ended tasks, fail to capture the complexity of open-ended association reasoning vital for real-world applications. To address this, we present MM-OPERA, a systematic benchmark with 11,497 instances across two open-ended tasks: Remote-Item Association (RIA) and In-Context Association (ICA), aligning association intelligence evaluation with human psychometric principles. It challenges LVLMs to resemble the spirit of divergent thinking and convergent associative reasoning through free-form responses and explicit reasoning paths. We deploy tailored LLM-as-a-Judge strategies to evaluate open-ended outputs, applying process-reward-informed judgment to dissect reasoning with precision. Extensive empirical studies on state-of-the-art LVLMs, including sensitivity analysis of task instances, validity analysis of LLM-as-a-Judge strategies, and diversity analysis across abilities, domains, languages, cultures, etc., provide a comprehensive and nuanced understanding of the limitations of current LVLMs in associative reasoning, paving the way for more human-like and general-purpose AI. The dataset and code are available at https://github.com/MM-OPERA-Bench/MM-OPERA.
comment: NeurIPS 2025 Datasets and Benchmarks Track poster
☆ Discovering EV Charging Site Archetypes Through Few Shot Forecasting: The First U.S.-Wide Study NeurIPS 2025
The decarbonization of transportation relies on the widespread adoption of electric vehicles (EVs), which requires an accurate understanding of charging behavior to ensure cost-effective, grid-resilient infrastructure. Existing work is constrained by small-scale datasets, simple proximity-based modeling of temporal dependencies, and weak generalization to sites with limited operational history. To overcome these limitations, this work proposes a framework that integrates clustering with few-shot forecasting to uncover site archetypes using a novel large-scale dataset of charging demand. The results demonstrate that archetype-specific expert models outperform global baselines in forecasting demand at unseen sites. By establishing forecast performance as a basis for infrastructure segmentation, we generate actionable insights that enable operators to lower costs, optimize energy and pricing strategies, and support grid resilience critical to climate goals.
comment: Tackling Climate Change with Machine Learning: Workshop at NeurIPS 2025
☆ Integrating Ontologies with Large Language Models for Enhanced Control Systems in Chemical Engineering
This work presents an ontology-integrated large language model (LLM) framework for chemical engineering that unites structured domain knowledge with generative reasoning. The proposed pipeline aligns model training and inference with the COPE ontology through a sequence of data acquisition, semantic preprocessing, information extraction, and ontology mapping steps, producing templated question-answer pairs that guide fine-tuning. A control-focused decoding stage and citation gate enforce syntactic and factual grounding by constraining outputs to ontology-linked terms, while evaluation metrics quantify both linguistic quality and ontological accuracy. Feedback and future extensions, including semantic retrieval and iterative validation, further enhance the system's interpretability and reliability. This integration of symbolic structure and neural generation provides a transparent, auditable approach for applying LLMs to process control, safety analysis, and other critical engineering contexts.
comment: Presented as a talk at the 2025 AIChE Annual Conference
☆ BI-DCGAN: A Theoretically Grounded Bayesian Framework for Efficient and Diverse GANs
Generative Adversarial Networks (GANs) are proficient at generating synthetic data but continue to suffer from mode collapse, where the generator produces a narrow range of outputs that fool the discriminator but fail to capture the full data distribution. This limitation is particularly problematic, as generative models are increasingly deployed in real-world applications that demand both diversity and uncertainty awareness. In response, we introduce BI-DCGAN, a Bayesian extension of DCGAN that incorporates model uncertainty into the generative process while maintaining computational efficiency. BI-DCGAN integrates Bayes by Backprop to learn a distribution over network weights and employs mean-field variational inference to efficiently approximate the posterior distribution during GAN training. We establishes the first theoretical proof, based on covariance matrix analysis, that Bayesian modeling enhances sample diversity in GANs. We validate this theoretical result through extensive experiments on standard generative benchmarks, demonstrating that BI-DCGAN produces more diverse and robust outputs than conventional DCGANs, while maintaining training efficiency. These findings position BI-DCGAN as a scalable and timely solution for applications where both diversity and uncertainty are critical, and where modern alternatives like diffusion models remain too resource-intensive.
☆ The Denario project: Deep knowledge AI agents for scientific discovery
We present Denario, an AI multi-agent system designed to serve as a scientific research assistant. Denario can perform many different tasks, such as generating ideas, checking the literature, developing research plans, writing and executing code, making plots, and drafting and reviewing a scientific paper. The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or carrying out end-to-end scientific analysis using Cmbagent as a deep-research backend. In this work, we describe in detail Denario and its modules, and illustrate its capabilities by presenting multiple AI-generated papers generated by it in many different scientific disciplines such as astrophysics, biology, biophysics, biomedical informatics, chemistry, material science, mathematical physics, medicine, neuroscience and planetary science. Denario also excels at combining ideas from different disciplines, and we illustrate this by showing a paper that applies methods from quantum physics and machine learning to astrophysical data. We report the evaluations performed on these papers by domain experts, who provided both numerical scores and review-like feedback. We then highlight the strengths, weaknesses, and limitations of the current system. Finally, we discuss the ethical implications of AI-driven research and reflect on how such technology relates to the philosophy of science. We publicly release the code at https://github.com/AstroPilot-AI/Denario. A Denario demo can also be run directly on the web at https://huggingface.co/spaces/astropilot-ai/Denario, and the full app will be deployed on the cloud.
comment: 272 pages. Examples of 11 AI-generated paper drafts from different scientific disciplines. Code publicly available at https://github.com/AstroPilot-AI/Denario
☆ Inverse Knowledge Search over Verifiable Reasoning: Synthesizing a Scientific Encyclopedia from a Long Chains-of-Thought Knowledge Base
Most scientific materials compress reasoning, presenting conclusions while omitting the derivational chains that justify them. This compression hinders verification by lacking explicit, step-wise justifications and inhibits cross-domain links by collapsing the very pathways that establish the logical and causal connections between concepts. We introduce a scalable framework that decompresses scientific reasoning, constructing a verifiable Long Chain-of-Thought (LCoT) knowledge base and projecting it into an emergent encyclopedia, SciencePedia. Our pipeline operationalizes an endpoint-driven, reductionist strategy: a Socratic agent, guided by a curriculum of around 200 courses, generates approximately 3 million first-principles questions. To ensure high fidelity, multiple independent solver models generate LCoTs, which are then rigorously filtered by prompt sanitization and cross-model answer consensus, retaining only those with verifiable endpoints. This verified corpus powers the Brainstorm Search Engine, which performs inverse knowledge search -- retrieving diverse, first-principles derivations that culminate in a target concept. This engine, in turn, feeds the Plato synthesizer, which narrates these verified chains into coherent articles. The initial SciencePedia comprises approximately 200,000 fine-grained entries spanning mathematics, physics, chemistry, biology, engineering, and computation. In evaluations across six disciplines, Plato-synthesized articles (conditioned on retrieved LCoTs) exhibit substantially higher knowledge-point density and significantly lower factual error rates than an equally-prompted baseline without retrieval (as judged by an external LLM). Built on this verifiable LCoT knowledge base, this reasoning-centric approach enables trustworthy, cross-domain scientific synthesis at scale and establishes the foundation for an ever-expanding encyclopedia.
comment: 43 pages, 4 figures
♻ ☆ UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection
The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and codes will be made publicly available at https://github.com/quanlin-wu/unisite.
♻ ☆ S-CFE: Simple Counterfactual Explanations
We study the problem of finding optimal sparse, manifold-aligned counterfactual explanations for classifiers. Canonically, this can be formulated as an optimization problem with multiple non-convex components, including classifier loss functions and manifold alignment (or \emph{plausibility}) metrics. The added complexity of enforcing \emph{sparsity}, or shorter explanations, complicates the problem further. Existing methods often focus on specific models and plausibility measures, relying on convex $\ell_1$ regularizers to enforce sparsity. In this paper, we tackle the canonical formulation using the accelerated proximal gradient (APG) method, a simple yet efficient first-order procedure capable of handling smooth non-convex objectives and non-smooth $\ell_p$ (where $0 \leq p < 1$) regularizers. This enables our approach to seamlessly incorporate various classifiers and plausibility measures while producing sparser solutions. Our algorithm only requires differentiable data-manifold regularizers and supports box constraints for bounded feature ranges, ensuring the generated counterfactuals remain \emph{actionable}. Finally, experiments on real-world datasets demonstrate that our approach effectively produces sparse, manifold-aligned counterfactual explanations while maintaining proximity to the factual data and computational efficiency.
♻ ☆ Direct Debiased Machine Learning via Bregman Divergence Minimization
We develop a direct debiased machine learning framework comprising Neyman targeted estimation and generalized Riesz regression. Our framework unifies Riesz regression for automatic debiased machine learning, covariate balancing, targeted maximum likelihood estimation (TMLE), and density-ratio estimation. In many problems involving causal effects or structural models, the parameters of interest depend on regression functions. Plugging regression functions estimated by machine learning methods into the identifying equations can yield poor performance because of first-stage bias. To reduce such bias, debiased machine learning employs Neyman orthogonal estimating equations. Debiased machine learning typically requires estimation of the Riesz representer and the regression function. For this problem, we develop a direct debiased machine learning framework with an end-to-end algorithm. We formulate estimation of the nuisance parameters, the regression function and the Riesz representer, as minimizing the discrepancy between Neyman orthogonal scores computed with known and unknown nuisance parameters, which we refer to as Neyman targeted estimation. Neyman targeted estimation includes Riesz representer estimation, and we measure discrepancies using the Bregman divergence. The Bregman divergence encompasses various loss functions as special cases, where the squared loss yields Riesz regression and the Kullback-Leibler divergence yields entropy balancing. We refer to this Riesz representer estimation as generalized Riesz regression. Neyman targeted estimation also yields TMLE as a special case for regression function estimation. Furthermore, for specific pairs of models and Riesz representer estimation methods, we can automatically obtain the covariate balancing property without explicitly solving the covariate balancing objective.
♻ ☆ Adversarial generalization of unfolding (model-based) networks NeurIPS2025
Unfolding networks are interpretable networks emerging from iterative algorithms, incorporate prior knowledge of data structure, and are designed to solve inverse problems like compressed sensing, which deals with recovering data from noisy, missing observations. Compressed sensing finds applications in critical domains, from medical imaging to cryptography, where adversarial robustness is crucial to prevent catastrophic failures. However, a solid theoretical understanding of the performance of unfolding networks in the presence of adversarial attacks is still in its infancy. In this paper, we study the adversarial generalization of unfolding networks when perturbed with $l_2$-norm constrained attacks, generated by the fast gradient sign method. Particularly, we choose a family of state-of-the-art overaparameterized unfolding networks and deploy a new framework to estimate their adversarial Rademacher complexity. Given this estimate, we provide adversarial generalization error bounds for the networks under study, which are tight with respect to the attack level. To our knowledge, this is the first theoretical analysis on the adversarial generalization of unfolding networks. We further present a series of experiments on real-world data, with results corroborating our derived theory, consistently for all data. Finally, we observe that the family's overparameterization can be exploited to promote adversarial robustness, shedding light on how to efficiently robustify neural networks.
comment: Accepted at NeurIPS2025
♻ ☆ Advancing Local Clustering on Graphs via Compressive Sensing: Semi-supervised and Unsupervised Methods
Local clustering aims to identify specific substructures within a large graph without any additional structural information of the graph. These substructures are typically small compared to the overall graph, enabling the problem to be approached by finding a sparse solution to a linear system associated with the graph Laplacian. In this work, we first propose a method for identifying specific local clusters when very few labeled data are given, which we term semi-supervised local clustering. We then extend this approach to the unsupervised setting when no prior information on labels is available. The proposed methods involve randomly sampling the graph, applying diffusion through local cluster extraction, then examining the overlap among the results to find each cluster. We establish the co-membership conditions for any pair of nodes, and rigorously prove the correctness of our methods. Additionally, we conduct extensive experiments to demonstrate that the proposed methods achieve state of the art results in the low-label rates regime.
♻ ☆ Partially-Supervised Neural Network Model For Quadratic Multiparametric Programming
Neural Networks (NN) with ReLU activation functions are used to model multiparametric quadratic optimization problems (mp-QP) in diverse engineering applications. Researchers have suggested leveraging the piecewise affine property of deep NN models to solve mp-QP with linear constraints, which also exhibit piecewise affine behaviour. However, traditional deep NN applications to mp-QP fall short of providing optimal and feasible predictions, even when trained on large datasets. This study proposes a partially-supervised NN (PSNN) architecture that directly represents the mathematical structure of the global solution function. In contrast to generic NN training approaches, the proposed PSNN method derives a large proportion of model weights directly from the mathematical properties of the optimization problem, producing more accurate solutions despite significantly smaller training data sets. Many energy management problems are formulated as QP, so we apply the proposed approach to energy systems (specifically DC optimal power flow) to demonstrate proof of concept. Model performance in terms of solution accuracy and speed of predictions was compared against a commercial solver and a generic Deep NN model based on classical training. Results show KKT sufficient conditions for PSNN consistently outperform generic NN architectures with classical training using far less data, including when tested on extreme, out-of-training distribution test data. Given its speed advantages over traditional solvers, the PSNN model can quickly produce optimal and feasible solutions within a second for millions of input parameters sampled from a distribution of stochastic demands and renewable generator dispatches, which can be used for simulations and long term planning.
comment: 36 pages including references and appendix
♻ ☆ Integrating Protein Sequence and Expression Level to Analysis Molecular Characterization of Breast Cancer Subtypes
Breast cancer's complexity and variability pose significant challenges in understanding its progression and guiding effective treatment. This study aims to integrate protein sequence data with expression levels to improve the molecular characterization of breast cancer subtypes and predict clinical outcomes. Using ProtGPT2, a language model specifically designed for protein sequences, we generated embeddings that capture the functional and structural properties of proteins. These embeddings were integrated with protein expression levels to form enriched biological representations, which were analyzed using machine learning methods, such as ensemble K-means for clustering and XGBoost for classification. Our approach enabled the successful clustering of patients into biologically distinct groups and accurately predicted clinical outcomes such as survival and biomarker status, achieving high performance metrics, notably an F1 score of 0.88 for survival and 0.87 for biomarker status prediction. Feature importance analysis identified KMT2C, CLASP2, and MYO1B as key proteins involved in hormone signaling, cytoskeletal remodeling, and therapy resistance in hormone receptor-positive and triple-negative breast cancer, with potential influence on breast cancer subtype behavior and progression. Furthermore, protein-protein interaction networks and correlation analyses revealed functional interdependencies among proteins that may influence the behavior and progression of breast cancer subtypes. These findings suggest that integrating protein sequence and expression data provides valuable insights into tumor biology and has significant potential to enhance personalized treatment strategies in breast cancer care.
♻ ☆ When Kernels Multiply, Clusters Unify: Fusing Embeddings with the Kronecker Product
State-of-the-art embeddings often capture distinct yet complementary discriminative features: For instance, one image embedding model may excel at distinguishing fine-grained textures, while another focuses on object-level structure. Motivated by this observation, we propose a principled approach to fuse such complementary representations through kernel multiplication. Multiplying the kernel similarity functions of two embeddings allows their discriminative structures to interact, producing a fused representation whose kernel encodes the union of the clusters identified by each parent embedding. This formulation also provides a natural way to construct joint kernels for paired multi-modal data (e.g., image-text tuples), where the product of modality-specific kernels inherits structure from both domains. We highlight that this kernel product is mathematically realized via the Kronecker product of the embedding feature maps, yielding our proposed KrossFuse framework for embedding fusion. To address the computational cost of the resulting high-dimensional Kronecker space, we further develop RP-KrossFuse, a scalable variant that leverages random projections for efficient approximation. As a key application, we use this framework to bridge the performance gap between cross-modal embeddings (e.g., CLIP, BLIP) and unimodal experts (e.g., DINOv2, E5). Experiments show that RP-KrossFuse effectively integrates these models, enhancing modality-specific performance while preserving cross-modal alignment. The project code is available at https://github.com/yokiwuuu/KrossFuse.
♻ ☆ RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models: https://huggingface.co/collections/nvidia/reward-models-10-2025
comment: Added link to access models: https://huggingface.co/collections/nvidia/reward-models-10-2025
♻ ☆ Curriculum Abductive Learning NeurIPS 2025
Abductive Learning (ABL) integrates machine learning with logical reasoning in a loop: a learning model predicts symbolic concept labels from raw inputs, which are revised through abduction using domain knowledge and then fed back for retraining. However, due to the nondeterminism of abduction, the training process often suffers from instability, especially when the knowledge base is large and complex, resulting in a prohibitively large abduction space. While prior works focus on improving candidate selection within this space, they typically treat the knowledge base as a static black box. In this work, we propose Curriculum Abductive Learning (C-ABL), a method that explicitly leverages the internal structure of the knowledge base to address the ABL training challenges. C-ABL partitions the knowledge base into a sequence of sub-bases, progressively introduced during training. This reduces the abduction space throughout training and enables the model to incorporate logic in a stepwise, smooth way. Experiments across multiple tasks show that C-ABL outperforms previous ABL implementations, significantly improves training stability, convergence speed, and final accuracy, especially under complex knowledge setting.
comment: Accepted by NeurIPS 2025, 22 pages, 6 figures
♻ ☆ Guided Model Merging for Hybrid Data Learning: Leveraging Centralized Data to Refine Decentralized Models WACV 2026
Current network training paradigms primarily focus on either centralized or decentralized data regimes. However, in practice, data availability often exhibits a hybrid nature, where both regimes coexist. This hybrid setting presents new opportunities for model training, as the two regimes offer complementary trade-offs: decentralized data is abundant but subject to heterogeneity and communication constraints, while centralized data, though limited in volume and potentially unrepresentative, enables better curation and high-throughput access. Despite its potential, effectively combining these paradigms remains challenging, and few frameworks are tailored to hybrid data regimes. To address this, we propose a novel framework that constructs a model atlas from decentralized models and leverages centralized data to refine a global model within this structured space. The refined model is then used to reinitialize the decentralized models. Our method synergizes federated learning (to exploit decentralized data) and model merging (to utilize centralized data), enabling effective training under hybrid data availability. Theoretically, we show that our approach achieves faster convergence than methods relying solely on decentralized data, due to variance reduction in the merging process. Extensive experiments demonstrate that our framework consistently outperforms purely centralized, purely decentralized, and existing hybrid-adaptable methods. Notably, our method remains robust even when the centralized and decentralized data domains differ or when decentralized data contains noise, significantly broadening its applicability.
comment: Accepted at WACV 2026
♻ ☆ GSE: Group-wise Sparse and Explainable Adversarial Attacks
Sparse adversarial attacks fool deep neural networks (DNNs) through minimal pixel perturbations, often regularized by the $\ell_0$ norm. Recent efforts have replaced this norm with a structural sparsity regularizer, such as the nuclear group norm, to craft group-wise sparse adversarial attacks. The resulting perturbations are thus explainable and hold significant practical relevance, shedding light on an even greater vulnerability of DNNs. However, crafting such attacks poses an optimization challenge, as it involves computing norms for groups of pixels within a non-convex objective. We address this by presenting a two-phase algorithm that generates group-wise sparse attacks within semantically meaningful areas of an image. Initially, we optimize a quasinorm adversarial loss using the $1/2-$quasinorm proximal operator tailored for non-convex programming. Subsequently, the algorithm transitions to a projected Nesterov's accelerated gradient descent with $2-$norm regularization applied to perturbation magnitudes. Rigorous evaluations on CIFAR-10 and ImageNet datasets demonstrate a remarkable increase in group-wise sparsity, e.g., $50.9\%$ on CIFAR-10 and $38.4\%$ on ImageNet (average case, targeted attack). This performance improvement is accompanied by significantly faster computation times, improved explainability, and a $100\%$ attack success rate.
♻ ☆ Understanding Generalization in Node and Link Prediction
Using message-passing graph neural networks (MPNNs) for node and link prediction is crucial in various scientific and industrial domains, which has led to the development of diverse MPNN architectures. Besides working well in practical settings, their ability to generalize beyond the training set remains poorly understood. While some studies have explored MPNNs' generalization in graph-level prediction tasks, much less attention has been given to node- and link-level predictions. Existing works often rely on unrealistic i.i.d.\@ assumptions, overlooking possible correlations between nodes or links, and assuming fixed aggregation and impractical loss functions while neglecting the influence of graph structure. In this work, we introduce a unified framework to analyze the generalization properties of MPNNs in inductive and transductive node and link prediction settings, incorporating diverse architectural parameters and loss functions and quantifying the influence of graph structure. Additionally, our proposed generalization framework can be applied beyond graphs to any classification task under the inductive or transductive setting. Our empirical study supports our theoretical insights, deepening our understanding of MPNNs' generalization capabilities in these tasks.
comment: arXiv admin note: text overlap with arXiv:2412.07106
♻ ☆ RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching
Activation patching is a standard method in mechanistic interpretability for localizing the components of a model responsible for specific behaviors, but it is computationally expensive to apply at scale. Attribution patching offers a faster, gradient-based approximation, yet suffers from noise and reduced reliability in deep, highly non-linear networks. In this work, we introduce Relevance Patching (RelP), which replaces the local gradients in attribution patching with propagation coefficients derived from Layer-wise Relevance Propagation (LRP). LRP propagates the network's output backward through the layers, redistributing relevance to lower-level components according to local propagation rules that ensure properties such as relevance conservation or improved signal-to-noise ratio. Like attribution patching, RelP requires only two forward passes and one backward pass, maintaining computational efficiency while improving faithfulness. We validate RelP across a range of models and tasks, showing that it more accurately approximates activation patching than standard attribution patching, particularly when analyzing residual stream and MLP outputs in the Indirect Object Identification (IOI) task. For instance, for MLP outputs in GPT-2 Large, attribution patching achieves a Pearson correlation of 0.006, whereas RelP reaches 0.956, highlighting the improvement offered by RelP. Additionally, we compare the faithfulness of sparse feature circuits identified by RelP and Integrated Gradients (IG), showing that RelP achieves comparable faithfulness without the extra computational cost associated with IG.
♻ ☆ PepCompass: Navigating peptide embedding spaces using Riemannian Geometry
Antimicrobial peptide discovery is challenged by the astronomical size of peptide space and the relative scarcity of active peptides. Generative models provide continuous latent "maps" of peptide space, but conventionally ignore decoder-induced geometry and rely on flat Euclidean metrics, rendering exploration and optimization distorted and inefficient. Prior manifold-based remedies assume fixed intrinsic dimensionality, which critically fails in practice for peptide data. Here, we introduce PepCompass, a geometry-aware framework for peptide exploration and optimization. At its core, we define a Union of $\kappa$-Stable Riemannian Manifolds $\mathbb{M}^{\kappa}$, a family of decoder-induced manifolds that captures local geometry while ensuring computational stability. We propose two local exploration methods: Second-Order Riemannian Brownian Efficient Sampling, which provides a convergent second-order approximation to Riemannian Brownian motion, and Mutation Enumeration in Tangent Space, which reinterprets tangent directions as discrete amino-acid substitutions. Combining these yields Local Enumeration Bayesian Optimization (LE-BO), an efficient algorithm for local activity optimization. Finally, we introduce Potential-minimizing Geodesic Search (PoGS), which interpolates between prototype embeddings along property-enriched geodesics, biasing discovery toward seeds, i.e. peptides with favorable activity. In-vitro validation confirms the effectiveness of PepCompass: PoGS yields four novel seeds, and subsequent optimization with LE-BO discovers 25 highly active peptides with broad-spectrum activity, including against resistant bacterial strains. These results demonstrate that geometry-informed exploration provides a powerful new paradigm for antimicrobial peptide design.
♻ ☆ C-LoRA: Contextual Low-Rank Adaptation for Uncertainty Estimation in Large Language Models NeurIPS 2025
Low-Rank Adaptation (LoRA) offers a cost-effective solution for fine-tuning large language models (LLMs), but it often produces overconfident predictions in data-scarce few-shot settings. To address this issue, several classical statistical learning approaches have been repurposed for scalable uncertainty-aware LoRA fine-tuning. However, these approaches neglect how input characteristics affect the predictive uncertainty estimates. To address this limitation, we propose Contextual Low-Rank Adaptation (C-LoRA) as a novel uncertainty-aware and parameter efficient fine-tuning approach, by developing new lightweight LoRA modules contextualized to each input data sample to dynamically adapt uncertainty estimates. Incorporating data-driven contexts into the parameter posteriors, C-LoRA mitigates overfitting, achieves well-calibrated uncertainties, and yields robust predictions. Extensive experiments on LLaMA2-7B models demonstrate that C-LoRA consistently outperforms the state-of-the-art uncertainty-aware LoRA methods in both uncertainty quantification and model generalization. Ablation studies further confirm the critical role of our contextual modules in capturing sample-specific uncertainties. C-LoRA sets a new standard for robust, uncertainty-aware LLM fine-tuning in few-shot regimes. Although our experiments are limited to 7B models, our method is architecture-agnostic and, in principle, applies beyond this scale; studying its scaling to larger models remains an open problem. Our code is available at https://github.com/ahra99/c_lora.
comment: Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.
comment: 20 pages, 8 figures
♻ ☆ LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback
Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.
♻ ☆ SignalLLM: A General-Purpose LLM Agent Framework for Automated Signal Processing
Modern signal processing (SP) pipelines, whether model-based or data-driven, often constrained by complex and fragmented workflow, rely heavily on expert knowledge and manual engineering, and struggle with adaptability and generalization under limited data. In contrast, Large Language Models (LLMs) offer strong reasoning capabilities, broad general-purpose knowledge, in-context learning, and cross-modal transfer abilities, positioning them as powerful tools for automating and generalizing SP workflows. Motivated by these potentials, we introduce SignalLLM, the first general-purpose LLM-based agent framework for general SP tasks. Unlike prior LLM-based SP approaches that are limited to narrow applications or tricky prompting, SignalLLM introduces a principled, modular architecture. It decomposes high-level SP goals into structured subtasks via in-context learning and domain-specific retrieval, followed by hierarchical planning through adaptive retrieval-augmented generation (RAG) and refinement; these subtasks are then executed through prompt-based reasoning, cross-modal reasoning, code synthesis, model invocation, or data-driven LLM-assisted modeling. Its generalizable design enables the flexible selection of problem solving strategies across different signal modalities, task types, and data conditions. We demonstrate the versatility and effectiveness of SignalLLM through five representative tasks in communication and sensing, such as radar target detection, human activity recognition, and text compression. Experimental results show superior performance over traditional and existing LLM-based methods, particularly in few-shot and zero-shot settings.
comment: 11 pages
♻ ☆ Differentiation Through Black-Box Quadratic Programming Solvers
Differentiable optimization has attracted significant research interest, particularly for quadratic programming (QP). Existing approaches for differentiating the solution of a QP with respect to its defining parameters often rely on specific integrated solvers. This integration limits their applicability, including their use in neural network architectures and bi-level optimization tasks, restricting users to a narrow selection of solver choices. To address this limitation, we introduce dQP, a modular and solver-agnostic framework for plug-and-play differentiation of virtually any QP solver. A key insight we leverage to achieve modularity is that, once the active set of inequality constraints is known, both the solution and its derivative can be expressed using simplified linear systems that share the same matrix. This formulation fully decouples the computation of the QP solution from its differentiation. Building on this result, we provide a minimal-overhead, open-source implementation ( https://github.com/cwmagoon/dQP ) that seamlessly integrates with over 15 state-of-the-art solvers. Comprehensive benchmark experiments demonstrate dQP's robustness and scalability, particularly highlighting its advantages in large-scale sparse problems.
♻ ☆ Smart Exploration in Reinforcement Learning using Bounded Uncertainty Models
Reinforcement learning (RL) is a powerful framework for decision-making in uncertain environments, but it often requires large amounts of data to learn an optimal policy. We address this challenge by incorporating prior model knowledge to guide exploration and accelerate the learning process. Specifically, we assume access to a model set that contains the true transition kernel and reward function. We optimize over this model set to obtain upper and lower bounds on the Q-function, which are then used to guide the exploration of the agent. We provide theoretical guarantees on the convergence of the Q-function to the optimal Q-function under the proposed class of exploring policies. Furthermore, we also introduce a data-driven regularized version of the model set optimization problem that ensures the convergence of the class of exploring policies to the optimal policy. Lastly, we show that when the model set has a specific structure, namely the bounded-parameter MDP (BMDP) framework, the regularized model set optimization problem becomes convex and simple to implement. In this setting, we also prove finite-time convergence to the optimal policy under mild assumptions. We demonstrate the effectiveness of the proposed exploration strategy, which we call BUMEX (Bounded Uncertainty Model-based Exploration), in a simulation study. The results indicate that the proposed method can significantly accelerate learning in benchmark examples. A toolbox is available at https://github.com/JvHulst/BUMEX.
comment: Accepted for Presentation at 64th IEEE Conference on Decision and Control, CDC 2025, Rio de Janeiro, Brazil, 2025
♻ ☆ Epistemic Diversity and Knowledge Collapse in Large Language Models
Large language models (LLMs) tend to generate lexically, semantically, and stylistically homogenous texts. This poses a risk of knowledge collapse, where homogenous LLMs mediate a shrinking in the range of accessible information over time. Existing works on homogenization are limited by a focus on closed-ended multiple-choice setups or fuzzy semantic features, and do not look at trends across time and cultural contexts. To overcome this, we present a new methodology to measure epistemic diversity, i.e., variation in real-world claims in LLM outputs, which we use to perform a broad empirical study of LLM knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations sourced from real user chats. For the topics in our study, we show that while newer models tend to generate more diverse claims, nearly all models are less epistemically diverse than a basic web search. We find that model size has a negative impact on epistemic diversity, while retrieval-augmented generation (RAG) has a positive impact, though the improvement from RAG varies by the cultural context. Finally, compared to a traditional knowledge source (Wikipedia), we find that country-specific claims reflect the English language more than the local one, highlighting a gap in epistemic representation
comment: 16 pages; 8 figures, 4 tables; v2 changelog: Fixed the modeling for table 3, random effect is the model version; v3 changelog: Fixed minor formatting issues in tables 2 and 3; v4 changelog: Fixed some typos and model description
♻ ☆ Incentivizing LLMs to Self-Verify Their Answers
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks through both post-training and test-time scaling laws. While prevalent test-time scaling approaches are often realized by using external reward models to guide the model generation process, we find that only marginal gains can be acquired when scaling a model post-trained on specific reasoning tasks. We identify that the limited improvement stems from distribution discrepancies between the specific post-trained generator and the general reward model. To address this, we propose a framework that incentivizes LLMs to self-verify their own answers. By unifying answer generation and verification within a single reinforcement learning (RL) process, we train models that can effectively assess the correctness of their own solutions. The trained model can further scale its performance at inference time by verifying its generations, without the need for external verifiers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B, demonstrating their capabilities across varying reasoning context lengths. Experiments on multiple mathematical reasoning benchmarks show that our models can not only improve post-training performance but also enable effective test-time scaling.
♻ ☆ Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach
While several high profile video games have served as testbeds for Deep Reinforcement Learning (DRL), this technique has rarely been employed by the game industry for crafting authentic AI behaviors. Previous research focuses on training super-human agents with large models, which is impractical for game studios with limited resources aiming for human-like agents. This paper proposes a sample-efficient DRL method tailored for training and fine-tuning agents in industrial settings such as the video game industry. Our method improves sample efficiency of value-based DRL by leveraging pre-collected data and increasing network plasticity. We evaluate our method training a goalkeeper agent in EA SPORTS FC 25, one of the best-selling football simulations today. Our agent outperforms the game's built-in AI by 10% in ball saving rate. Ablation studies show that our method trains agents 50% faster compared to standard DRL methods. Finally, qualitative evaluation from domain experts indicates that our approach creates more human-like gameplay compared to hand-crafted agents. As a testament to the impact of the approach, the method has been adopted for use in the most recent release of the series.
♻ ☆ A mathematical certification for positivity conditions in Neural Networks with applications to partial monotonicity and Trustworthy AI
Artificial Neural Networks (ANNs) have become a powerful tool for modeling complex relationships in large-scale datasets. However, their black-box nature poses trustworthiness challenges. In certain situations, ensuring trust in predictions might require following specific partial monotonicity constraints. However, certifying if an already-trained ANN is partially monotonic is challenging. Therefore, ANNs are often disregarded in some critical applications, such as credit scoring, where partial monotonicity is required. To address this challenge, this paper presents a novel algorithm (LipVor) that certifies if a black-box model, such as an ANN, is positive based on a finite number of evaluations. Consequently, since partial monotonicity can be expressed as a positivity condition on partial derivatives, LipVor can certify whether an ANN is partially monotonic. To do so, for every positively evaluated point, the Lipschitzianity of the black-box model is used to construct a specific neighborhood where the function remains positive. Next, based on the Voronoi diagram of the evaluated points, a sufficient condition is stated to certify if the function is positive in the domain. Unlike prior methods, our approach certifies partial monotonicity without constrained architectures or piece-wise linear activations. Therefore, LipVor could open up the possibility of using unconstrained ANN in some critical fields. Moreover, some other properties of an ANN, such as convexity, can be posed as positivity conditions, and therefore, LipVor could also be applied.
comment: 16 pages, 4 figures
♻ ☆ GFlowNets for Learning Better Drug-Drug Interaction Representations ICANN 2025
Drug-drug interactions pose a significant challenge in clinical pharmacology, with severe class imbalance among interaction types limiting the effectiveness of predictive models. Common interactions dominate datasets, while rare but critical interactions remain underrepresented, leading to poor model performance on infrequent cases. Existing methods often treat DDI prediction as a binary problem, ignoring class-specific nuances and exacerbating bias toward frequent interactions. To address this, we propose a framework combining Generative Flow Networks (GFlowNet) with Variational Graph Autoencoders (VGAE) to generate synthetic samples for rare classes, improving model balance and generate effective and novel DDI pairs. Our approach enhances predictive performance across interaction types, ensuring better clinical reliability.
comment: Accepted to ICANN 2025:AIDD and NeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling (https://openreview.net/forum?id=LZW1jSgfCI)
♻ ☆ Exploring the Early Universe with Deep Learning
Hydrogen is the most abundant element in our Universe. The first generation of stars and galaxies produced photons that ionized hydrogen gas, driving a cosmological event known as the Epoch of Reionization (EoR). The upcoming Square Kilometre Array Observatory (SKAO) will map the distribution of neutral hydrogen during this era, aiding in the study of the properties of these first-generation objects. Extracting astrophysical information will be challenging, as SKAO will produce a tremendous amount of data where the hydrogen signal will be contaminated with undesired foreground contamination and instrumental systematics. To address this, we develop the latest deep learning techniques to extract information from the 2D power spectra of the hydrogen signal expected from SKAO. We apply a series of neural network models to these measurements and quantify their ability to predict the history of cosmic hydrogen reionization, which is connected to the increasing number and efficiency of early photon sources. We show that the study of the early Universe benefits from modern deep learning technology. In particular, we demonstrate that dedicated machine learning algorithms can achieve more than a $0.95$ $R^2$ score on average in recovering the reionization history. This enables accurate and precise cosmological and astrophysical inference of structure formation in the early Universe.
comment: EPIA 2025 preprint version, 12 pages, 3 figures
♻ ☆ Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling NeurIPS 2025
Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS, simultaneously influencing (1) reasoning performance and (2) compute efficiency, due to the quality and computational cost of verification. In this work, we challenge the conventional paradigms of verification, and make the first attempt toward systematically investigating the impact of verification granularity-that is, how frequently the verifier is invoked during generation, beyond verifying only the final output or individual generation steps. To this end, we introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g. Extensive experiments with VG-Search under varying compute budgets, generator-verifier configurations, and task attributes reveal that dynamically selecting g can improve the compute efficiency and scaling behavior. Building on these findings, we propose adaptive VG-Search strategies that achieve accuracy gains of up to 3.1\% over Beam Search and 3.6\% over Best-of-N, while reducing FLOPs by over 52\%. We will open-source the code to support future research.
comment: Accepted at NeurIPS 2025
♻ ☆ A geometric framework for momentum-based optimizers for low-rank training
Low-rank pre-training and fine-tuning have recently emerged as promising techniques for reducing the computational and storage costs of large neural networks. Training low-rank parameterizations typically relies on conventional optimizers such as heavy ball momentum methods or Adam. In this work, we identify and analyze potential difficulties that these training methods encounter when used to train low-rank parameterizations of weights. In particular, we show that classical momentum methods can struggle to converge to a local optimum due to the geometry of the underlying optimization landscape. To address this, we introduce novel training strategies derived from dynamical low-rank approximation, which explicitly account for the underlying geometric structure. Our approach leverages and combines tools from dynamical low-rank approximation and momentum-based optimization to design optimizers that respect the intrinsic geometry of the parameter space. We validate our methods through numerical experiments, demonstrating faster convergence, and stronger validation metrics at given parameter budgets.
♻ ☆ LLM-Driven Treatment Effect Estimation Under Inference Time Text Confounding
Estimating treatment effects is crucial for personalized decision-making in medicine, but this task faces unique challenges in clinical practice. At training time, models for estimating treatment effects are typically trained on well-structured medical datasets that contain detailed patient information. However, at inference time, predictions are often made using textual descriptions (e.g., descriptions with self-reported symptoms), which are incomplete representations of the original patient information. In this work, we make three contributions. (1) We show that the discrepancy between the data available during training time and inference time can lead to biased estimates of treatment effects. We formalize this issue as an inference time text confounding problem, where confounders are fully observed during training time but only partially available through text at inference time. (2) To address this problem, we propose a novel framework for estimating treatment effects that explicitly accounts for inference time text confounding. Our framework leverages large language models together with a custom doubly robust learner to mitigate biases caused by the inference time text confounding. (3) Through a series of experiments, we demonstrate the effectiveness of our framework in real-world applications.
♻ ☆ MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks NeurIPS 2025
The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task-specific, evidence-based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi-agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open-sourced at https://medagentboard.netlify.app/.
comment: Accepted by NeurIPS 2025 Datasets & Benchmarks Track
♻ ☆ Distributed optimization: designed for federated learning
Federated learning (FL), as a distributed collaborative machine learning (ML) framework under privacy-preserving constraints, has garnered increasing research attention in cross-organizational data collaboration scenarios. This paper proposes a class of distributed optimization algorithms based on the augmented Lagrangian technique, designed to accommodate diverse communication topologies in both centralized and decentralized FL settings. Furthermore, we develop multiple termination criteria and parameter update mechanisms to enhance computational efficiency, accompanied by rigorous theoretical guarantees of convergence. By generalizing the augmented Lagrangian relaxation through the incorporation of proximal relaxation and quadratic approximation, our framework systematically recovers a broad of classical unconstrained optimization methods, including proximal algorithm, classic gradient descent, and stochastic gradient descent, among others. Notably, the convergence properties of these methods can be naturally derived within the proposed theoretical framework. Numerical experiments demonstrate that the proposed algorithm exhibits strong performance in large-scale settings with significant statistical heterogeneity across clients.
comment: 16 pages, 6 figures
♻ ☆ On the Impact of Performative Risk Minimization for Binary Random Variables ICML 2025
Performativity, the phenomenon where outcomes are influenced by predictions, is particularly prevalent in social contexts where individuals strategically respond to a deployed model. In order to preserve the high accuracy of machine learning models under distribution shifts caused by performativity, Perdomo et al. (2020) introduced the concept of performative risk minimization (PRM). While this framework ensures model accuracy, it overlooks the impact of the PRM on the underlying distributions and the predictions of the model. In this paper, we initiate the analysis of the impact of PRM, by studying performativity for a sequential performative risk minimization problem with binary random variables and linear performative shifts. We formulate two natural measures of impact. In the case of full information, where the distribution dynamics are known, we derive explicit formulas for the PRM solution and our impact measures. In the case of partial information, we provide performative-aware statistical estimators, as well as simulations. Our analysis contrasts PRM to alternatives that do not model data shift and indicates that PRM can have amplified side effects compared to such methods.
comment: ICML 2025 camera-ready
♻ ☆ Machine-learning competition to grade EEG background patterns in newborns with hypoxic-ischaemic encephalopathy
Machine learning (ML) has the potential to support and improve expert performance in monitoring the brain function of at-risk newborns. Developing accurate and reliable ML models depends on access to high-quality, annotated data, a resource in short supply. ML competitions address this need by providing researchers access to expertly annotated datasets, fostering shared learning through direct model comparisons, and leveraging the benefits of crowdsourcing diverse expertise. We compiled a retrospective dataset containing 353 hours of EEG from 102 individual newborns from a multi-centre study. The data was fully anonymised and divided into training, testing, and held-out validation datasets. EEGs were graded for the severity of abnormal background patterns. Next, we created a web-based competition platform and hosted a machine learning competition to develop ML models for classifying the severity of EEG background patterns in newborns. After the competition closed, the top 4 performing models were evaluated offline on a separate held-out validation dataset. Although a feature-based model ranked first on the testing dataset, deep learning models generalised better on the validation sets. All methods had a significant decline in validation performance compared to the testing performance. This highlights the challenges for model generalisation on unseen data, emphasising the need for held-out validation datasets in ML studies with neonatal EEG. The study underscores the importance of training ML models on large and diverse datasets to ensure robust generalisation. The competition's outcome demonstrates the potential for open-access data and collaborative ML development to foster a collaborative research environment and expedite the development of clinical decision-support tools for neonatal neuromonitoring.
comment: 29 pages, supplementary materials: "supplementary materials ML Comp.docx"
♻ ☆ Wisdom and Delusion of LLM Ensembles for Code Generation and Repair
Today's pursuit of a single Large Language Model (LMM) for all software engineering tasks is resource-intensive and overlooks the potential benefits of complementarity, where different models contribute unique strengths. However, the degree to which coding LLMs complement each other and the best strategy for maximizing an ensemble's potential are unclear, leaving practitioners without a clear path to move beyond single-model systems. To address this gap, we empirically compare ten individual LLMs from five families, and three ensembles of these LLMs across three software engineering benchmarks covering code generation and program repair. We assess the complementarity between models and the performance gap between the best individual model and the ensembles. Next, we evaluate various selection heuristics to identify correct solutions from an ensemble's candidate pool. We find that the theoretical upperbound for an ensemble's performance can be 83% above the best single model. Our results show that consensus-based strategies for selecting solutions fall into a "popularity trap," amplifying common but incorrect outputs. In contrast, a diversity-based strategy realizes up to 95% of this theoretical potential, and proves effective even in small two-model ensembles, enabling a cost-efficient way to enhance performance by leveraging multiple LLMs.
comment: Added Acknowledgments section and hyphenated last names
♻ ☆ Cybersecurity threat detection based on a UEBA framework using Deep Autoencoders
User and Entity Behaviour Analytics (UEBA) is a broad branch of data analytics that attempts to build a normal behavioural profile in order to detect anomalous events. Among the techniques used to detect anomalies, Deep Autoencoders constitute one of the most promising deep learning models on UEBA tasks, allowing explainable detection of security incidents that could lead to the leak of personal data, hijacking of systems, or access to sensitive business information. In this study, we introduce the first implementation of an explainable UEBA-based anomaly detection framework that leverages Deep Autoencoders in combination with Doc2Vec to process both numerical and textual features. Additionally, based on the theoretical foundations of neural networks, we offer a novel proof demonstrating the equivalence of two widely used definitions for fully-connected neural networks. The experimental results demonstrate the proposed framework capability to detect real and synthetic anomalies effectively generated from real attack data, showing that the models provide not only correct identification of anomalies but also explainable results that enable the reconstruction of the possible origin of the anomaly. Our findings suggest that the proposed UEBA framework can be seamlessly integrated into enterprise environments, complementing existing security systems for explainable threat detection.
comment: Published in AIMS Mathematics (2025), 10(10): 23496-23517. DOI: 10.3934/math.20251043
♻ ☆ Monopoly Deal: A Benchmark Environment for Bounded One-Sided Response Games
Card games are widely used to study sequential decision-making under uncertainty, with real-world analogues in negotiation, finance, and cybersecurity. These games typically fall into three categories based on the flow of control: strictly sequential (players alternate single actions), deterministic response (some actions trigger a fixed outcome), and unbounded reciprocal response (alternating counterplays are permitted). A less-explored but strategically rich structure is the bounded one-sided response, where a player's action briefly transfers control to the opponent, who must satisfy a fixed condition through one or more moves before the turn resolves. We term games featuring this mechanism Bounded One-Sided Response Games (BORGs). We introduce a modified version of Monopoly Deal as a benchmark environment that isolates this dynamic, where a Rent action forces the opponent to choose payment assets. The gold-standard algorithm, Counterfactual Regret Minimization (CFR), converges on effective strategies without novel algorithmic extensions. A lightweight full-stack research platform unifies the environment, a parallelized CFR runtime, and a human-playable web interface. The trained CFR agent and source code are available at https://monopolydeal.ai.
comment: 24 pages, 7 figures
♻ ☆ In Defence of Post-hoc Explainability NeurIPS 2024
This position paper defends post-hoc explainability methods as legitimate tools for scientific knowledge production in machine learning. Addressing criticism of these methods' reliability and epistemic status, we develop a philosophical framework grounded in mediated understanding and bounded factivity. We argue that scientific insights can emerge through structured interpretation of model behaviour without requiring complete mechanistic transparency, provided explanations acknowledge their approximative nature and undergo rigorous empirical validation. Through analysis of recent biomedical ML applications, we demonstrate how post-hoc methods, when properly integrated into scientific practice, generate novel hypotheses and advance phenomenal understanding.
comment: v1 presented at the Interpretable AI: Past, Present, and Future Workshop at NeurIPS 2024 (non-archival)
♻ ☆ On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
Open-vocabulary object detection (OVD) models offer remarkable flexibility by detecting objects from arbitrary text queries. However, their zero-shot performance in specialized domains like Remote Sensing (RS) is often compromised by the inherent ambiguity of natural language, limiting critical downstream applications. For instance, an OVD model may struggle to distinguish between fine-grained classes such as "fishing boat" and "yacht" since their embeddings are similar and often inseparable. This can hamper specific user goals, such as monitoring illegal fishing, by producing irrelevant detections. To address this, we propose a cascaded approach that couples the broad generalization of a large pre-trained OVD model with a lightweight few-shot classifier. Our method first employs the zero-shot model to generate high-recall object proposals. These proposals are then refined for high precision by a compact classifier trained in real-time on only a handful of user-annotated examples - drastically reducing the high costs of RS imagery annotation.The core of our framework is FLAME, a one-step active learning strategy that selects the most informative samples for training. FLAME identifies, on the fly, uncertain marginal candidates near the decision boundary using density estimation, followed by clustering to ensure sample diversity. This efficient sampling technique achieves high accuracy without costly full-model fine-tuning and enables instant adaptation, within less then a minute, which is significantly faster than state-of-the-art alternatives.Our method consistently surpasses state-of-the-art performance on RS benchmarks, establishing a practical and resource-efficient framework for adapting foundation models to specific user needs.
♻ ☆ Seeing Structural Failure Before it Happens: An Image-Based Physics-Informed Neural Network (PINN) for Spaghetti Bridge Load Prediction
Physics Informed Neural Networks (PINNs) are gaining attention for their ability to embed physical laws into deep learning models, which is particularly useful in structural engineering tasks with limited data. This paper aims to explore the use of PINNs to predict the weight of small scale spaghetti bridges, a task relevant to understanding load limits and potential failure modes in simplified structural models. Our proposed framework incorporates physics-based constraints to the prediction model for improved performance. In addition to standard PINNs, we introduce a novel architecture named Physics Informed Kolmogorov Arnold Network (PIKAN), which blends universal function approximation theory with physical insights. The structural parameters provided as input to the model are collected either manually or through computer vision methods. Our dataset includes 15 real bridges, augmented to 100 samples, and our best model achieves an $R^2$ score of 0.9603 and a mean absolute error (MAE) of 10.50 units. From applied perspective, we also provide a web based interface for parameter entry and prediction. These results show that PINNs can offer reliable estimates of structural weight, even with limited data, and may help inform early stage failure analysis in lightweight bridge designs. The complete data and code are available at https://github.com/OmerJauhar/PINNS-For-Spaghetti-Bridges.
comment: 12 pages, 17 figures. Preprint
♻ ☆ TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
Distributed inference of large language models (LLMs) can introduce overheads of up to 20% even over GPUs connected via high-speed interconnects such as NVLink. Multiple techniques have been proposed to mitigate these overheads by decomposing computations into finer-grained tasks and overlapping communication with sub-tasks as they complete. However, fine-grained decomposition of a large computation into many smaller computations on GPUs results in overheads. Furthermore, the communication itself uses many streaming multiprocessors (SMs), adding to the overhead. We present TokenWeave to address these challenges. TokenWeave proposes a Token-Splitting technique that divides the tokens in the inference batch into two approximately equal subsets in a wave-aware manner. The communication of one subset is then overlapped with the computation of the other. In addition, TokenWeave optimizes the order of the layer normalization computation with respect to communication operations and implements a novel fused AllReduce--RMSNorm kernel that carefully leverages Multimem instruction support available on Hopper and Blackwell NVIDIA GPUs. These optimizations allow TokenWeave to perform communication and RMSNorm using only 2-8 SMs. Moreover, our kernel enables the memory-bound RMSNorm to be overlapped with the other batch's computation, providing additional gains. Our evaluations demonstrate up to 1.29x speedup in latency and 1.26x higher throughput across multiple models and workloads. In several settings, TokenWeave results in better performance compared to an equivalent model with all communication removed.
comment: 14 pages, 16 figures. For source code, see https://github.com/microsoft/tokenweave. In version 2, Figure 6 shows All-Reduce bandwidth instead of Reduce-Scatter. The Multimem Reduce-Scatter bandwidth formula differs slightly from the ring-based version. Fixed x-ticks in Figure 7
♻ ☆ Nearly Minimax Discrete Distribution Estimation in Kullback-Leibler Divergence with High Probability
We consider the fundamental problem of estimating a discrete distribution on a domain of size~$K$ with high probability in Kullback-Leibler divergence. We provide upper and lower bounds on the minimax estimation rate, which show that the optimal rate is between $\big(K + \ln(K)\ln(1/\delta)\big) /n$ and $\big(K\ln\ln(K) + \ln(K)\ln(1/\delta)\big) /n$ at error probability $\delta$ and sample size $n$, which pins down the rate up to the doubly logarithmic factor $\ln \ln K$ that multiplies $K$. Our upper bound uses techniques from online learning to construct a novel estimator via online-to-batch conversion. Perhaps surprisingly, the tail behavior of the minimax rate is worse than for the squared total variation and squared Hellinger distance, for which it is $\big(K + \ln(1/\delta)\big) /n$, i.e.\ without the $\ln K$ multiplying $\ln (1/\delta)$. As a consequence, we cannot obtain a fully tight lower bound from the usual reduction to these smaller distances. Moreover, we show that this lower bound cannot be achieved by the standard lower bound approach based on a reduction to hypothesis testing, and instead we need to introduce a new reduction to what we call weak hypothesis testing. We investigate the source of the gap with other divergences further in refined results, which show that the total variation rate is achievable for Kullback-Leibler divergence after all (in fact by he maximum likelihood estimator) if we rule out outcome probabilities smaller than $O(\ln(K/\delta) / n)$, which is a vanishing set as $n$ increases for fixed $K$ and~$\delta$. This explains why minimax Kullback-Leibler estimation is more difficult than asymptotic estimation.
♻ ☆ Predictive Causal Inference via Spatio-Temporal Modeling and Penalized Empirical Likelihood
This study introduces an integrated framework for predictive causal inference designed to overcome limitations inherent in conventional single model approaches. Specifically, we combine a Hidden Markov Model (HMM) for spatial health state estimation with a Multi Task and Multi Graph Convolutional Network (MTGCN) for capturing temporal outcome trajectories. The framework asymmetrically treats temporal and spatial information regarding them as endogenous variables in the outcome regression, and exogenous variables in the propensity score model, thereby expanding the standard doubly robust treatment effect estimation to jointly enhance bias correction and predictive accuracy. To demonstrate its utility, we focus on clinical domains such as cancer, dementia, and Parkinson disease, where treatment effects are challenging to observe directly. Simulation studies are conducted to emulate latent disease dynamics and evaluate the model performance under varying conditions. Overall, the proposed framework advances predictive causal inference by structurally adapting to spatiotemporal complexities common in biomedical data.
♻ ☆ Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts
In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity. For example, only about 600 transcripts are available in AphasiaBank yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when such are sparse. Therefore, this study constructs and validates two methods to generate synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method leverages a procedural programming approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The methods generate transcripts across four severity levels (Mild, Moderate, Severe, Very Severe) through word dropping, filler insertion, and paraphasia substitution. Overall, we found, compared to human-elicited transcripts, Mistral 7b Instruct best captures key aspects of linguistic degradation observed in aphasia, showing realistic directional changes in NDW, word count, and word length amongst the synthetic generation methods. Based on the results, future work should plan to create a larger dataset, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of the synthetic transcripts.
comment: 19 pages, 1 figure, 7 tables
♻ ☆ SafEDMD: A Koopman-based data-driven controller design framework for nonlinear dynamical systems
The Koopman operator serves as the theoretical backbone for machine learning of dynamical control systems, where the operator is heuristically approximated by extended dynamic mode decomposition (EDMD). In this paper, we propose SafEDMD, a novel stability- and feedback-oriented EDMD-based controller design framework. Our approach leverages a reliable surrogate model generated in a data-driven fashion in order to provide closed-loop guarantees. In particular, we establish a controller design based on semi-definite programming with guaranteed stabilization of the underlying nonlinear system. As central ingredient, we derive proportional error bounds that vanish at the origin and are tailored to control tasks. We illustrate the developed method by means of several benchmark examples and highlight the advantages over state-of-the-art methods.
comment: Accepted for publication in Automatica
♻ ☆ Mind the Gap: Removing the Discretization Gap in Differentiable Logic Gate Networks NeurIPS 2025
Modern neural networks demonstrate state-of-the-art performance on numerous existing benchmarks; however, their high computational requirements and energy consumption prompt researchers to seek more efficient solutions for real-world deployment. Logic gate networks (LGNs) learns a large network of logic gates for efficient image classification. However, learning a network that can solve a simple problem like CIFAR-10 can take days to weeks to train. Even then, almost half of the network remains unused, causing a discretization gap. This discretization gap hinders real-world deployment of LGNs, as the performance drop between training and inference negatively impacts accuracy. We inject Gumbel noise with a straight-through estimator during training to significantly speed up training, improve neuron utilization, and decrease the discretization gap. We theoretically show that this results from implicit Hessian regularization, which improves the convergence properties of LGNs. We train networks $4.5 \times$ faster in wall-clock time, reduce the discretization gap by $98\%$, and reduce the number of unused gates by $100\%$.
comment: Accepted to NeurIPS 2025 (main track)
♻ ☆ Elementary, My Dear Watson: Non-Invasive Neural Keyword Spotting in the LibriBrain Dataset
Non-invasive brain-computer interfaces (BCIs) are beginning to benefit from large, public benchmarks. However, current benchmarks target relatively simple, foundational tasks like Speech Detection and Phoneme Classification, while application-ready results on tasks like Brain-to-Text remain elusive. We propose Keyword Spotting (KWS) as a practically applicable, privacy-aware intermediate task. Using the deep 52-hour, within-subject LibriBrain corpus, we provide standardized train/validation/test splits for reproducible benchmarking, and adopt an evaluation protocol tailored to extreme class imbalance. Concretely, we use area under the precision-recall curve (AUPRC) as a robust evaluation metric, complemented by false alarms per hour (FA/h) at fixed recall to capture user-facing trade-offs. To simplify deployment and further experimentation within the research community, we are releasing an updated version of the pnpl library with word-level dataloaders and Colab-ready tutorials. As an initial reference model, we present a compact 1-D Conv/ResNet baseline with focal loss and top-k pooling that is trainable on a single consumer-class GPU. The reference model achieves approximately 13x the permutation baseline AUPRC on held-out sessions, demonstrating the viability of the task. Exploratory analyses reveal: (i) predictable within-subject scaling - performance improves log-linearly with more training hours - and (ii) the existence of word-level factors (frequency and duration) that systematically modulate detectability.
comment: 16 pages, 7 figures, 6 tables; updated acknowledgments
♻ ☆ AIMeter: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads
The rapid advancement of AI, particularly large language models (LLMs), has raised significant concerns about the energy use and carbon emissions associated with model training and inference. However, existing tools for measuring and reporting such impacts are often fragmented, lacking systematic metric integration and offering limited support for correlation analysis among them. This paper presents AIMeter, a comprehensive software toolkit for the measurement, analysis, and visualization of energy use, power draw, hardware performance, and carbon emissions across AI workloads. By seamlessly integrating with existing AI frameworks, AIMeter offers standardized reports and exports fine-grained time-series data to support benchmarking and reproducibility in a lightweight manner. It further enables in-depth correlation analysis between hardware metrics and model performance and thus facilitates bottleneck identification and performance enhancement. By addressing critical limitations in existing tools, AIMeter encourages the research community to weigh environmental impact alongside raw performance of AI workloads and advances the shift toward more sustainable "Green AI" practices. The code is available at https://github.com/SusCom-Lab/AIMeter.
comment: 11 pages, 7 figures and 5 tables
♻ ☆ Stability and Sharper Risk Bounds with Convergence Rate $\tilde{O}(1/n^2)$
Prior work (Klochkov $\&$ Zhivotovskiy, 2021) establishes at most $O\left(\log (n)/n\right)$ excess risk bounds via algorithmic stability for strongly-convex learners with high probability. We show that under the similar common assumptions -- - Polyak-Lojasiewicz condition, smoothness, and Lipschitz continous for losses -- - rates of $O\left(\log^2(n)/n^2\right)$ are at most achievable. To our knowledge, our analysis also provides the tightest high-probability bounds for gradient-based generalization gaps in nonconvex settings.
♻ ☆ Hysteresis Activation Function for Efficient Inference NeurIPS
The widely used ReLU is favored for its hardware efficiency, {as the implementation at inference is a one bit sign case,} yet suffers from issues such as the ``dying ReLU'' problem, where during training, neurons fail to activate and constantly remain at zero, as highlighted by Lu et al. Traditional approaches to mitigate this issue often introduce more complex and less hardware-friendly activation functions. In this work, we propose a Hysteresis Rectified Linear Unit (HeLU), an efficient activation function designed to address the ``dying ReLU'' problem with minimal complexity. Unlike traditional activation functions with fixed thresholds for training and inference, HeLU employs a variable threshold that refines the backpropagation. This refined mechanism allows simpler activation functions to achieve competitive performance comparable to their more complex counterparts without introducing unnecessary complexity or requiring inductive biases. Empirical evaluations demonstrate that HeLU enhances model generalization across diverse datasets, offering a promising solution for efficient and effective inference suitable for a wide range of neural network architectures.
comment: Accepted to 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024)
♻ ☆ GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution-its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage-while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.
comment: Project Page: https://jingw193.github.io/GRPO-Guard/
♻ ☆ Oryx: a Scalable Sequence Model for Many-Agent Coordination in Offline MARL NeurIPS 2025
A key challenge in offline multi-agent reinforcement learning (MARL) is achieving effective many-agent multi-step coordination in complex environments. In this work, we propose Oryx, a novel algorithm for offline cooperative MARL to directly address this challenge. Oryx adapts the recently proposed retention-based architecture Sable and combines it with a sequential form of implicit constraint Q-learning (ICQ), to develop a novel offline autoregressive policy update scheme. This allows Oryx to solve complex coordination challenges while maintaining temporal coherence over long trajectories. We evaluate Oryx across a diverse set of benchmarks from prior works -- SMAC, RWARE, and Multi-Agent MuJoCo -- covering tasks of both discrete and continuous control, varying in scale and difficulty. Oryx achieves state-of-the-art performance on more than 80% of the 65 tested datasets, outperforming prior offline MARL methods and demonstrating robust generalisation across domains with many agents and long horizons. Finally, we introduce new datasets to push the limits of many-agent coordination in offline MARL, and demonstrate Oryx's superior ability to scale effectively in such settings.
comment: Published at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Neurosymbolic Diffusion Models NeurIPS 2025
Neurosymbolic (NeSy) predictors combine neural perception with symbolic reasoning to solve tasks like visual reasoning. However, standard NeSy predictors assume conditional independence between the symbols they extract, thus limiting their ability to model interactions and uncertainty - often leading to overconfident predictions and poor out-of-distribution generalisation. To overcome the limitations of the independence assumption, we introduce neurosymbolic diffusion models (NeSyDMs), a new class of NeSy predictors that use discrete diffusion to model dependencies between symbols. Our approach reuses the independence assumption from NeSy predictors at each step of the diffusion process, enabling scalable learning while capturing symbol dependencies and uncertainty quantification. Across both synthetic and real-world benchmarks - including high-dimensional visual path planning and rule-based autonomous driving - NeSyDMs achieve state-of-the-art accuracy among NeSy predictors and demonstrate strong calibration.
comment: Accepted to NeurIPS 2025
♻ ☆ AnomalyMatch: Discovering Rare Objects of Interest with Semi-supervised and Active Learning
Anomaly detection in large datasets is essential in astronomy and computer vision. However, due to a scarcity of labelled data, it is often infeasible to apply supervised methods to anomaly detection. We present AnomalyMatch, an anomaly detection framework combining the semi-supervised FixMatch algorithm using EfficientNet classifiers with active learning. AnomalyMatch is tailored for large-scale applications and integrated into the ESA Datalabs science platform. In this method, we treat anomaly detection as a binary classification problem and efficiently utilise limited labelled and abundant unlabelled images for training. We enable active learning via a user interface for verification of high-confidence anomalies and correction of false positives. Evaluations on the GalaxyMNIST astronomical dataset and the miniImageNet natural-image benchmark under severe class imbalance display strong performance. Starting from five to ten labelled anomalies, we achieve an average AUROC of 0.96 (miniImageNet) and 0.89 (GalaxyMNIST), with respective AUPRC of 0.82 and 0.77. After three active learning cycles, anomalies are ranked with 76% (miniImageNet) to 94% (GalaxyMNIST) precision in the top 1% of the highest-ranking images by score. We compare to the established Astronomaly software on selected 'odd' galaxies from the 'Galaxy Zoo - The Galaxy Challenge' dataset, achieving comparable performance with an average AUROC of 0.83. Our results underscore the exceptional utility and scalability of this approach for anomaly discovery, highlighting the value of specialised approaches for domains characterised by severe label scarcity.
comment: Journal submission in preparation to RASTI; 15 pages; 12 figures
♻ ☆ Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data NeurIPS'25
Fine-tuning large language models (LLMs) using diverse datasets is crucial for enhancing their overall performance across various domains. In practical scenarios, existing methods based on modeling the mixture proportions of data composition often struggle with data whose domain labels are missing, imprecise or non-normalized, while methods based on data selection usually encounter difficulties in balancing multi-domain performance. To address these challenges, in this work, we investigate the role of data diversity in enhancing the overall abilities of LLMs by empirically constructing contrastive data pools and theoretically deriving explanations. Building upon the insights gained, we propose a new method that gives the LLM a dual identity: an output model to cognitively probe and select data based on diversity reward, as well as an input model to be tuned with the selected data. Extensive experiments show that the proposed method notably boosts performance across domain-undetermined data and a series of foundational downstream tasks when applied to various advanced LLMs. We release our code and hope this study can shed light on the understanding of data diversity and advance feedback-driven data-model co-design for LLMs.
comment: Accepted by NeurIPS'25 main track. 47 pages, 21 figures, 32 tables
♻ ☆ Truncated Kernel Stochastic Gradient Descent with General Losses and Spherical Radial Basis Functions
In this paper, we propose a novel kernel stochastic gradient descent (SGD) algorithm for large-scale supervised learning with general losses. Compared to traditional kernel SGD, our algorithm improves efficiency and scalability through an innovative regularization strategy. By leveraging the infinite series expansion of spherical radial basis functions, this strategy projects the stochastic gradient onto a finite-dimensional hypothesis space, which is adaptively scaled according to the bias-variance trade-off, thereby enhancing generalization performance. Based on a new estimation of the spectral structure of the kernel-induced covariance operator, we develop an analytical framework that unifies optimization and generalization analyses. We prove that both the last iterate and the suffix average converge at minimax-optimal rates, and we further establish optimal strong convergence in the reproducing kernel Hilbert space. Our framework accommodates a broad class of classical loss functions, including least-squares, Huber, and logistic losses. Moreover, the proposed algorithm significantly reduces computational complexity and achieves optimal storage complexity by incorporating coordinate-wise updates from linear SGD, thereby avoiding the costly pairwise operations typical of kernel SGD and enabling efficient processing of streaming data. Finally, extensive numerical experiments demonstrate the efficiency of our approach.
comment: 54 pages, 20 figures
♻ ☆ Experiments with Optimal Model Trees
Model trees provide an appealing way to perform interpretable machine learning for both classification and regression problems. In contrast to ``classic'' decision trees with constant values in their leaves, model trees can use linear combinations of predictor variables in their leaf nodes to form predictions, which can help achieve higher accuracy and smaller trees. Typical algorithms for learning model trees from training data work in a greedy fashion, growing the tree in a top-down manner by recursively splitting the data into smaller and smaller subsets. Crucially, the selected splits are only locally optimal, potentially rendering the tree overly complex and less accurate than a tree whose structure is globally optimal for the training data. In this paper, we empirically investigate the effect of constructing globally optimal model trees for classification and regression with linear support vector machines at the leaf nodes. To this end, we present mixed-integer linear programming formulations to learn optimal trees, compute such trees for a large collection of benchmark data sets, and compare their performance against greedily grown model trees in terms of interpretability and accuracy. We also compare to classic optimal and greedily grown decision trees, random forests, and support vector machines. Our results show that optimal model trees can achieve competitive accuracy with very small trees. We also investigate the effect on the accuracy of replacing axis-parallel splits with multivariate ones, foregoing interpretability while potentially obtaining greater accuracy.
♻ ☆ Decoding for Punctured Convolutional and Turbo Codes: A Deep Learning Solution for Protocols Compliance
Neural network-based decoding methods show promise in enhancing error correction performance but face challenges with punctured codes. In particular, existing methods struggle to adapt to variable code rates or meet protocol compatibility requirements. This paper proposes a unified long short-term memory (LSTM)-based neural decoder for punctured convolutional and Turbo codes to address these challenges. The key component of the proposed LSTM-based neural decoder is puncturing-aware embedding, which integrates puncturing patterns directly into the neural network to enable seamless adaptation to different code rates. Moreover, a balanced bit error rate training strategy is designed to ensure the decoder's robustness across various code lengths, rates, and channels. In this way, the protocol compatibility requirement can be realized. Extensive simulations in both additive white Gaussian noise (AWGN) and Rayleigh fading channels demonstrate that the proposed neural decoder outperforms conventional decoding techniques, offering significant improvements in decoding accuracy and robustness.
♻ ☆ Beyond likelihood ratio bias: Nested multi-time-scale stochastic approximation for likelihood-free parameter estimation
We study parameter inference in simulation-based stochastic models where the analytical form of the likelihood is unknown. The main difficulty is that score evaluation as a ratio of noisy Monte Carlo estimators induces bias and instability, which we overcome with a ratio-free nested multi-time-scale (NMTS) stochastic approximation (SA) method that simultaneously tracks the score and drives the parameter update. We provide a comprehensive theoretical analysis of the proposed NMTS algorithm for solving likelihood-free inference problems, including strong convergence, asymptotic normality, and convergence rates. We show that our algorithm can eliminate the original asymptotic bias $O\big(\sqrt{\frac{1}{N}}\big)$ and accelerate the convergence rate from $O\big(\beta_k+\sqrt{\frac{1}{N}}\big)$ to $O\big(\frac{\beta_k}{\alpha_k}+\sqrt{\frac{\alpha_k}{N}}\big)$, where $N$ is the fixed batch size, $\alpha_k$ and $\beta_k$ are decreasing step sizes with $\alpha_k$, $\beta_k$, $\beta_k/\alpha_k\rightarrow 0$. With proper choice of $\alpha_k$ and $\beta_k$, our convergence rates can match the optimal rate in the multi-time-scale SA literature. Numerical experiments demonstrate that our algorithm can improve the estimation accuracy by one to two orders of magnitude at the same computational cost, making it efficient for parameter estimation in stochastic systems.
♻ ☆ Chaos-based reinforcement learning with TD3
Chaos-based reinforcement learning (CBRL) is a method in which the agent's internal chaotic dynamics drives exploration. However, the learning algorithms in CBRL have not been thoroughly developed in previous studies, nor have they incorporated recent advances in reinforcement learning. This study introduced Twin Delayed Deep Deterministic Policy Gradients (TD3), which is one of the state-of-the-art deep reinforcement learning algorithms that can treat deterministic and continuous action spaces, to CBRL. The validation results provide several insights. First, TD3 works as a learning algorithm for CBRL in a simple goal-reaching task. Second, CBRL agents with TD3 can autonomously suppress their exploratory behavior as learning progresses and resume exploration when the environment changes. Finally, examining the effect of the agent's chaoticity on learning shows that there exists a suitable range of chaos strength in the agent's model to flexibly switch between exploration and exploitation and adapt to environmental changes.
comment: Accepted for publication in Neural Networks
♻ ☆ Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training NeurIPS 2025
As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets-such as cosine learning rate schedules-are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the "river" structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this behavior, we conduct a theoretical and empirical analysis of SF dynamics, revealing that it implicitly performs weight averaging without memory overhead. Guided by this analysis, we propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes, addressing key limitations of the original method. Together, these results establish SF as a practical, scalable, and theoretically grounded approach for language model training.
comment: Published at NeurIPS 2025
♻ ☆ VeriLLM: A Lightweight Framework for Publicly Verifiable Decentralized Inference
Decentralized inference provides a scalable and resilient paradigm for serving large language models (LLMs), enabling distributed resource utilization and reducing reliance on centralized providers. However, in a permissionless environment without trusted nodes, ensuring the correctness of model outputs remains a core challenge. We introduce VeriLLM, a publicly verifiable protocol for decentralized LLM inference that achieves security under a one-honest-verifier assumption while maintaining practical efficiency. VeriLLM combines lightweight empirical rerunning with cryptographic commitments, allowing verifiers to validate results at approximately 1% of the underlying inference cost. To prevent verification bottlenecks, we design an isomorphic inference-verification architecture that multiplexes both inference and verification roles across the same GPU workers. This design (i) improves GPU utilization and overall throughput, (ii) enlarges the effective validator set, enhancing robustness and liveness, and (iii) enforces task indistinguishability to prevent node-specific optimizations or selective behavior. Through theoretical analysis and system-level evaluation, we show that VeriLLM achieves reliable public verifiability with minimal overhead, offering a practical foundation for trustworthy and scalable decentralized LLM inference.
comment: 20 pages, 4 figures, 6 tables
♻ ☆ A Convexity-dependent Two-Phase Training Algorithm for Deep Neural Networks
The key task of machine learning is to minimize the loss function that measures the model fit to the training data. The numerical methods to do this efficiently depend on the properties of the loss function. The most decisive among these properties is the convexity or non-convexity of the loss function. The fact that the loss function can have, and frequently has, non-convex regions has led to a widespread commitment to non-convex methods such as Adam. However, a local minimum implies that, in some environment around it, the function is convex. In this environment, second-order minimizing methods such as the Conjugate Gradient (CG) give a guaranteed superlinear convergence. We propose a novel framework grounded in the hypothesis that loss functions in real-world tasks swap from initial non-convexity to convexity towards the optimum. This is a property we leverage to design an innovative two-phase optimization algorithm. The presented algorithm detects the swap point by observing the gradient norm dependence on the loss. In these regions, non-convex (Adam) and convex (CG) algorithms are used, respectively. Computing experiments confirm the hypothesis that this simple convexity structure is frequent enough to be practically exploited to substantially improve convergence and accuracy.
comment: Appeared on KDIR IC3K Conference 2025 (Best Paper Award). Published in "Proceedings of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1"
♻ ☆ MTL-KD: Multi-Task Learning Via Knowledge Distillation for Generalizable Neural Vehicle Routing Solver NeurIPS 2025
Multi-Task Learning (MTL) in Neural Combinatorial Optimization (NCO) is a promising approach to train a unified model capable of solving multiple Vehicle Routing Problem (VRP) variants. However, existing Reinforcement Learning (RL)-based multi-task methods can only train light decoder models on small-scale problems, exhibiting limited generalization ability when solving large-scale problems. To overcome this limitation, this work introduces a novel multi-task learning method driven by knowledge distillation (MTL-KD), which enables the efficient training of heavy decoder models with strong generalization ability. The proposed MTL-KD method transfers policy knowledge from multiple distinct RL-based single-task models to a single heavy decoder model, facilitating label-free training and effectively improving the model's generalization ability across diverse tasks. In addition, we introduce a flexible inference strategy termed Random Reordering Re-Construction (R3C), which is specifically adapted for diverse VRP tasks and further boosts the performance of the multi-task model. Experimental results on 6 seen and 10 unseen VRP variants with up to 1000 nodes indicate that our proposed method consistently achieves superior performance on both uniform and real-world benchmarks, demonstrating robust generalization abilities.
comment: Accepted at NeurIPS 2025
♻ ☆ Unified Error Correction Code Transformer with Low Complexity
Channel coding is vital for reliable sixth-generation (6G) data transmission, employing diverse error correction codes for various application scenarios. Traditional decoders require dedicated hardware for each code, leading to high hardware costs. Recently, artificial intelligence (AI)-driven approaches, such as the error correction code Transformer (ECCT) and its enhanced version, the foundation error correction code Transformer (FECCT), have been proposed to reduce the hardware cost by leveraging the Transformer to decode multiple codes. However, their excessively high computational complexity of $\mathcal{O}(N^2)$ due to the self-attention mechanism in the Transformer limits scalability, where $N$ represents the sequence length. To reduce computational complexity, we propose a unified Transformer-based decoder that handles multiple linear block codes within a single framework. Specifically, a standardized unit is employed to align code length and code rate across different code types, while a redesigned low-rank unified attention module, with computational complexity of $\mathcal{O}(N)$, is shared across various heads in the Transformer. Additionally, a sparse mask, derived from the parity-check matrix's sparsity, is introduced to enhance the decoder's ability to capture inherent constraints between information and parity-check bits, improving decoding accuracy and further reducing computational complexity by $86\%$. Extensive experimental results demonstrate that the proposed unified Transformer-based decoder outperforms existing methods and provides a high-performance, low-complexity solution for next-generation wireless communication systems.
♻ ☆ Optimal Online Change Detection via Random Fourier Features NeurIPS 2025
This article studies the problem of online non-parametric change point detection in multivariate data streams. We approach the problem through the lens of kernel-based two-sample testing and introduce a sequential testing procedure based on random Fourier features, running with logarithmic time complexity per observation and with overall logarithmic space complexity. The algorithm has two advantages compared to the state of the art. First, our approach is genuinely online, and no access to training data known to be from the pre-change distribution is necessary. Second, the algorithm does not require the user to specify a window parameter over which local tests are to be calculated. We prove strong theoretical guarantees on the algorithm's performance, including information-theoretic bounds demonstrating that the detection delay is optimal in the minimax sense. Numerical studies on real and synthetic data show that our algorithm is competitive with respect to the state of the art.
comment: Accepted for publication at NeurIPS 2025
♻ ☆ Rethinking Neural Combinatorial Optimization for Vehicle Routing Problems with Different Constraint Tightness Degrees NeurIPS 2025
Recent neural combinatorial optimization (NCO) methods have shown promising problem-solving ability without requiring domain-specific expertise. Most existing NCO methods use training and testing data with a fixed constraint value and lack research on the effect of constraint tightness on the performance of NCO methods. This paper takes the capacity-constrained vehicle routing problem (CVRP) as an example to empirically analyze the NCO performance under different tightness degrees of the capacity constraint. Our analysis reveals that existing NCO methods overfit the capacity constraint, and they can only perform satisfactorily on a small range of the constraint values but poorly on other values. To tackle this drawback of existing NCO methods, we develop an efficient training scheme that explicitly considers varying degrees of constraint tightness and proposes a multi-expert module to learn a generally adaptable solving strategy. Experimental results show that the proposed method can effectively overcome the overfitting issue, demonstrating superior performances on the CVRP and CVRP with time windows (CVRPTW) with various constraint tightness degrees.
comment: Accepted at NeurIPS 2025
♻ ☆ SPARKE: Scalable Prompt-Aware Diversity and Novelty Guidance in Diffusion Models via RKE Score
Diffusion models have demonstrated remarkable success in high-fidelity image synthesis and prompt-guided generative modeling. However, ensuring adequate diversity in generated samples of prompt-guided diffusion models remains a challenge, particularly when the prompts span a broad semantic spectrum and the diversity of generated data needs to be evaluated in a prompt-aware fashion across semantically similar prompts. Recent methods have introduced guidance via diversity measures to encourage more varied generations. In this work, we extend the diversity measure-based approaches by proposing the Scalable Prompt-Aware R\'eny Kernel Entropy Diversity Guidance (SPARKE) method for prompt-aware diversity guidance. SPARKE utilizes conditional entropy for diversity guidance, which dynamically conditions diversity measurement on similar prompts and enables prompt-aware diversity control. While the entropy-based guidance approach enhances prompt-aware diversity, its reliance on the matrix-based entropy scores poses computational challenges in large-scale generation settings. To address this, we focus on the special case of Conditional latent RKE Score Guidance, reducing entropy computation and gradient-based optimization complexity from the $O(n^3)$ of general entropy measures to $O(n)$. The reduced computational complexity allows for diversity-guided sampling over potentially thousands of generation rounds on different prompts. We numerically test the SPARKE method on several text-to-image diffusion models, demonstrating that the proposed method improves the prompt-aware diversity of the generated data without incurring significant computational costs. We release our code on the project page: https://mjalali.github.io/SPARKE
Efficient Regression-Based Training of Normalizing Flows for Boltzmann Generators ICML
Simulation-free training frameworks have been at the forefront of the generative modelling revolution in continuous spaces, leading to large-scale diffusion and flow matching models. However, such modern generative models suffer from expensive inference, inhibiting their use in numerous scientific applications like Boltzmann Generators (BGs) for molecular conformations that require fast likelihood evaluation. In this paper, we revisit classical normalizing flows in the context of BGs that offer efficient sampling and likelihoods, but whose training via maximum likelihood is often unstable and computationally challenging. We propose Regression Training of Normalizing Flows (RegFlow), a novel and scalable regression-based training objective that bypasses the numerical instability and computational challenge of conventional maximum likelihood training in favour of a simple $\ell_2$-regression objective. Specifically, RegFlow maps prior samples under our flow to targets computed using optimal transport couplings or a pre-trained continuous normalizing flow (CNF). To enhance numerical stability, RegFlow employs effective regularization strategies such as a new forward-backward self-consistency loss that enjoys painless implementation. Empirically, we demonstrate that RegFlow unlocks a broader class of architectures that were previously intractable to train for BGs with maximum likelihood. We also show RegFlow exceeds the performance, computational cost, and stability of maximum likelihood training in equilibrium sampling in Cartesian coordinates of alanine dipeptide, tripeptide, and tetrapeptide, showcasing its potential in molecular systems.
comment: Preprint; ICML GenBio Best Paper Award 2025
♻ ☆ Learning to Insert for Constructive Neural Vehicle Routing Solver NeurIPS 2025
Neural Combinatorial Optimisation (NCO) is a promising learning-based approach for solving Vehicle Routing Problems (VRPs) without extensive manual design. While existing constructive NCO methods typically follow an appending-based paradigm that sequentially adds unvisited nodes to partial solutions, this rigid approach often leads to suboptimal results. To overcome this limitation, we explore the idea of insertion-based paradigm and propose Learning to Construct with Insertion-based Paradigm (L2C-Insert), a novel learning-based method for constructive NCO. Unlike traditional approaches, L2C-Insert builds solutions by strategically inserting unvisited nodes at any valid position in the current partial solution, which can significantly enhance the flexibility and solution quality. The proposed framework introduces three key components: a novel model architecture for precise insertion position prediction, an efficient training scheme for model optimization, and an advanced inference technique that fully exploits the insertion paradigm's flexibility. Extensive experiments on both synthetic and real-world instances of the Travelling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) demonstrate that L2C-Insert consistently achieves superior performance across various problem sizes.
comment: Accepted at NeurIPS 2025
Self-Evolving Curriculum for LLM Reasoning
Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains: planning, inductive reasoning, and mathematics, our experiments demonstrate that SEC significantly improves models' reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.
♻ ☆ Improving Generalization of Neural Combinatorial Optimization for Vehicle Routing Problems via Test-Time Projection Learning
Neural Combinatorial Optimization (NCO) has emerged as a promising learning-based paradigm for addressing Vehicle Routing Problems (VRPs) by minimizing the need for extensive manual engineering. While existing NCO methods, trained on small-scale instances (e.g., 100 nodes), have demonstrated considerable success on problems of similar scale, their performance significantly degrades when applied to large-scale scenarios. This degradation arises from the distributional shift between training and testing data, rendering policies learned on small instances ineffective for larger problems. To overcome this limitation, we introduce a novel learning framework driven by Large Language Models (LLMs). This framework learns a projection between the training and testing distributions, which is then deployed to enhance the scalability of the NCO model. Notably, unlike prevailing techniques that necessitate joint training with the neural network, our approach operates exclusively during the inference phase, obviating the need for model retraining. Extensive experiments demonstrate that our method enables a backbone model (trained on 100-node instances) to achieve superior performance on large-scale Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) of up to 100K nodes from diverse distributions.
comment: arXiv admin note: text overlap with arXiv:2505.24627
♻ ☆ Rough Path Signatures: Learning Neural RDEs for Portfolio Optimization
We tackle high-dimensional, path-dependent valuation and control and introduce a deep BSDE/2BSDE solver that couples truncated log-signatures with a neural rough differential equation (RDE) backbone. The architecture aligns stochastic analysis with sequence-to-path learning: a CVaR-tilted terminal objective targets left-tail risk, while an optional second-order (2BSDE) head supplies curvature estimates for risk-sensitive control. Under matched compute and parameter budgets, the method improves accuracy, tail fidelity, and training stability across Asian and barrier option pricing and portfolio control: at d=200 it achieves CVaR(0.99)=9.80% versus 12.00-13.10% for strong baselines, attains the lowest HJB residual (0.011), and yields the lowest RMSEs for Z and Gamma. Ablations over truncation depth, local windows, and tilt parameters confirm complementary gains from the sequence-to-path representation and the 2BSDE head. Taken together, the results highlight a bidirectional dialogue between stochastic analysis and modern deep learning: stochastic tools inform representations and objectives, while sequence-to-path models expand the class of solvable financial models at scale.
comment: Code available at: https://github.com/AliAtiah/SigRDE
♻ ☆ FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs EMNLP
The accurate trust assessment of multimodal large language models (MLLMs) generated predictions, which can enable selective prediction and improve user confidence, is challenging due to the diverse multi-modal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs, that generates an uncertainty measure based on the equivalent and complementary input samplings. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA uses only input-output access of the model (black-box), and does not require ground truth (unsupervised). The experiments are conducted with various off-the-shelf multi-modal LLMs, on both visual and audio reasoning tasks. The proposed FESTA uncertainty estimate achieves significant improvement (33.3% relative improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) in selective prediction performance, based on area-under-receiver-operating-characteristic curve (AUROC) metric in detecting mispredictions. The code implementation is open-sourced.
comment: Accepted in the Findings of EMNLP, 2025
♻ ☆ Shilling Recommender Systems by Generating Side-feature-aware Fake User Profiles
Recommender systems (RS) greatly influence users' consumption decisions, making them attractive targets for malicious shilling attacks that inject fake user profiles to manipulate recommendations. Existing shilling methods can generate effective and stealthy fake profiles when training data only contain rating matrix, but they lack comprehensive solutions for scenarios where side features are present and utilized by the recommender. To address this gap, we extend the Leg-UP framework by enhancing the generator architecture to incorporate side features, enabling the generation of side-feature-aware fake user profiles. Experiments on benchmarks show that our method achieves strong attack performance while maintaining stealthiness.
♻ ☆ Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.
♻ ☆ Improving the Euclidean Diffusion Generation of Manifold Data by Mitigating Score Function Singularity
Euclidean diffusion models have achieved remarkable success in generative modeling across diverse domains, and they have been extended to manifold cases in recent advances. Instead of explicitly utilizing the structure of special manifolds as studied in previous works, in this paper we investigate direct sampling of the Euclidean diffusion models for general manifold-structured data. We reveal the multiscale singularity of the score function in the ambient space, which hinders the accuracy of diffusion-generated samples. We then present an elaborate theoretical analysis of the singularity structure of the score function by decomposing it along the tangential and normal directions of the manifold. To mitigate the singularity and improve the sampling accuracy, we propose two novel methods: (1) Niso-DM, which reduces the scale discrepancies in the score function by utilizing a non-isotropic noise, and (2) Tango-DM, which trains only the tangential component of the score function using a tangential-only loss function. Numerical experiments demonstrate that our methods achieve superior performance on distributions over various manifolds with complex geometries.
♻ ☆ Send Less, Save More: Energy-Efficiency Benchmark of Embedded CNN Inference vs. Data Transmission in IoT
The integration of the Internet of Things (IoT) and Artificial Intelligence offers significant opportunities to enhance our ability to monitor and address ecological changes. As environmental challenges become increasingly pressing, the need for effective remote monitoring solutions is more critical than ever. A major challenge in designing IoT applications for environmental monitoring - particularly those involving image data - is to create energy-efficient IoT devices capable of long-term operation in remote areas with limited power availability. Advancements in the field of Tiny Machine Learning allow the use of Convolutional Neural Networks (CNNs) on resource-constrained, battery-operated microcontrollers. Since data transfer is energy-intensive, performing inference directly on microcontrollers to reduce the message size can extend the operational lifespan of IoT nodes. This work evaluates the use of common Low Power Wide Area Networks and compressed CNNs trained on domain specific datasets on an ESP32-S3. Our experiments demonstrate, among other things, that executing CNN inference on-device and transmitting only the results reduces the overall energy consumption by a factor of up to five compared to sending raw image data. These findings advocate the development of IoT applications with reduced carbon footprint and capable of operating autonomously in environmental monitoring scenarios by incorporating EmbeddedML.
comment: 11 Pages, Paper lists the categories for the ACM Computing Classification System
♻ ☆ AANet: Virtual Screening under Structural Uncertainty via Alignment and Aggregation NeurIPS 2025
Virtual screening (VS) is a critical component of modern drug discovery, yet most existing methods--whether physics-based or deep learning-based--are developed around holo protein structures with known ligand-bound pockets. Consequently, their performance degrades significantly on apo or predicted structures such as those from AlphaFold2, which are more representative of real-world early-stage drug discovery, where pocket information is often missing. In this paper, we introduce an alignment-and-aggregation framework to enable accurate virtual screening under structural uncertainty. Our method comprises two core components: (1) a tri-modal contrastive learning module that aligns representations of the ligand, the holo pocket, and cavities detected from structures, thereby enhancing robustness to pocket localization error; and (2) a cross-attention based adapter for dynamically aggregating candidate binding sites, enabling the model to learn from activity data even without precise pocket annotations. We evaluated our method on a newly curated benchmark of apo structures, where it significantly outperforms state-of-the-art methods in blind apo setting, improving the early enrichment factor (EF1%) from 11.75 to 37.19. Notably, it also maintains strong performance on holo structures. These results demonstrate the promise of our approach in advancing first-in-class drug discovery, particularly in scenarios lacking experimentally resolved protein-ligand complexes. Our implementation is publicly available at https://github.com/Wiley-Z/AANet.
comment: Accepted at NeurIPS 2025
♻ ☆ Parallel Unlearning in Inherited Model Networks
Unlearning is challenging in generic learning frameworks with the continuous growth and updates of models exhibiting complex inheritance relationships. This paper presents a novel unlearning framework that enables fully parallel unlearning among models exhibiting inheritance. We use a chronologically Directed Acyclic Graph (DAG) to capture various unlearning scenarios occurring in model inheritance networks. Central to our framework is the Fisher Inheritance Unlearning (FIUn) method, designed to enable efficient parallel unlearning within the DAG. FIUn utilizes the Fisher Information Matrix (FIM) to assess the significance of model parameters for unlearning tasks and adjusts them accordingly. To handle multiple unlearning requests simultaneously, we propose the Merging-FIM (MFIM) function, which consolidates FIMs from multiple upstream models into a unified matrix. This design supports all unlearning scenarios captured by the DAG, enabling one-shot removal of inherited knowledge while significantly reducing computational overhead. Experiments confirm the effectiveness of our unlearning framework. For single-class tasks, it achieves complete unlearning with 0% accuracy for unlearned labels while maintaining 94.53% accuracy for retained labels. For multi-class tasks, the accuracy is 1.07% for unlearned labels and 84.77% for retained labels. Our framework accelerates unlearning by 99% compared to alternative methods. Code is in https://github.com/MJLee00/Parallel-Unlearning-in-Inherited-Model-Networks.
♻ ☆ Similarity-Distance-Magnitude Activations
We introduce the Similarity-Distance-Magnitude (SDM) activation function, a more robust and interpretable formulation of the standard softmax activation function, adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness, and enabling interpretability-by-exemplar via dense matching. We further introduce the SDM estimator, based on a data-driven partitioning of the class-wise empirical CDFs via the SDM activation, to control the class- and prediction-conditional accuracy among selective classifications. When used as the final-layer activation over pre-trained language models for selective classification, the SDM estimator is more robust to co-variate shifts and out-of-distribution inputs than existing calibration methods using softmax activations, while remaining informative over in-distribution data.
comment: 18 pages, 5 tables, 1 algorithm. arXiv admin note: substantial text overlap with arXiv:2502.20167
♻ ☆ A Robust and Non-Iterative Tensor Decomposition Method with Automatic Thresholding
Recent advances in IoT and biometric sensing technologies have led to the generation of massive and high-dimensional tensor data, yet achieving accurate and efficient low-rank approximation remains a major challenge. Existing tensor decomposition methods typically require prior specification of the tensor rank and rely on iterative optimization, which often results in heavy computational costs and dependence on the analyst's expertise. In this study, we propose a novel low-rank approximation method for tensor data that requires neither prior rank specification nor iterative optimization. The proposed method performs statistical singular value hard thresholding on the mode-wise unfolded matrices to automatically extract only statistically significant components, thereby achieving noise reduction while preserving the intrinsic tensor structure. Theoretically, the optimal threshold for each mode is derived based on the asymptotic properties of the Mar\v{c}enko--Pastur distribution. Simulation experiments demonstrate that the proposed method outperforms conventional approaches such as Higher-Order Singular Value Decomposition, Higher-Order Orthogonal Iteration, and Tucker-L2E in terms of both estimation accuracy and computational efficiency. These results indicate that our method provides an effective and theoretically grounded framework for automatic, non-iterative, and analyst-independent tensor decomposition.
comment: 25 pages, 3 figures
♻ ☆ Buffer layers for Test-Time Adaptation NeurIPS 2025
In recent advancements in Test Time Adaptation (TTA), most existing methodologies focus on updating normalization layers to adapt to the test domain. However, the reliance on normalization-based adaptation presents key challenges. First, normalization layers such as Batch Normalization (BN) are highly sensitive to small batch sizes, leading to unstable and inaccurate statistics. Moreover, normalization-based adaptation is inherently constrained by the structure of the pre-trained model, as it relies on training-time statistics that may not generalize well to unseen domains. These issues limit the effectiveness of normalization-based TTA approaches, especially under significant domain shift. In this paper, we introduce a novel paradigm based on the concept of a Buffer layer, which addresses the fundamental limitations of normalization layer updates. Unlike existing methods that modify the core parameters of the model, our approach preserves the integrity of the pre-trained backbone, inherently mitigating the risk of catastrophic forgetting during online adaptation. Through comprehensive experimentation, we demonstrate that our approach not only outperforms traditional methods in mitigating domain shift and enhancing model robustness, but also exhibits strong resilience to forgetting. Furthermore, our Buffer layer is modular and can be seamlessly integrated into nearly all existing TTA frameworks, resulting in consistent performance improvements across various architectures. These findings validate the effectiveness and versatility of the proposed solution in real-world domain adaptation scenarios. The code is available at https://github.com/hyeongyu-kim/Buffer_TTA.
comment: Accepted at NeurIPS 2025
♻ ☆ Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update NeurIPS 2025
We study the generalized linear bandit (GLB) problem, a contextual multi-armed bandit framework that extends the classical linear model by incorporating a non-linear link function, thereby modeling a broad class of reward distributions such as Bernoulli and Poisson. While GLBs are widely applicable to real-world scenarios, their non-linear nature introduces significant challenges in achieving both computational and statistical efficiency. Existing methods typically trade off between two objectives, either incurring high per-round costs for optimal regret guarantees or compromising statistical efficiency to enable constant-time updates. In this paper, we propose a jointly efficient algorithm that attains a nearly optimal regret bound with $\mathcal{O}(1)$ time and space complexities per round. The core of our method is a tight confidence set for the online mirror descent (OMD) estimator, which is derived through a novel analysis that leverages the notion of mix loss from online prediction. The analysis shows that our OMD estimator, even with its one-pass updates, achieves statistical efficiency comparable to maximum likelihood estimation, thereby leading to a jointly efficient optimistic method.
comment: NeurIPS 2025
♻ ☆ Holographic Transformers for Complex-Valued Signal Processing: Integrating Phase Interference into Self-Attention
Complex-valued signals encode both amplitude and phase, yet most deep models treat attention as real-valued correlation, overlooking interference effects. We introduce the Holographic Transformer, a physics-inspired architecture that incorporates wave interference principles into self-attention. Holographic attention modulates interactions by relative phase and coherently superimposes values, ensuring consistency between amplitude and phase. A dual-headed decoder simultaneously reconstructs the input and predicts task outputs, preventing phase collapse when losses prioritize magnitude over phase. We demonstrate that holographic attention implements a discrete interference operator and maintains phase consistency under linear mixing. Experiments on PolSAR image classification and wireless channel prediction show strong performance, achieving high classification accuracy and F1 scores, low regression error, and increased robustness to phase perturbations. These results highlight that enforcing physical consistency in attention leads to generalizable improvements in complex-valued learning and provides a unified, physics-based framework for coherent signal modeling. The code is available at https://github.com/EonHao/Holographic-Transformers.
♻ ☆ Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both the pass@1 and pass@k . Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.
♻ ☆ Accurate predictive model of band gap with selected important features based on explainable machine learning
In the rapidly advancing field of materials informatics, nonlinear machine learning models have demonstrated exceptional predictive capabilities for material properties. However, their black-box nature limits interpretability, and they may incorporate features that do not contribute to, or even deteriorate, model performance. This study employs explainable ML (XML) techniques, including permutation feature importance and the SHapley Additive exPlanation, applied to a pristine support vector regression model designed to predict band gaps at the GW level using 18 input features. Guided by XML-derived individual feature importance, a simple framework is proposed to construct reduced-feature predictive models. Model evaluations indicate that an XML-guided compact model, consisting of the top five features, achieves comparable accuracy to the pristine model on in-domain datasets (0.254 vs. 0.247 eV) while demonstrating superior generalization with lower prediction errors on out-of-domain data (0.461 vs. 0.341 eV). Additionally, the study underscores the necessity for eliminating strongly correlated features (correlation coefficient greater than 0.8) to prevent misinterpretation and overestimation of feature importance before applying XML. This study highlights XML's effectiveness in developing simplified yet highly accurate machine learning models by clarifying feature roles, thereby reducing computational costs for feature acquisition and enhancing model trustworthiness for materials discovery.
comment: 9 pages, 3 figures, SI is included
♻ ☆ Neural Networks for Learnable and Scalable Influence Estimation of Instruction Fine-Tuning Data
Influence functions provide crucial insights into model training, but existing methods suffer from large computational costs and limited generalization. Particularly, recent works have proposed various metrics and algorithms to calculate the influence of data using language models, which do not scale well with large models and datasets. This is because of the expensive forward and backward passes required for computation, substantial memory requirements to store large models, and poor generalization of influence estimates to new data. In this paper, we explore the use of small neural networks -- which we refer to as the InfluenceNetwork -- to estimate influence values, achieving up to 99% cost reduction. Our evaluation demonstrates that influence values can be estimated with models just 0.0027% the size of full language models (we use 7B and 8B versions). We apply our algorithm of estimating influence values (called NN-CIFT: Neural Networks for effiCient Instruction Fine-Tuning) to the downstream task of subset selection for general instruction fine-tuning. In our study, we include four state-of-the-art influence functions and show no compromise in performance, despite large speedups, between NN-CIFT and the original influence functions. We provide an in-depth hyperparameter analyses of NN-CIFT. The code for our method can be found here: https://github.com/agarwalishika/NN-CIFT.
♻ ☆ Localized Kernel Projection Outlyingness: A Two-Stage Approach for Multi-Modal Outlier Detection
This paper presents Two-Stage LKPLO, a novel multi-stage outlier detection framework that overcomes the coexisting limitations of conventional projection-based methods: their reliance on a fixed statistical metric and their assumption of a single data structure. Our framework uniquely synthesizes three key concepts: (1) a generalized loss-based outlyingness measure (PLO) that replaces the fixed metric with flexible, adaptive loss functions like our proposed SVM-like loss; (2) a global kernel PCA stage to linearize non-linear data structures; and (3) a subsequent local clustering stage to handle multi-modal distributions. Comprehensive 5-fold cross-validation experiments on 10 benchmark datasets, with automated hyperparameter optimization, demonstrate that Two-Stage LKPLO achieves state-of-the-art performance. It significantly outperforms strong baselines on datasets with challenging structures where existing methods fail, most notably on multi-cluster data (Optdigits) and complex, high-dimensional data (Arrhythmia). Furthermore, an ablation study empirically confirms that the synergistic combination of both the kernelization and localization stages is indispensable for its superior performance. This work contributes a powerful new tool for a significant class of outlier detection problems and underscores the importance of hybrid, multi-stage architectures.
comment: 10 pages, 4 figures; submitted to The IEICE Transactions on Information and Systems
♻ ☆ MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
comment: Code available at: https://github.com/HKUST-MINSys-Lab/MMEdge. Accepted by SenSys 2026
♻ ☆ Energy Approach from $\varepsilon$-Graph to Continuum Diffusion Model with Connectivity Functional
We derive an energy-based continuum limit for $\varepsilon$-graphs endowed with a general connectivity functional. We prove that the discrete energy and its continuum counterpart differ by at most $O(\varepsilon)$; the prefactor involves only the $W^{1,1}$-norm of the connectivity density as $\varepsilon\to0$, so the error bound remains valid even when that density has strong local fluctuations. As an application, we introduce a neural-network procedure that reconstructs the connectivity density from edge-weight data and then embeds the resulting continuum model into a brain-dynamics framework. In this setting, the usual constant diffusion coefficient is replaced by the spatially varying coefficient produced by the learned density, yielding dynamics that differ significantly from those obtained with conventional constant-diffusion models.
♻ ☆ Model Provenance Testing for Large Language Models
Large language models are increasingly customized through fine-tuning and other adaptations, creating challenges in enforcing licensing terms and managing downstream impacts. Tracking model origins is crucial both for protecting intellectual property and for identifying derived models when biases or vulnerabilities are discovered in foundation models. We address this challenge by developing a framework for testing model provenance: Whether one model is derived from another. Our approach is based on the key observation that real-world model derivations preserve significant similarities in model outputs that can be detected through statistical analysis. Using only black-box access to models, we employ multiple hypothesis testing to compare model similarities against a baseline established by unrelated models. On two comprehensive real-world benchmarks spanning models from 30M to 4B parameters and comprising over 600 models, our tester achieves 90-95% precision and 80-90% recall in identifying derived models. These results demonstrate the viability of systematic provenance verification in production environments even when only API access is available.
♻ ☆ Omni-Mol: Multitask Molecular Model for Any-to-any Modalities NeurIPS 2025
In the molecular domain, numerous studies have explored the use of multimodal large language models (LLMs) to construct a general-purpose, multi-task molecular model. However, these efforts are still far from achieving a truly universal molecular model. We identify three key challenges in this endeavor: (1) Existing molecular task datasets are typically small in scale and lack comprehensive domain coverage. (2) Tasks from different molecular subfields are difficult to effectively learn jointly through LLMs due to significant distributional shifts and competition among tasks, which introduces instability in the learning process. (3) Both inter-task and intra-task molecular representations demand different intrinsic dimensions in the language space, making it challenging to balance between redundancy and insufficiency in language model representations. To address these challenges, we innovatively categorize existing small-molecule tasks into four types: Mol2Mol, Mol2Text, Mol2Num, and Text2Mol. We then collect a dataset encompassing over 16 tasks with more than 1.4 million samples, making it the largest molecular instruction-tuning dataset to date. Leveraging the extensive pretraining of LLMs on existing chemical literature, we propose a novel multimodal LLM framework, named Omni-Mol, which unifies all small-molecule tasks and supports both molecular generation and understanding. The core of Omni-Mol is our proposed MoGE, which dynamically adapts to the intrinsic rank of different tasks. This mixture-of-experts architecture enhances the model's ability to handle diverse tasks and modalities effectively. Our model achieves unified instruction tuning across 16 tasks and attains state-of-the-art performance on 13 of them. Extensive experiments further demonstrate the scalability and versatility of Omni-Mol.
comment: 44 pages, 9 figures, 13 tables, paper accepted by NeurIPS 2025
♻ ☆ Revealing Multimodal Causality with Large Language Models NeurIPS 2025
Uncovering cause-and-effect mechanisms from data is fundamental to scientific progress. While large language models (LLMs) show promise for enhancing causal discovery (CD) from unstructured data, their application to the increasingly prevalent multimodal setting remains a critical challenge. Even with the advent of multimodal LLMs (MLLMs), their efficacy in multimodal CD is hindered by two primary limitations: (1) difficulty in exploring intra- and inter-modal interactions for comprehensive causal variable identification; and (2) insufficiency to handle structural ambiguities with purely observational data. To address these challenges, we propose MLLM-CD, a novel framework for multimodal causal discovery from unstructured data. It consists of three key components: (1) a novel contrastive factor discovery module to identify genuine multimodal factors based on the interactions explored from contrastive sample pairs; (2) a statistical causal structure discovery module to infer causal relationships among discovered factors; and (3) an iterative multimodal counterfactual reasoning module to refine the discovery outcomes iteratively by incorporating the world knowledge and reasoning capabilities of MLLMs. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed MLLM-CD in revealing genuine factors and causal relationships among them from multimodal unstructured data.
comment: Accepted at NeurIPS 2025
♻ ☆ HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location NeurIPS 2025
Large language models (LLMs) have facilitated a wide range of applications with distinct service-level objectives (SLOs), from latency-sensitive online tasks like interactive chatbots to throughput-oriented offline workloads like data synthesis. The existing deployment model, which dedicates machines to each workload, simplifies SLO management but often leads to poor resource utilization. This paper introduces HyGen, an interference-aware LLM serving system that enables efficient co-location of online and offline workloads while preserving SLOs. HyGen incorporates two key innovations: (1) performance control mechanisms, including a latency predictor to estimate batch execution time and an SLO-aware profiler to quantify latency interference, and (2) SLO-aware offline scheduling policies that maximize serving throughput and prevent starvation. Our evaluation on production workloads shows that HyGen achieves up to 3.9-5.8x throughput gains over online and hybrid serving baselines, while ensuring latency SLOs. The code of HyGen is publicly available at https://github.com/UIUC-MLSys/HyGen.
comment: NeurIPS 2025. 21 pages, 17 figures
♻ ☆ Is Grokking a Computational Glass Relaxation?
Understanding neural network's (NN) generalizability remains a central question in deep learning research. The special phenomenon of grokking, where NNs abruptly generalize long after the training performance reaches a near-perfect level, offers a unique window to investigate the underlying mechanisms of NNs' generalizability. Here we propose an interpretation for grokking by framing it as a computational glass relaxation: viewing NNs as a physical system where parameters are the degrees of freedom and train loss is the system energy, we find memorization process resembles a rapid cooling of liquid into non-equilibrium glassy state at low temperature and the later generalization is like a slow relaxation towards a more stable configuration. This mapping enables us to sample NNs' Boltzmann entropy (states of density) landscape as a function of training loss and test accuracy. Our experiments in transformers on arithmetic tasks suggests that there is NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition. We identify a high-entropy advantage under grokking, an extension of prior work linking entropy to generalizability but much more significant. Inspired by grokking's far-from-equilibrium nature, we develop a toy optimizer WanD based on Wang-landau molecular dynamics, which can eliminate grokking without any constraints and find high-norm generalizing solutions. This provides strictly-defined counterexamples to theory attributing grokking solely to weight norm evolution towards the Goldilocks zone and also suggests new potential ways for optimizer design.
♻ ☆ Improving LLM Safety Alignment with Dual-Objective Optimization ICML 2025
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge. This approach significantly increases LLM robustness against a wide range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks across both in-distribution and out-of-distribution scenarios. Furthermore, we introduce a method to emphasize critical refusal tokens by incorporating a reward-based token-level weighting mechanism for refusal learning, which further improves the robustness against adversarial exploits. Our research also suggests that robustness to jailbreak attacks is correlated with token distribution shifts in the training process and internal representations of refusal and harmful tokens, offering valuable directions for future research in LLM safety alignment. The code is available at https://github.com/wicai24/DOOR-Alignment
comment: ICML 2025
♻ ☆ Adaptive EEG-based stroke diagnosis with a GRU-TCN classifier and deep Q-learning thresholding
Rapid triage of suspected stroke needs accurate, bedside-deployable tools; EEG is promising but underused at first contact. We present an adaptive multitask EEG classifier that converts 32-channel signals to power spectral density features (Welch), uses a recurrent-convolutional network (GRU-TCN) to predict stroke type (healthy, ischemic, hemorrhagic), hemispheric lateralization, and severity, and applies a deep Q-network (DQN) to tune decision thresholds in real time. Using a patient-wise split of the UCLH Stroke EIT/EEG data set (44 recordings; about 26 acute stroke, 10 controls), the primary outcome was stroke-type performance; secondary outcomes were severity and lateralization. The baseline GRU-TCN reached 89.3% accuracy (F1 92.8%) for stroke type, about 96.9% (F1 95.9%) for severity, and about 96.7% (F1 97.4%) for lateralization. With DQN threshold adaptation, stroke-type accuracy increased to about 98.0% (F1 97.7%). We also tested robustness on an independent, low-density EEG cohort (ZJU4H) and report paired patient-level statistics. Analyses follow STARD 2015 guidance for diagnostic accuracy studies (index test: GRU-TCN+DQN; reference standard: radiology/clinical diagnosis; patient-wise evaluation). Adaptive thresholding shifts the operating point to clinically preferred sensitivity-specificity trade-offs, while integrated scalp-map and spectral visualizations support interpretability.
comment: 10 pages, 6 figures. Equal contribution: Shakeel Abdulkareem and Bora Yimenicioglu. Compiled with pdfLaTeX (wlscirep class)
♻ ☆ Language Model Preference Evaluation with Multiple Weak Evaluators
Despite the remarkable success of Large Language Models (LLMs), evaluating their outputs' quality regarding preference remains a critical challenge. While existing works usually leverage a strong LLM as the judge for comparing LLMs' response pairwisely, such a single-evaluator approach is vulnerable to cyclic preference, i.e., output A is better than B, B than C, but C is better than A, causing contradictory evaluation results. To address this, we introduce PGED (Preference Graph Ensemble and Denoise), a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs for acyclic, non-contradictory evaluation results. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure. Extensive experiments on ten benchmarks demonstrate PGED 's superiority in three applications: 1) model ranking for evaluation, 2) response selection for test-time scaling, and 3) data selection for model fine-tuning. Notably, PGED combines small LLM evaluators (e.g., Llama3-8B, Mistral-7B, Qwen2-7B) to outperform strong ones (e.g., Qwen2-72B), showcasing its effectiveness in enhancing evaluation reliability and improving model performance.
♻ ☆ DPA: A one-stop metric to measure bias amplification in classification datasets NeurIPS 2025
Most ML datasets today contain biases. When we train models on these datasets, they often not only learn these biases but can worsen them -- a phenomenon known as bias amplification. Several co-occurrence-based metrics have been proposed to measure bias amplification in classification datasets. They measure bias amplification between a protected attribute (e.g., gender) and a task (e.g., cooking). These metrics also support fine-grained bias analysis by identifying the direction in which a model amplifies biases. However, co-occurrence-based metrics have limitations -- some fail to measure bias amplification in balanced datasets, while others fail to measure negative bias amplification. To solve these issues, recent work proposed a predictability-based metric called leakage amplification (LA). However, LA cannot identify the direction in which a model amplifies biases. We propose Directional Predictability Amplification (DPA), a predictability-based metric that is (1) directional, (2) works with balanced and unbalanced datasets, and (3) correctly identifies positive and negative bias amplification. DPA eliminates the need to evaluate models on multiple metrics to verify these three aspects. DPA also improves over prior predictability-based metrics like LA: it is less sensitive to the choice of attacker function (a hyperparameter in predictability-based metrics), reports scores within a bounded range, and accounts for dataset bias by measuring relative changes in predictability. Our experiments on well-known datasets like COMPAS (a tabular dataset), COCO, and ImSitu (image datasets) show that DPA is the most reliable metric to measure bias amplification in classification problems. To compare DPA with existing bias amplification metrics, we released a one-stop library of major bias amplification metrics at https://github.com/kerner-lab/Bias-Amplification.
comment: Accepted at NeurIPS 2025
♻ ☆ Game Theoretic Resilience Recommendation Framework for CyberPhysical Microgrids Using Hypergraph MetaLearning
This paper presents a physics-aware cyberphysical resilience framework for radial microgrids under coordinated cyberattacks. The proposed approach models the attacker through a hypergraph neural network (HGNN) enhanced with model agnostic metalearning (MAML) to rapidly adapt to evolving defense strategies and predict high-impact contingencies. The defender is modeled via a bi-level Stackelberg game, where the upper level selects optimal tie-line switching and distributed energy resource (DER) dispatch using an Alternating Direction Method of Multipliers (ADMM) coordinator embedded within the Non-dominated Sorting Genetic Algorithm II (NSGA-II). The framework simultaneously optimizes load served, operational cost, and voltage stability, ensuring all post-defense states satisfy network physics constraints. The methodology is first validated on the IEEE 69-bus distribution test system with 12 DERs, 8 critical loads, and 5 tie-lines, and then extended to higher bus systems including the IEEE 123-bus feeder and a synthetic 300-bus distribution system. Results show that the proposed defense strategy restores nearly full service for 90% of top-ranked attacks, mitigates voltage violations, and identifies Feeder 2 as the principal vulnerability corridor. Actionable operating rules are derived, recommending pre-arming of specific tie-lines to enhance resilience, while higher bus system studies confirm scalability of the framework on the IEEE 123-bus and 300-bus systems.
♻ ☆ Representative Social Choice: From Learning Theory to AI Alignment NeurIPS 2024
Social choice theory is the study of preference aggregation across a population, used both in mechanism design for human agents and in the democratic alignment of language models. In this study, we propose the representative social choice framework for the modeling of democratic representation in collective decisions, where the number of issues and individuals are too large for mechanisms to consider all preferences directly. These scenarios are widespread in real-world decision-making processes, such as jury trials, legislation, corporate governance, and, more recently, language model alignment. In representative social choice, the population is represented by a finite sample of individual-issue pairs based on which social choice decisions are made. We show that many of the deepest questions in representative social choice can be formulated as statistical learning problems, and prove the generalization properties of social choice mechanisms using the theory of machine learning. We further formulate axioms for representative social choice, and prove Arrow-like impossibility theorems with new combinatorial tools of analysis. Our framework introduces the representative approach to social choice, opening up research directions at the intersection of social choice, learning theory, and AI alignment.
comment: Journal of Artificial Intelligence Research, in press. Best Paper at NeurIPS 2024 Pluralistic Alignment Workshop
♻ ☆ DeepOSets: Non-Autoregressive In-Context Learning with Permutation-Invariance Inductive Bias
In-context learning (ICL) is the remarkable ability displayed by some machine learning models to learn from examples provided in a user prompt without any model parameter updates. ICL was first observed in the domain of large language models, and it has been widely assumed that it is a product of the attention mechanism in autoregressive transformers. In this paper, using stylized regression learning tasks, we demonstrate that ICL can emerge in a non-autoregressive neural architecture with a hard-coded permutation-invariance inductive bias. This novel architecture, called DeepOSets, combines the set learning properties of the DeepSets architecture with the operator learning capabilities of Deep Operator Networks (DeepONets). We provide a representation theorem for permutation-invariant regression learning operators and prove that DeepOSets are universal approximators of this class of operators. We performed comprehensive numerical experiments to evaluate the capabilities of DeepOSets in learning linear, polynomial, and shallow neural network regression, under varying noise levels, dimensionalities, and sample sizes. In the high-dimensional regime, accuracy was enhanced by replacing the DeepSets layer with a Set Transformer. Our results show that DeepOSets deliver accurate and fast results with an order of magnitude fewer parameters than a comparable transformer-based alternative.
comment: Set transformer results in the high-dimensional (d=20) case were added; there is a revised proof of Theorem 1; minor edits were made throughout
♻ ☆ Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging
The "alignment tax" of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident, less reliable, and model outputs less diverse. We show that this trade-off can be navigated effectively via a simple post-hoc intervention: interpolating between a model's weights before and after alignment. Crucially, this is not a strict trade-off. We find that the process consistently reveals Pareto-optimal interpolations - models that improve accuracy beyond both parents while substantially recovering the calibration lost during alignment. Our work demonstrates that simple model merging provides a computationally efficient method for mitigating the full scope of the alignment tax, yielding models that are more capable and more reliable.
♻ ☆ The cell as a token: high-dimensional geometry in language models and cell embeddings
Single-cell sequencing technology maps cells to a high-dimensional space encoding their internal activity. Recently-proposed virtual cell models extend this concept, enriching cells' representations based on patterns learned from pretraining on vast cell atlases. This review explores how advances in understanding the structure of natural language embeddings informs ongoing efforts to analyze single-cell datasets. Both fields process unstructured data by partitioning datasets into tokens embedded within a high-dimensional vector space. We discuss how the context of tokens influences the geometry of embedding space, and how low-dimensional manifolds shape this space's robustness and interpretation. We highlight how new developments in foundation models for language, such as interpretability probes and in-context reasoning, can inform efforts to construct cell atlases and train virtual cell models.
comment: 4 pages, 2 figures
♻ ☆ Why Isn't Relational Learning Taking Over the World? AAAI-2026
Artificial intelligence seems to be taking over the world with systems that model pixels, words, and phonemes. The world is arguably made up, not of pixels, words, and phonemes but of entities (objects, things, including events) with properties and relations among them. Surely we should model these, not the perception or description of them. You might suspect that concentrating on modeling words and pixels is because all of the (valuable) data in the world is in terms of text and images. If you look into almost any company you will find their most valuable data is in spreadsheets, databases and other relational formats. These are not the form that are studied in introductory machine learning, but are full of product numbers, student numbers, transaction numbers and other identifiers that can't be interpreted naively as numbers. The field that studies this sort of data has various names including relational learning, statistical relational AI, and many others. This paper explains why relational learning is not taking over the world -- except in a few cases with restricted relations -- and what needs to be done to bring it to it's rightful prominence.
comment: 10 pages (6 pages + references + appendices). To appear AAAI-2026
♻ ☆ DARTS-GT: Differentiable Architecture Search for Graph Transformers with Quantifiable Instance-Specific Interpretability Analysis
Graph Transformers (GTs) have emerged as powerful architectures for graph-structured data, yet remain constrained by rigid designs and lack quantifiable interpretability. Current state-of-the-art GTs commit to fixed GNN types across all layers, missing potential benefits of depth-specific component selection, while their complex architectures become opaque where performance gains cannot be distinguished between meaningful patterns and spurious correlations. We redesign GT attention through asymmetry, decoupling structural encoding from feature representation: queries derive from node features while keys and values come from GNN transformations. Within this framework, we use Differentiable ARchiTecture Search (DARTS) to select optimal GNN operators at each layer, enabling depth-wise heterogeneity inside transformer attention itself (DARTS-GT). To understand discovered architectures, we develop the first quantitative interpretability framework for GTs through causal ablation. Our metrics (Head-deviation, Specialization, and Focus), identify which heads and nodes drive predictions while enabling model comparison. Experiments across eight benchmarks show DARTS-GT achieves state-of-the-art on four datasets while remaining competitive on others, with discovered architectures revealing dataset-specific patterns. Our interpretability analysis reveals that visual attention salience and causal importance do not always correlate, indicating widely used visualization approaches may miss components that actually matter. Crucially, heterogeneous architectures found by DARTS-GT consistently produced more interpretable models than baselines, establishing that Graph Transformers need not choose between performance and interpretability.
♻ ☆ When In Doubt, Abstain: The Impact of Abstention on Strategic Classification
Algorithmic decision making is increasingly prevalent, but often vulnerable to strategic manipulation by agents seeking a favorable outcome. Prior research has shown that classifier abstention (allowing a classifier to decline making a decision due to insufficient confidence) can significantly increase classifier accuracy. This paper studies abstention within a strategic classification context, exploring how its introduction impacts strategic agents' responses and how principals should optimally leverage it. We model this interaction as a Stackelberg game where a principal, acting as the classifier, first announces its decision policy, and then strategic agents, acting as followers, manipulate their features to receive a desired outcome. Here, we focus on binary classifiers where agents manipulate observable features rather than their true features, and show that optimal abstention ensures that the principal's utility (or loss) is no worse than in a non-abstention setting, even in the presence of strategic agents. We also show that beyond improving accuracy, abstention can also serve as a deterrent to manipulation, making it costlier for agents, especially those less qualified, to manipulate to achieve a positive outcome when manipulation costs are significant enough to affect agent behavior. These results highlight abstention as a valuable tool for reducing the negative effects of strategic behavior in algorithmic decision making systems.
♻ ☆ Scaling Tractable Probabilistic Circuits: A Systems Perspective
Probabilistic Circuits (PCs) are a general framework for tractable deep generative models, which support exact and efficient probabilistic inference on their learned distributions. Recent modeling and training advancements have enabled their application to complex real-world tasks. However, the time and memory inefficiency of existing PC implementations hinders further scaling up. This paper proposes PyJuice, a general GPU implementation design for PCs that improves prior art in several regards. Specifically, PyJuice is 1-2 orders of magnitude faster than existing systems (including very recent ones) at training large-scale PCs. Moreover, PyJuice consumes 2-5x less GPU memory, which enables us to train larger models. At the core of our system is a compilation process that converts a PC into a compact representation amenable to efficient block-based parallelization, which significantly reduces IO and makes it possible to leverage Tensor Cores available in modern GPUs. Empirically, PyJuice can be used to improve state-of-the-art PCs trained on image (e.g., ImageNet32) and language (e.g., WikiText, CommonGen) datasets. We further establish a new set of baselines on natural image and language datasets by benchmarking existing PC structures but with much larger sizes and more training epochs, with the hope of incentivizing future research. Code is available at https://github.com/Tractables/pyjuice.
♻ ☆ Robust Offline Reinforcement Learning with Linearly Structured f-Divergence Regularization ICML 2025
The Robust Regularized Markov Decision Process (RRMDP) is proposed to learn policies robust to dynamics shifts by adding regularization to the transition dynamics in the value function. Existing methods mostly use unstructured regularization, potentially leading to conservative policies under unrealistic transitions. To address this limitation, we propose a novel framework, the $d$-rectangular linear RRMDP ($d$-RRMDP), which introduces latent structures into both transition kernels and regularization. We focus on offline reinforcement learning, where an agent learns policies from a precollected dataset in the nominal environment. We develop the Robust Regularized Pessimistic Value Iteration (R2PVI) algorithm that employs linear function approximation for robust policy learning in $d$-RRMDPs with $f$-divergence based regularization terms on transition kernels. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, demonstrating that these bounds are influenced by how well the dataset covers state-action spaces visited by the optimal robust policy under robustly admissible transitions. We establish information-theoretic lower bounds to verify that our algorithm is near-optimal. Finally, numerical experiments validate that R2PVI learns robust policies and exhibits superior computational efficiency compared to baseline methods.
comment: 41 pages, 3 figures, 2 tables. Published in Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)
♻ ☆ A Systematic Literature Review of Spatio-Temporal Graph Neural Network Models for Time Series Forecasting and Classification
In recent years, spatio-temporal graph neural networks (GNNs) have attracted considerable interest in the field of time series analysis, due to their ability to capture, at once, dependencies among variables and across time points. The objective of this systematic literature review is hence to provide a comprehensive overview of the various modeling approaches and application domains of GNNs for time series classification and forecasting. A database search was conducted, and 366 papers were selected for a detailed examination of the current state-of-the-art in the field. This examination is intended to offer to the reader a comprehensive review of proposed models, links to related source code, available datasets, benchmark models, and fitting results. All this information is hoped to assist researchers in their studies. To the best of our knowledge, this is the first and broadest systematic literature review presenting a detailed comparison of results from current spatio-temporal GNN models applied to different domains. In its final part, this review discusses current limitations and challenges in the application of spatio-temporal GNNs, such as comparability, reproducibility, explainability, poor information capacity, and scalability. This paper is complemented by a GitHub repository at https://github.com/FlaGer99/SLR-Spatio-Temporal-GNN.git providing additional interactive tools to further explore the presented findings.
comment: Complementary GitHub repository at https://github.com/FlaGer99/SLR-Spatio-Temporal-GNN.git
♻ ☆ Accelerating Diffusion LLMs via Adaptive Parallel Decoding
The generation speed of LLMs are bottlenecked by autoregressive decoding, where tokens are predicted sequentially one by one. Alternatively, diffusion large language models (dLLMs) theoretically allow for parallel token generation, but in practice struggle to achieve the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method puts forward three tunable parameters to flexibly tradeoff throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
comment: 10 pages, 5 figures
♻ ☆ Decoding Virtual Healthcare Success through Knowledge-Aware and Multimodal Predictive Modeling
Online healthcare consultations have transformed how patients seek medical advice, offering convenience while introducing new challenges for ensuring consultation success. Predicting whether an online consultation will be successful is critical for improving patient experiences and sustaining platform competitiveness. Yet, such prediction is inherently difficult due to the fragmented nature of patients' care journeys and the lack of integration between virtual and traditional healthcare systems. Furthermore, the data collected from online platforms, including textual conversations, interaction sequences, and behavioral traces, are often sparse and incomplete. This study develops a predictive modeling approach that fuses multimodal data and dynamically constructed knowledge networks to capture latent relationships among patients, physicians, and consultation contexts. By integrating heterogeneous information sources and uncovering the evolving structure of digital interactions, the model enhances the accuracy and interpretability of consultation success prediction. The findings offer implications for designing hybrid healthcare ecosystems that combine online and offline services through data-driven intelligence.
♻ ☆ AERO: Entropy-Guided Framework for Private LLM Inference
Privacy-preserving computation enables language model inference directly on encrypted data yet suffers from prohibitive latency and communication overheads, primarily due to nonlinear functions. Removing nonlinearities, however, can trigger one of two failure modes restricting the potential for nonlinearity removal: entropy collapse in deeper layers, which destabilizes training, and entropic overload in early layers, causing under-utilization of attention heads. To address these challenges, we introduce AERO, an entropy-guided framework to strategically eliminates costly nonlinear operations from transformer architectures, which employs an adaptive recalibration through a head-wise entropy regularizer with learnable per-head strengths, enabling each head to adjust its entropy level while penalizing extreme entropies and fostering functional diversity through a tolerance margin. Experiments show AERO can save 3.4$\times$ communication and 1.4$\times$ latency, without any performance penalty.
comment: Revised and retitled from "AERO: Softmax-Only LLMs for Efficient Private Inference''. This version focuses on the deployable AERO pipeline (LayerNorm-free + ReLU + entropy regularization); Softmax-only variants are kept as stress-test ablations
♻ ☆ Online Adaptation for Flying Quadrotors in Tight Formations
The task of flying in tight formations is challenging for teams of quadrotors because the complex aerodynamic wake interactions can destabilize individual team members as well as the team. Furthermore, these aerodynamic effects are highly nonlinear and fast-paced, making them difficult to model and predict. To overcome these challenges, we present L1 KNODE-DW MPC, an adaptive, mixed expert learning based control framework that allows individual quadrotors to accurately track trajectories while adapting to time-varying aerodynamic interactions during formation flights. We evaluate L1 KNODE-DW MPC in two different three-quadrotor formations and show that it outperforms several MPC baselines. Our results show that the proposed framework is capable of enabling the three-quadrotor team to remain vertically aligned in close proximity throughout the flight. These findings show that the L1 adaptive module compensates for unmodeled disturbances most effectively when paired with an accurate dynamics model. A video showcasing our framework and the physical experiments is available here: https://youtu.be/9QX1Q5Ut9Rs
comment: 10 pages, 4 figures
♻ ☆ Scalable Best-of-N Selection for Large Language Models via Self-Certainty NeurIPS 2025
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, like self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size N, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities. The code is available at https://github.com/backprop07/Self-Certainty
comment: NeurIPS 2025
♻ ☆ (How) Do Language Models Track State?
Transformer language models (LMs) exhibit behaviors -- from storytelling to code generation -- that seem to require tracking the unobserved state of an evolving world. How do they do this? We study state tracking in LMs trained or fine-tuned to compose permutations (i.e., to compute the order of a set of objects after a sequence of swaps). Despite the simple algebraic structure of this problem, many other tasks (e.g., simulation of finite automata and evaluation of boolean expressions) can be reduced to permutation composition, making it a natural model for state tracking in general. We show that LMs consistently learn one of two state tracking mechanisms for this task. The first closely resembles the "associative scan" construction used in recent theoretical work by Liu et al. (2023) and Merrill et al. (2024). The second uses an easy-to-compute feature (permutation parity) to partially prune the space of outputs, and then refines this with an associative scan. LMs that learn the former algorithm tend to generalize better and converge faster, and we show how to steer LMs toward one or the other with intermediate training tasks that encourage or suppress the heuristics. Our results demonstrate that transformer LMs, whether pre-trained or fine-tuned, can learn to implement efficient and interpretable state-tracking mechanisms, and the emergence of these mechanisms can be predicted and controlled.
comment: 21 pages, 17 figures, 1 table. Code: http://github.com/belindal/state-tracking
♻ ☆ Graph Semi-Supervised Learning for Point Classification on Data Manifolds
We propose a graph semi-supervised learning framework for classification tasks on data manifolds. Motivated by the manifold hypothesis, we model data as points sampled from a low-dimensional manifold $\mathcal{M} \subset \mathbb{R}^F$. The manifold is approximated in an unsupervised manner using a variational autoencoder (VAE), where the trained encoder maps data to embeddings that represent their coordinates in $\mathbb{R}^F$. A geometric graph is constructed with Gaussian-weighted edges inversely proportional to distances in the embedding space, transforming the point classification problem into a semi-supervised node classification task on the graph. This task is solved using a graph neural network (GNN). Our main contribution is a theoretical analysis of the statistical generalization properties of this data-to-manifold-to-graph pipeline. We show that, under uniform sampling from $\mathcal{M}$, the generalization gap of the semi-supervised task diminishes with increasing graph size, up to the GNN training error. Leveraging a training procedure which resamples a slightly larger graph at regular intervals during training, we then show that the generalization gap can be reduced even further, vanishing asymptotically. Finally, we validate our findings with numerical experiments on image classification benchmarks, demonstrating the empirical effectiveness of our approach.
comment: 16 pages, 3 figures
♻ ☆ GAIA: A Foundation Model for Operational Atmospheric Dynamics
We introduce GAIA (Geospatial Artificial Intelligence for Atmospheres), a hybrid self-supervised geospatial foundation model that fuses Masked Autoencoders (MAE) with self-distillation with no labels (DINO) to generate semantically rich representations from global geostationary satellite imagery. Pre-trained on 15 years of globally-merged infrared observations (2001-2015), GAIA learns disentangled representations that capture atmospheric dynamics rather than trivial diurnal patterns, as evidenced by distributed principal component structure and temporal coherence analysis. We demonstrate robust reconstruction capabilities across varying data availability (30-95% masking), achieving superior gap-filling performance on real missing data patterns. When transferred to downstream tasks, GAIA consistently outperforms an MAE-only baseline: improving atmospheric river segmentation (F1: 0.58 vs 0.52), enhancing tropical cyclone detection (storm-level recall: 81% vs 75%, early detection: 29% vs 17%), and maintaining competitive precipitation estimation performance. Analysis reveals that GAIA's hybrid objectives encourage learning of spatially coherent, object-centric features distributed across multiple principal components rather than concentrated representations focused on reconstruction. This work demonstrates that combining complementary self-supervised objectives yields more transferable representations for diverse atmospheric modeling tasks. Model weights and code are available at: https://huggingface.co/bcg-usra-nasa-gaia/GAIA-v1.
comment: 23 pages, 11 figures
♻ ☆ Reevaluating Theoretical Analysis Methods for Optimization in Deep Learning
There is a significant gap between our theoretical understanding of optimization algorithms used in deep learning and their practical performance. Theoretical development usually focuses on proving convergence guarantees under a variety of different assumptions, which are themselves often chosen based on a rough combination of intuitive match to practice and analytical convenience. In this paper, we carefully measure the degree to which the standard optimization analyses are capable of explaining modern algorithms. To do this, we develop new empirical metrics that compare real optimization behavior with analytically predicted behavior. Our investigation is notable for its tight integration with modern optimization analysis: rather than simply checking high-level assumptions made in the analysis (e.g. smoothness), we also verify key low-level identities used by the analysis to explain optimization behavior that might hold even if the high-level motivating assumptions do not. Notably, we find that smoothness-based analyses fail in practice under most scenarios, but the key identities commonly used in convex-optimization analyses often hold in practice despite the objective's global non-convexity.
♻ ☆ Agnostic Tomography of Stabilizer Product States
We define a quantum learning task called agnostic tomography, where given copies of an arbitrary state $\rho$ and a class of quantum states $\mathcal{C}$, the goal is to output a succinct description of a state that approximates $\rho$ at least as well as any state in $\mathcal{C}$ (up to some small error $\varepsilon$). This task generalizes ordinary quantum tomography of states in $\mathcal{C}$ and is more challenging because the learning algorithm must be robust to perturbations of $\rho$. We give an efficient agnostic tomography algorithm for the class $\mathcal{C}$ of $n$-qubit stabilizer product states. Assuming $\rho$ has fidelity at least $\tau$ with a stabilizer product state, the algorithm runs in time $n^{O(1 + \log(1/\tau))} / \varepsilon^2$. This runtime is quasipolynomial in all parameters, and polynomial if $\tau$ is a constant.
comment: 21 pages. V2: minor corrections. V3: addition of new references. V4: reworked the algorithm and presentation
♻ ☆ Is Limited Participant Diversity Impeding EEG-based Machine Learning?
The application of machine learning (ML) to electroencephalography (EEG) has great potential to advance both neuroscientific research and clinical applications. However, the generalisability and robustness of EEG-based ML models often hinge on the amount and diversity of training data. It is common practice to split EEG recordings into small segments, thereby increasing the number of samples substantially compared to the number of individual recordings or participants. We conceptualise this as a multi-level data generation process and investigate the scaling behaviour of model performance with respect to the overall sample size and the participant diversity through large-scale empirical studies. We then use the same framework to investigate the effectiveness of different ML strategies designed to address limited data problems: data augmentations and self-supervised learning. Our findings show that model performance scaling can be severely constrained by participant distribution shifts and provide actionable guidance for data collection and ML research. The code for our experiments is publicly available online.
♻ ☆ Hankel Singular Value Regularization for Highly Compressible State Space Models NeurIPS 2025
Deep neural networks using state space models as layers are well suited for long-range sequence tasks but can be challenging to compress after training. We use that regularizing the sum of Hankel singular values of state space models leads to a fast decay of these singular values and thus to compressible models. To make the proposed Hankel singular value regularization scalable, we develop an algorithm to efficiently compute the Hankel singular values during training iterations by exploiting the specific block-diagonal structure of the system matrices that we use in our state space model parametrization. Experiments on Long Range Arena benchmarks demonstrate that the regularized state space layers are up to 10$\times$ more compressible than standard state space layers while maintaining high accuracy.
comment: Accepted at NeurIPS 2025
♻ ☆ PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning
Benchmarks for competition-style reasoning have advanced evaluation in mathematics and programming, yet physics remains comparatively explored. Most existing physics benchmarks evaluate only final answers, which fail to capture reasoning processes, while recent stepwise methods rely on heuristic LLM-as-judge scoring or restrictive linear assumptions, limiting reliability and diagnostic validity. We introduce PRISM-Physics, a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas, explicitly encoding causal dependencies among intermediate steps to enable fine-grained, interpretable, and theoretically grounded scoring. We prove the optimality of the DAG representation and the corresponding scoring policy. Combining with a fully rule-based method for symbolic formula equivalence matching that we developed, we ensure consistent validation across diverse formulations without heuristic judgments. Results show that our evaluation framework is more aligned with human experts' scoring. Experiments on state-of-the-art LLMs reveal persistent reasoning failures in physics, while step-level scoring offers both diagnostic insight and rich signals for later training. By combining structural rigor, theoretical guarantees, and symbolic validation, PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities.
♻ ☆ ε-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data
Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce {\epsilon}-Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center-region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse (0.05% of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster wrt. the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose a MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of {\epsilon}-Seg and baseline methods on 2 dense EM datasets of biological tissues and demonstrate the applicability of our method also on fluorescence microscopy data. Our results show that {\epsilon}-Seg is capable of achieving competitive sparsely-supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available.
comment: 10 pages main text, 17 pages total
♻ ☆ RELATE: A Schema-Agnostic Perceiver Encoder for Multimodal Relational Graphs
Relational multi-table data is common in domains such as e-commerce, healthcare, and scientific research, and can be naturally represented as heterogeneous temporal graphs with multi-modal node attributes. Existing graph neural networks (GNNs) rely on schema-specific feature encoders, requiring separate modules for each node type and feature column, which hinders scalability and parameter sharing. We introduce RELATE (Relational Encoder for Latent Aggregation of Typed Entities), a schema-agnostic, plug-and-play feature encoder that can be used with any general purpose GNN. RELATE employs shared modality-specific encoders for categorical, numerical, textual, and temporal attributes, followed by a Perceiver-style cross-attention module that aggregates features into a fixed-size, permutation-invariant node representation. We evaluate RELATE on ReLGNN and HGT in the RelBench benchmark, where it achieves performance within 3% of schema-specific encoders while reducing parameter counts by up to 5x. This design supports varying schemas and enables multi-dataset pretraining for general-purpose GNNs, paving the way toward foundation models for relational graph data.
comment: 6 pages
♻ ☆ TokenBlowUp: Resolving Representational Singularities in LLM Token Spaces via Monoidal Transformations
Recent work has provided compelling evidence challenging the foundational manifold hypothesis for the token embedding spaces of Large Language Models (LLMs). These findings reveal the presence of geometric singularities around polysemous tokens, which can lead to representational instability. Existing methodologies, which presuppose a smooth data manifold, are ill-equipped to address such intrinsic structural flaws. In this paper, we formalize this problem in the language of scheme theory and propose a rigorous resolution by applying the scheme-theoretic blow-up at each singular point. This procedure replaces a singular point in the ambient affine scheme with its exceptional divisor, which we identify as a canonical geometric space -- a projective space of directions -- that houses the disambiguated semantic meanings of the token. This process of ``representational desingularization'' constructs a new geometric landscape for embeddings. We prove a formal theorem guaranteeing the geometric regularization of this new space, showing that the original pathologies are resolved. Finally, we outline the architectural implications of our framework, arguing for a paradigm shift from static look-ups to dynamic, geometrically-grounded computation.
comment: Multiple senior math professors pointed out technical flaws of this work, i.e., weak connection between the proposed work and LLM applications. In order not to mislead readers, we would like to withdraw this preprint
♻ ☆ Graph Diffusion that can Insert and Delete NeurIPS 2025
Generative models of graphs based on discrete Denoising Diffusion Probabilistic Models (DDPMs) offer a principled approach to molecular generation by systematically removing structural noise through iterative atom and bond adjustments. However, existing formulations are fundamentally limited by their inability to adapt the graph size (that is, the number of atoms) during the diffusion process, severely restricting their effectiveness in conditional generation scenarios such as property-driven molecular design, where the targeted property often correlates with the molecular size. In this paper, we reformulate the noising and denoising processes to support monotonic insertion and deletion of nodes. The resulting model, which we call GrIDDD, dynamically grows or shrinks the chemical graph during generation. GrIDDD matches or exceeds the performance of existing graph diffusion models on molecular property targeting despite being trained on a more difficult problem. Furthermore, when applied to molecular optimization, GrIDDD exhibits competitive performance compared to specialized optimization models. This work paves the way for size-adaptive molecular generation with graph diffusion.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025). Equation 2 is now correct
Graphics 8
☆ OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes
There are two prevalent ways to constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.
comment: Project page: https://yukun-huang.github.io/OmniX/
☆ SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting
Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate See4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.
comment: 26 pages; 21 figures; 3 tables; project page: https://see-4d.github.io/
☆ HEIR: Learning Graph-Based Motion Hierarchies
Hierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks. Project Page: https://light.princeton.edu/HEIR/
comment: Code link: https://github.com/princeton-computational-imaging/HEIR
☆ The Impact and Outlook of 3D Gaussian Splatting
Since its introduction, 3D Gaussian Splatting (3DGS) has rapidly transformed the landscape of 3D scene representations, inspiring an extensive body of associated research. Follow-up work includes analyses and contributions that enhance the efficiency, scalability, and real-world applicability of 3DGS. In this summary, we present an overview of several key directions that have emerged in the wake of 3DGS. We highlight advances enabling resource-efficient training and rendering, the evolution toward dynamic (or four-dimensional, 4DGS) representations, and deeper exploration of the mathematical foundations underlying its appearance modeling and rendering process. Furthermore, we examine efforts to bring 3DGS to mobile and virtual reality platforms, its extension to massive-scale environments, and recent progress toward near-instant radiance field reconstruction via feed-forward or distributed computation. Collectively, these developments illustrate how 3DGS has evolved from a breakthrough representation into a versatile and foundational tool for 3D vision and graphics.
comment: Article written for Frontiers of Science Award, International Congress on Basic Science, 2025
☆ Look at That Distractor: Dynamic Translation Gain under Low Perceptual Load in Virtual Reality
Redirected walking utilizes gain adjustments within perceptual thresholds to allow natural navigation in large scale virtual environments within confined physical environments. Previous research has found that when users are distracted by some scene elements, they are less sensitive to gain values. However, the effects on detection thresholds have not been quantitatively measured. In this paper, we present a novel method that dynamically adjusts translation gain by leveraging visual distractors. We place distractors within the user's field of view and apply a larger translation gain when their attention is drawn to them. Because the magnitude of gain adjustment depends on the user's level of engagement with the distractors, the redirection process remains smooth and unobtrusive. To evaluate our method, we developed a task oriented virtual environment for a user study. Results show that introducing distractors in the virtual environment significantly raises users' translation gain thresholds. Furthermore, assessments using the Simulator Sickness Questionnaire and Igroup Presence Questionnaire indicate that the method maintains user comfort and acceptance, supporting its effectiveness for RDW systems.
☆ StructLayoutFormer:Conditional Structured Layout Generation via Structure Serialization and Disentanglement
Structured layouts are preferable in many 2D visual contents (\eg, GUIs, webpages) since the structural information allows convenient layout editing. Computational frameworks can help create structured layouts but require heavy labor input. Existing data-driven approaches are effective in automatically generating fixed layouts but fail to produce layout structures. We present StructLayoutFormer, a novel Transformer-based approach for conditional structured layout generation. We use a structure serialization scheme to represent structured layouts as sequences. To better control the structures of generated layouts, we disentangle the structural information from the element placements. Our approach is the first data-driven approach that achieves conditional structured layout generation and produces realistic layout structures explicitly. We compare our approach with existing data-driven layout generation approaches by including post-processing for structure extraction. Extensive experiments have shown that our approach exceeds these baselines in conditional structured layout generation. We also demonstrate that our approach is effective in extracting and transferring layout structures. The code is publicly available at %\href{https://github.com/Teagrus/StructLayoutFormer} {https://github.com/Teagrus/StructLayoutFormer}.
♻ ☆ Disentangled 4D Gaussian Splatting: Rendering High-Resolution Dynamic World at 343 FPS
While dynamic novel view synthesis from 2D videos has seen progress, achieving efficient reconstruction and rendering of dynamic scenes remains a challenging task. In this paper, we introduce Disentangled 4D Gaussian Splatting (Disentangled4DGS), a novel representation and rendering pipeline that achieves real-time performance without compromising visual fidelity. Disentangled4DGS decouples the temporal and spatial components of 4D Gaussians, avoiding the need for slicing first and four-dimensional matrix calculations in prior methods. By projecting temporal and spatial deformations into dynamic 2D Gaussians and deferring temporal processing, we minimize redundant computations of 4DGS. Our approach also features a gradient-guided flow loss and temporal splitting strategy to reduce artifacts. Experiments demonstrate a significant improvement in rendering speed and quality, achieving 343 FPS when render 1352*1014 resolution images on a single RTX3090 while reducing storage requirements by at least 4.5%. Our approach sets a new benchmark for dynamic novel view synthesis, outperforming existing methods on both multi-view and monocular dynamic scene datasets.
♻ ☆ 3D Dynamic Fluid Assets from Single-View Videos with Generative Gaussian Splatting
While the generation of 3D content from single-view images has been extensively studied, the creation of physically consistent 3D dynamic scenes from videos remains in its early stages. We propose a novel framework leveraging generative 3D Gaussian Splatting (3DGS) models to extract and re-simulate 3D dynamic fluid objects from single-view videos using simulation methods. The fluid geometry represented by 3DGS is initially generated and optimized from single-view images, then denoised, densified, and aligned across frames. We estimate the fluid surface velocity using optical flow, propose a mainstream extraction algorithm to refine it. The 3D volumetric velocity field is then derived from the velocity of the fluid's enclosed surface. The velocity field is therewith converted into a divergence-free, grid-based representation, enabling the optimization of simulation parameters through its differentiability across frames. This process outputs simulation-ready fluid assets with physical dynamics closely matching those observed in the source video. Our approach is applicable to various liquid fluids, including inviscid and viscous types, and allows users to edit the output geometry or extend movement durations seamlessly. This automatic method for creating 3D dynamic fluid assets from single-view videos, easily obtainable from the internet, shows great potential for generating large-scale 3D fluid assets at a low cost.
Robotics 56
☆ STITCH 2.0: Extending Augmented Suturing with EKF Needle Estimation and Thread Management
Surgical suturing is a high-precision task that impacts patient healing and scarring. Suturing skill varies widely between surgeons, highlighting the need for robot assistance. Previous robot suturing works, such as STITCH 1.0 [1], struggle to fully close wounds due to inaccurate needle tracking and poor thread management. To address these challenges, we present STITCH 2.0, an elevated augmented dexterity pipeline with seven improvements including: improved EKF needle pose estimation, new thread untangling methods, and an automated 3D suture alignment algorithm. Experimental results over 15 trials find that STITCH 2.0 on average achieves 74.4% wound closure with 4.87 sutures per trial, representing 66% more sutures in 38% less time compared to the previous baseline. When two human interventions are allowed, STITCH 2.0 averages six sutures with 100% wound closure rate. Project website: https://stitch-2.github.io/
comment: Published in RA-L 2025
☆ GET-USE: Learning Generalized Tool Usage for Bimanual Mobile Manipulation via Simulated Embodiment Extensions
The ability to use random objects as tools in a generalizable manner is a missing piece in robots' intelligence today to boost their versatility and problem-solving capabilities. State-of-the-art robotic tool usage methods focused on procedurally generating or crowd-sourcing datasets of tools for a task to learn how to grasp and manipulate them for that task. However, these methods assume that only one object is provided and that it is possible, with the correct grasp, to perform the task; they are not capable of identifying, grasping, and using the best object for a task when many are available, especially when the optimal tool is absent. In this work, we propose GeT-USE, a two-step procedure that learns to perform real-robot generalized tool usage by learning first to extend the robot's embodiment in simulation and then transferring the learned strategies to real-robot visuomotor policies. Our key insight is that by exploring a robot's embodiment extensions (i.e., building new end-effectors) in simulation, the robot can identify the general tool geometries most beneficial for a task. This learned geometric knowledge can then be distilled to perform generalized tool usage tasks by selecting and using the best available real-world object as tool. On a real robot with 22 degrees of freedom (DOFs), GeT-USE outperforms state-of-the-art methods by 30-60% success rates across three vision-based bimanual mobile manipulation tool-usage tasks.
comment: 8 pages, 7 figures
☆ Modeling Collapse of Steered Vine Robots Under Their Own Weight
Soft, vine-inspired growing robots that move by eversion are highly mobile in confined environments, but, when faced with gaps in the environment, they may collapse under their own weight while navigating a desired path. In this work, we present a comprehensive collapse model that can predict the collapse length of steered robots in any shape using true shape information and tail tension. We validate this model by collapsing several unsteered robots without true shape information. The model accurately predicts the trends of those experiments. We then attempt to collapse a robot steered with a single actuator at different orientations. Our models accurately predict collapse when it occurs. Finally, we demonstrate how this could be used in the field by having a robot attempt a gap-crossing task with and without inflating its actuators. The robot needs its actuators inflated to cross the gap without collapsing, which our model supports. Our model has been specifically tested on straight and series pouch motor-actuated robots made of non-stretchable material, but it could be applied to other robot variations. This work enables us to model the robot's collapse behavior in any open environment and understand the parameters it needs to succeed in 3D navigation tasks.
☆ Robotic Assistant: Completing Collaborative Tasks with Dexterous Vision-Language-Action Models
We adapt a pre-trained Vision-Language-Action (VLA) model (Open-VLA) for dexterous human-robot collaboration with minimal language prompting. Our approach adds (i) FiLM conditioning to visual backbones for task-aware perception, (ii) an auxiliary intent head that predicts collaborator hand pose and target cues, and (iii) action-space post-processing that predicts compact deltas (position/rotation) and PCA-reduced finger joints before mapping to full commands. Using a multi-view, teleoperated Franka and Mimic-hand dataset augmented with MediaPipe hand poses, we demonstrate that delta actions are well-behaved and that four principal components explain ~96% of hand-joint variance. Ablations identify action post-processing as the primary performance driver; auxiliary intent helps, FiLM is mixed, and a directional motion loss is detrimental. A real-time stack (~0.3 s latency on one RTX 4090) composes "pick-up" and "pass" into a long-horizon behavior. We surface "trainer overfitting" to specific demonstrators as the key limitation.
☆ Collision avoidance and path finding in a robotic mobile fulfillment system using multi-objective meta-heuristics
Multi-Agent Path Finding (MAPF) has gained significant attention, with most research focusing on minimizing collisions and travel time. This paper also considers energy consumption in the path planning of automated guided vehicles (AGVs). It addresses two main challenges: i) resolving collisions between AGVs and ii) assigning tasks to AGVs. We propose a new collision avoidance strategy that takes both energy use and travel time into account. For task assignment, we present two multi-objective algorithms: Non-Dominated Sorting Genetic Algorithm (NSGA) and Adaptive Large Neighborhood Search (ALNS). Comparative evaluations show that these proposed methods perform better than existing approaches in both collision avoidance and task assignment.
☆ Learning to Plan & Schedule with Reinforcement-Learned Bimanual Robot Skills
Long-horizon contact-rich bimanual manipulation presents a significant challenge, requiring complex coordination involving a mixture of parallel execution and sequential collaboration between arms. In this paper, we introduce a hierarchical framework that frames this challenge as an integrated skill planning & scheduling problem, going beyond purely sequential decision-making to support simultaneous skill invocation. Our approach is built upon a library of single-arm and bimanual primitive skills, each trained using Reinforcement Learning (RL) in GPU-accelerated simulation. We then train a Transformer-based planner on a dataset of skill compositions to act as a high-level scheduler, simultaneously predicting the discrete schedule of skills as well as their continuous parameters. We demonstrate that our method achieves higher success rates on complex, contact-rich tasks than end-to-end RL approaches and produces more efficient, coordinated behaviors than traditional sequential-only planners.
☆ Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA's hidden representations and analyze attention maps, further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
comment: 13 pages, 6 figures
☆ Incorporating Social Awareness into Control of Unknown Multi-Agent Systems: A Real-Time Spatiotemporal Tubes Approach
This paper presents a decentralized control framework that incorporates social awareness into multi-agent systems with unknown dynamics to achieve prescribed-time reach-avoid-stay tasks in dynamic environments. Each agent is assigned a social awareness index that quantifies its level of cooperation or self-interest, allowing heterogeneous social behaviors within the system. Building on the spatiotemporal tube (STT) framework, we propose a real-time STT framework that synthesizes tubes online for each agent while capturing its social interactions with others. A closed-form, approximation-free control law is derived to ensure that each agent remains within its evolving STT, thereby avoiding dynamic obstacles while also preventing inter-agent collisions in a socially aware manner, and reaching the target within a prescribed time. The proposed approach provides formal guarantees on safety and timing, and is computationally lightweight, model-free, and robust to unknown disturbances. The effectiveness and scalability of the framework are validated through simulation and hardware experiments on a 2D omnidirectional
☆ Using VLM Reasoning to Constrain Task and Motion Planning ICRA 2026
In task and motion planning, high-level task planning is done over an abstraction of the world to enable efficient search in long-horizon robotics problems. However, the feasibility of these task-level plans relies on the downward refinability of the abstraction into continuous motion. When a domain's refinability is poor, task-level plans that appear valid may ultimately fail during motion planning, requiring replanning and resulting in slower overall performance. Prior works mitigate this by encoding refinement issues as constraints to prune infeasible task plans. However, these approaches only add constraints upon refinement failure, expending significant search effort on infeasible branches. We propose VIZ-COAST, a method of leveraging the common-sense spatial reasoning of large pretrained Vision-Language Models to identify issues with downward refinement a priori, bypassing the need to fix these failures during planning. Experiments on two challenging TAMP domains show that our approach is able to extract plausible constraints from images and domain descriptions, drastically reducing planning times and, in some cases, eliminating downward refinement failures altogether, generalizing to a diverse range of instances from the broader domain.
comment: 8 pages, 7 figures, 1 table. Submitted to ICRA 2026
☆ Octopus-like Reaching Motion: A Perspective Inspired by Whipping
The stereotypical reaching motion of the octopus arm has drawn growing attention for its efficient control of a highly deformable body. Previous studies suggest that its characteristic bend propagation may share underlying principles with the dynamics of a whip. This work investigates whether whip-like passive dynamics in water can reproduce the kinematic features observed in biological reaching and their similarities and differences. Platform-based whipping tests were performed in water and air while systematically varying material stiffness and driving speed. Image-based quantification revealed that the Ecoflex Gel 2 arm driven at 150 rpm (motor speed) reproduced curvature propagation similar to that observed in octopus reaching. However, its bend-point velocity decreased monotonically rather than exhibiting the biological bell-shaped profile, confirming that the octopus reaching movement is not merely a passive whipping behavior. The absence of propagation in air further highlights the critical role of the surrounding medium in forming octopus-like reaching motion. This study provides a new perspective for understand biological reaching movement, and offers a potential platform for future hydrodynamic research.
comment: The first two listed authors contributed equally. Yiyuan Zhang is the corresponding author
☆ Combining Moving Mass Actuators and Manoeuvring Models for Underwater Vehicles: A Lagrangian Approach
In this paper, we present a Newton-Euler formulation of the equations of motion for underwater vehicles with an interntal moving mass actuator. Furthermore, the moving mass dynamics are expressed as an extension to the manoeuvring model for underwater vehicles, originally introduced by Fossen (1991). The influence of the moving mass is described in body-frame and included as states in both an additional kinematic equation and as part of the coupled rigid-body kinetics of the underwater vehicle. The Coriolis-centripetal effects are derived from Kirchhoff's equations and the hydrostatics are derived using first principals. The proposed Newton-Euler model is validated through simulation and compared with the traditional Hamiltonian internal moving mass actuator formulation.
comment: \c{opyright} 2025 Alexander Rambech, Ivar Saksvik and Vahid Hassani. Accepted by IFAC for publication under a Creative Commons License CC-BY-NC-ND
☆ SPADE: Sparsity Adaptive Depth Estimator for Zero-Shot, Real-Time, Monocular Depth Estimation in Underwater Environments
Underwater infrastructure requires frequent inspection and maintenance due to harsh marine conditions. Current reliance on human divers or remotely operated vehicles is limited by perceptual and operational challenges, especially around complex structures or in turbid water. Enhancing the spatial awareness of underwater vehicles is key to reducing piloting risks and enabling greater autonomy. To address these challenges, we present SPADE: SParsity Adaptive Depth Estimator, a monocular depth estimation pipeline that combines pre-trained relative depth estimator with sparse depth priors to produce dense, metric scale depth maps. Our two-stage approach first scales the relative depth map with the sparse depth points, then refines the final metric prediction with our proposed Cascade Conv-Deformable Transformer blocks. Our approach achieves improved accuracy and generalisation over state-of-the-art baselines and runs efficiently at over 15 FPS on embedded hardware, promising to support practical underwater inspection and intervention. This work has been submitted to IEEE Journal of Oceanic Engineering Special Issue of AUV 2026.
☆ Solving the Right Problem with Multi-Robot Formations
Formation control simplifies minimizing multi-robot cost functions by encoding a cost function as a shape the robots maintain. However, by reducing complex cost functions to formations, discrepancies arise between maintaining the shape and minimizing the original cost function. For example, a Diamond or Box formation shape is often used for protecting all members of the formation. When more information about the surrounding environment becomes available, a static shape often no longer minimizes the original protection cost. We propose a formation planner to reduce mismatch between a formation and the cost function while still leveraging efficient formation controllers. Our formation planner is a two-step optimization problem that identifies desired relative robot positions. We first solve a constrained problem to estimate non-linear and non-differentiable costs with a weighted sum of surrogate cost functions. We theoretically analyze this problem and identify situations where weights do not need to be updated. The weighted, surrogate cost function is then minimized using relative positions between robots. The desired relative positions are realized using a non-cooperative formation controller derived from Lyapunov's direct approach. We then demonstrate the efficacy of this approach for military-like costs such as protection and obstacle avoidance. In simulations, we show a formation planner can reduce a single cost by over 75%. When minimizing a variety of cost functions simultaneously, using a formation planner with adaptive weights can reduce the cost by 20-40%. Formation planning provides better performance by minimizing a surrogate cost function that closely approximates the original cost function instead of relying on a shape abstraction.
comment: Submitted to SAE WCX 2026
☆ Sim-to-Real Gentle Manipulation of Deformable and Fragile Objects with Stress-Guided Reinforcement Learning
Robotic manipulation of deformable and fragile objects presents significant challenges, as excessive stress can lead to irreversible damage to the object. While existing solutions rely on accurate object models or specialized sensors and grippers, this adds complexity and often lacks generalization. To address this problem, we present a vision-based reinforcement learning approach that incorporates a stress-penalized reward to discourage damage to the object explicitly. In addition, to bootstrap learning, we incorporate offline demonstrations as well as a designed curriculum progressing from rigid proxies to deformables. We evaluate the proposed method in both simulated and real-world scenarios, showing that the policy learned in simulation can be transferred to the real world in a zero-shot manner, performing tasks such as picking up and pushing tofu. Our results show that the learned policies exhibit a damage-aware, gentle manipulation behavior, demonstrating their effectiveness by decreasing the stress applied to fragile objects by 36.5% while achieving the task goals, compared to vanilla RL policies.
comment: Under review
☆ Integrating Legal and Logical Specifications in Perception, Prediction, and Planning for Automated Driving: A Survey of Methods
This survey provides an analysis of current methodologies integrating legal and logical specifications into the perception, prediction, and planning modules of automated driving systems. We systematically explore techniques ranging from logic-based frameworks to computational legal reasoning approaches, emphasizing their capability to ensure regulatory compliance and interpretability in dynamic and uncertain driving environments. A central finding is that significant challenges arise at the intersection of perceptual reliability, legal compliance, and decision-making justifiability. To systematically analyze these challenges, we introduce a taxonomy categorizing existing approaches by their theoretical foundations, architectural implementations, and validation strategies. We particularly focus on methods that address perceptual uncertainty and incorporate explicit legal norms, facilitating decisions that are both technically robust and legally defensible. The review covers neural-symbolic integration methods for perception, logic-driven rule representation, and norm-aware prediction strategies, all contributing toward transparent and accountable autonomous vehicle operation. We highlight critical open questions and practical trade-offs that must be addressed, offering multidisciplinary insights from engineering, logic, and law to guide future developments in legally compliant autonomous driving systems.
comment: Accepted to 2025 IEEE International Automated Vehicle Validation Conference (IAVVC)
☆ Geometric Robot Calibration Using a Calibration Plate
In this paper a new method for geometric robot calibration is introduced, which uses a calibration plate with precisely known distances between its measuring points. The relative measurement between two points on the calibration plate is used to determine predefined error parameters of the system. In comparison to conventional measurement methods, like laser tracker or motion capture systems, the calibration plate provides a more mechanically robust and cheaper alternative, which is furthermore easier to transport due to its small size. The calibration method, the plate design, the mathematical description of the error system as well as the identification of the parameters are described in detail. For identifying the error parameters, the least squares method and a constrained optimization problem are used. The functionality of this method was demonstrated in experiments that led to promising results, correlated with one of a laser tracker calibration. The modeling and identification of the error parameters is done for a gantry machine, but is not restricted to that type of robot.
comment: pp 309-317
☆ An approach for combining transparency and motion assistance of a lower body exoskeleton
In this paper, an approach for gait assistance with a lower body exoskeleton is described. Two concepts, transparency and motion assistance, are combined. The transparent mode, where the system is following the user's free motion with a minimum of perceived interaction forces, is realized by exploiting the gear backlash of the actuation units. During walking a superimposed assistance mode applies an additional torque guiding the legs to their estimated future position. The concept of adaptive oscillators is utilized to learn the quasi-periodic signals typical for locomotion. First experiments showed promising results.
comment: 8 pages
☆ Seeing Clearly and Deeply: An RGBD Imaging Approach with a Bio-inspired Monocentric Design
Achieving high-fidelity, compact RGBD imaging presents a dual challenge: conventional compact optics struggle with RGB sharpness across the entire depth-of-field, while software-only Monocular Depth Estimation (MDE) is an ill-posed problem reliant on unreliable semantic priors. While deep optics with elements like DOEs can encode depth, they introduce trade-offs in fabrication complexity and chromatic aberrations, compromising simplicity. To address this, we first introduce a novel bio-inspired all-spherical monocentric lens, around which we build the Bionic Monocentric Imaging (BMI) framework, a holistic co-design. This optical design naturally encodes depth into its depth-varying Point Spread Functions (PSFs) without requiring complex diffractive or freeform elements. We establish a rigorous physically-based forward model to generate a synthetic dataset by precisely simulating the optical degradation process. This simulation pipeline is co-designed with a dual-head, multi-scale reconstruction network that employs a shared encoder to jointly recover a high-fidelity All-in-Focus (AiF) image and a precise depth map from a single coded capture. Extensive experiments validate the state-of-the-art performance of the proposed framework. In depth estimation, the method attains an Abs Rel of 0.026 and an RMSE of 0.130, markedly outperforming leading software-only approaches and other deep optics systems. For image restoration, the system achieves an SSIM of 0.960 and a perceptual LPIPS score of 0.082, thereby confirming a superior balance between image fidelity and depth accuracy. This study illustrates that the integration of bio-inspired, fully spherical optics with a joint reconstruction algorithm constitutes an effective strategy for addressing the intrinsic challenges in high-performance compact RGBD imaging. Source code will be publicly available at https://github.com/ZongxiYu-ZJU/BMI.
comment: The source code will be publicly available at https://github.com/ZongxiYu-ZJU/BMI
☆ Development of Implicit-Explicit Control Based Amphibious Centipede-Type Robot and Evaluation of its Mobile Performance
Multi-legged mobile robots possess high mobility performance in rough terrain environments, stemming from their high postural stability, joint flexibility, and the redundancy provided by multiple legs. In prior research on navigating between different environments such as land and water, the primary strategy employed involves switching to a controller that generates an appropriate gait for the new environment upon entering it. However, designing appropriate gaits for each complex and diverse environment and accurately determining controller switching for each environment is challenging. Therefore, this research develops a centipede-type mobile robot that navigates both aquatic and terrestrial environments with a simple, unified control scheme, based on the implicit-explicit control philosophy and by ingeniously designing the robot's body structure. In this research, we developed the robot featuring flexible joints and left and right legs on each body segment and focused on the leg structure which has extensive contact with the environment. This paper evaluates the locomotion performance on land and water using the three developed leg structures, using the robot's leg slip rate and actuator energy consumption as evaluation metrics. The experimental results confirmed the existence of an appropriate leg structure capable of navigating both aquatic and terrestrial environments under identical control.
☆ SynHLMA:Synthesizing Hand Language Manipulation for Articulated Object with Discrete Human Object Interaction Representation
Generating hand grasps with language instructions is a widely studied topic that benefits from embodied AI and VR/AR applications. While transferring into hand articulatied object interaction (HAOI), the hand grasps synthesis requires not only object functionality but also long-term manipulation sequence along the object deformation. This paper proposes a novel HAOI sequence generation framework SynHLMA, to synthesize hand language manipulation for articulated objects. Given a complete point cloud of an articulated object, we utilize a discrete HAOI representation to model each hand object interaction frame. Along with the natural language embeddings, the representations are trained by an HAOI manipulation language model to align the grasping process with its language description in a shared representation space. A joint-aware loss is employed to ensure hand grasps follow the dynamic variations of articulated object joints. In this way, our SynHLMA achieves three typical hand manipulation tasks for articulated objects of HAOI generation, HAOI prediction and HAOI interpolation. We evaluate SynHLMA on our built HAOI-lang dataset and experimental results demonstrate the superior hand grasp sequence generation performance comparing with state-of-the-art. We also show a robotics grasp application that enables dexterous grasps execution from imitation learning using the manipulation sequence provided by our SynHLMA. Our codes and datasets will be made publicly available.
☆ Time-Optimal Transport of Loosely Placed Liquid Filled Cups along Prescribed Paths
Handling loosely placed objects with robotic manipulators is a difficult task from the point of view of trajectory planning and control. This becomes even more challenging when the object to be handled is a container filled with liquid. This paper addresses the task of transporting a liquid-filled cup placed on a tray along a prescribed path in shortest time. The objective is to minimize swapping, thus avoiding spillage of the fluid. To this end, the sloshing dynamics is incorporated into the dynamic model used within the optimal control problem formulation. The optimization problem is solved using a direct multiple shooting approach.
☆ One-shot Humanoid Whole-body Motion Learning
Whole-body humanoid motion represents a cornerstone challenge in robotics, integrating balance, coordination, and adaptability to enable human-like behaviors. However, existing methods typically require multiple training samples per motion category, rendering the collection of high-quality human motion datasets both labor-intensive and costly. To address this, we propose a novel approach that trains effective humanoid motion policies using only a single non-walking target motion sample alongside readily available walking motions. The core idea lies in leveraging order-preserving optimal transport to compute distances between walking and non-walking sequences, followed by interpolation along geodesics to generate new intermediate pose skeletons, which are then optimized for collision-free configurations and retargeted to the humanoid before integration into a simulated environment for policy training via reinforcement learning. Experimental evaluations on the CMU MoCap dataset demonstrate that our method consistently outperforms baselines, achieving superior performance across metrics. Code will be released upon acceptance.
comment: 10 pages, 3 figures, 5 tables
☆ Hybrid Vision Servoing with Depp Alignment and GRU-Based Occlusion Recovery
Vision-based control systems, such as image-based visual servoing (IBVS), have been extensively explored for precise robot manipulation. A persistent challenge, however, is maintaining robust target tracking under partial or full occlusions. Classical methods like Lucas-Kanade (LK) offer lightweight tracking but are fragile to occlusion and drift, while deep learning-based approaches often require continuous visibility and intensive computation. To address these gaps, we propose a hybrid visual tracking framework that bridges advanced perception with real-time servo control. First, a fast global template matcher constrains the pose search region; next, a deep-feature Lucas-Kanade module operating on early VGG layers refines alignment to sub-pixel accuracy (<2px); then, a lightweight residual regressor corrects local misalignments caused by texture degradation or partial occlusion. When visual confidence falls below a threshold, a GRU-based predictor seamlessly extrapolates pose updates from recent motion history. Crucially, the pipeline's final outputs-translation, rotation, and scale deltas-are packaged as direct control signals for 30Hz image-based servo loops. Evaluated on handheld video sequences with up to 90% occlusion, our system sustains under 2px tracking error, demonstrating the robustness and low-latency precision essential for reliable real-world robot vision applications.
☆ RoadSens-4M: A Multimodal Smartphone & Camera Dataset for Holistic Road-way Analysis
It's important to monitor road issues such as bumps and potholes to enhance safety and improve road conditions. Smartphones are equipped with various built-in sensors that offer a cost-effective and straightforward way to assess road quality. However, progress in this area has been slow due to the lack of high-quality, standardized datasets. This paper discusses a new dataset created by a mobile app that collects sensor data from devices like GPS, accelerometers, gyroscopes, magnetometers, gravity sensors, and orientation sensors. This dataset is one of the few that integrates Geographic Information System (GIS) data with weather information and video footage of road conditions, providing a comprehensive understanding of road issues with geographic context. The dataset allows for a clearer analysis of road conditions by compiling essential data, including vehicle speed, acceleration, rotation rates, and magnetic field intensity, along with the visual and spatial context provided by GIS, weather, and video data. Its goal is to provide funding for initiatives that enhance traffic management, infrastructure development, road safety, and urban planning. Additionally, the dataset will be publicly accessible to promote further research and innovation in smart transportation systems.
☆ SoraNav: Adaptive UAV Task-Centric Navigation via Zeroshot VLM Reasoning
Interpreting visual observations and natural language instructions for complex task execution remains a key challenge in robotics and AI. Despite recent advances, language-driven navigation is still difficult, particularly for UAVs in small-scale 3D environments. Existing Vision-Language Navigation (VLN) approaches are mostly designed for ground robots and struggle to generalize to aerial tasks that require full 3D spatial reasoning. The emergence of large Vision-Language Models (VLMs), such as GPT and Claude, enables zero-shot semantic reasoning from visual and textual inputs. However, these models lack spatial grounding and are not directly applicable to navigation. To address these limitations, SoraNav is introduced, an adaptive UAV navigation framework that integrates zero-shot VLM reasoning with geometry-aware decision-making. Geometric priors are incorporated into image annotations to constrain the VLM action space and improve decision quality. A hybrid switching strategy leverages navigation history to alternate between VLM reasoning and geometry-based exploration, mitigating dead-ends and redundant revisits. A PX4-based hardware-software platform, comprising both a digital twin and a physical micro-UAV, enables reproducible evaluation. Experimental results show that in 2.5D scenarios, our method improves Success Rate (SR) by 25.7% and Success weighted by Path Length (SPL) by 17%. In 3D scenarios, it improves SR by 29.5% and SPL by 18.5% relative to the baseline.
☆ Learning Spatial-Aware Manipulation Ordering NeurIPS 2025
Manipulation in cluttered environments is challenging due to spatial dependencies among objects, where an improper manipulation order can cause collisions or blocked access. Existing approaches often overlook these spatial relationships, limiting their flexibility and scalability. To address these limitations, we propose OrderMind, a unified spatial-aware manipulation ordering framework that directly learns object manipulation priorities based on spatial context. Our architecture integrates a spatial context encoder with a temporal priority structuring module. We construct a spatial graph using k-Nearest Neighbors to aggregate geometric information from the local layout and encode both object-object and object-manipulator interactions to support accurate manipulation ordering in real-time. To generate physically and semantically plausible supervision signals, we introduce a spatial prior labeling method that guides a vision-language model to produce reasonable manipulation orders for distillation. We evaluate OrderMind on our Manipulation Ordering Benchmark, comprising 163,222 samples of varying difficulty. Extensive experiments in both simulation and real-world environments demonstrate that our method significantly outperforms prior approaches in effectiveness and efficiency, enabling robust manipulation in cluttered scenes.
comment: Accepted to NeurIPS 2025
☆ NanoVLA: Routing Decoupled Vision-Language Understanding for Nano-sized Generalist Robotic Policies
Vision-language-action (VLA) models have significantly advanced robotic manipulation by integrating vision-language models (VLMs), and action decoders into a unified architecture. However, their deployment on resource-constrained edge devices, such as mobile robots or embedded systems (e.g., Jetson Orin Nano), remains challenging due to high computational demands, especially in real-world scenarios where power, latency, and computational resources are critical. To close this gap, we introduce Nano-scale Vision-Language Action (NanoVLA), a family of lightweight VLA architectures that achieve high performance with minimal resources. Our core innovations include: (1) vision-language decoupling that moves conventional early vision and language inputs fusion in VLM to late stage, achieving better performance while enabling caching and reduce inference overhead and latency; (2) long-short action chunking to ensure smooth, coherent multi-step planning without sacrificing real-time responsiveness; (3) dynamic routing that adaptively assigns lightweight or heavy backbones based on task complexity, further optimizing inference efficiency. Experimental results on several benchmarks, as well as real-world deployments, demonstrate that NanoVLA achieves up to 52x faster inference on edge devices compared to previous state-of-the-art VLA models, with 98% less parameters while maintaining or surpassing their task accuracy and generalization. Ablation studies confirm that our decoupling strategy preserves cross-task transferability, and the routing module enhances cost-performance trade-offs, enabling practical, high-precision robotic manipulation on resource-constrained hardware.
☆ Mean-Shift Theory and Its Applications in Swarm Robotics: A New Way to Enhance the Efficiency of Multi-Robot Collaboration
Swarms evolving from collective behaviors among multiple individuals are commonly seen in nature, which enables biological systems to exhibit more efficient and robust collaboration. Creating similar swarm intelligence in engineered robots poses challenges to the design of collaborative algorithms that can be programmed at large scales. The assignment-based method has played an eminent role for a very long time in solving collaboration problems of robot swarms. However, it faces fundamental limitations in terms of efficiency and robustness due to its unscalability to swarm variants. This article presents a tutorial review on recent advances in assignment-free collaboration of robot swarms, focusing on the problem of shape formation. A key theoretical component is the recently developed \emph{mean-shift exploration} strategy, which improves the collaboration efficiency of large-scale swarms by dozens of times. Further, the efficiency improvement is more significant as the swarm scale increases. Finally, this article discusses three important applications of the mean-shift exploration strategy, including precise shape formation, area coverage formation, and maneuvering formation, as well as their corresponding industrial scenarios in smart warehousing, area exploration, and cargo transportation.
☆ Non-Invasive Calibration Of A Stewart Platform By Photogrammetry
Accurate calibration of a Stewart platform is important for their precise and efficient operation. However, the calibration of these platforms using forward kinematics is a challenge for researchers because forward kinematics normally generates multiple feasible and unfeasible solutions for any pose of the moving platform. The complex kinematic relations among the six actuator paths connecting the fixed base to the moving platform further compound the difficulty in establishing a straightforward and efficient calibration method. The authors developed a new forward kinematics-based calibration method using Denavit-Hartenberg convention and used the Stewart platform Tiger 66.1 developed in their lab for experimenting with the photogrammetry-based calibration strategies described in this paper. This system became operational upon completion of construction, marking its inaugural use. The authors used their calibration model for estimating the errors in the system and adopted three compensation options or strategies as per Least Square method to improve the accuracy of the system. These strategies leveraged a high-resolution digital camera and off-the-shelf software to capture the poses of the moving platform's center. This process is non-invasive and does not need any additional equipment to be attached to the hexapod or any alteration of the hexapod hardware. This photogrammetry-based calibration process involves multiple high-resolution images from different angles to measure the position and orientation of the platform center in the three-dimensional space. The Target poses and Actual poses are then compared, and the error compensations are estimated using the Least-Squared methods to calculate the Predicted poses. Results from each of the three compensation approaches demonstrated noticeable enhancements in platform pose accuracies, suggesting room for further improvements.
comment: The International Journal of Advanced Manufacturing Technology, 2024
☆ Scalable predictive processing framework for multitask caregiving robots
The rapid aging of societies is intensifying demand for autonomous care robots; however, most existing systems are task-specific and rely on handcrafted preprocessing, limiting their ability to generalize across diverse scenarios. A prevailing theory in cognitive neuroscience proposes that the human brain operates through hierarchical predictive processing, which underlies flexible cognition and behavior by integrating multimodal sensory signals. Inspired by this principle, we introduce a hierarchical multimodal recurrent neural network grounded in predictive processing under the free-energy principle, capable of directly integrating over 30,000-dimensional visuo-proprioceptive inputs without dimensionality reduction. The model was able to learn two representative caregiving tasks, rigid-body repositioning and flexible-towel wiping, without task-specific feature engineering. We demonstrate three key properties: (i) self-organization of hierarchical latent dynamics that regulate task transitions, capture variability in uncertainty, and infer occluded states; (ii) robustness to degraded vision through visuo-proprioceptive integration; and (iii) asymmetric interference in multitask learning, where the more variable wiping task had little influence on repositioning, whereas learning the repositioning task led to a modest reduction in wiping performance, while the model maintained overall robustness. Although the evaluation was limited to simulation, these results establish predictive processing as a universal and scalable computational principle, pointing toward robust, flexible, and autonomous caregiving robots while offering theoretical insight into the human brain's ability to achieve flexible adaptation in uncertain real-world environments.
☆ Large Language Model-assisted Autonomous Vehicle Recovery from Immobilization
Despite significant advancements in recent decades, autonomous vehicles (AVs) continue to face challenges in navigating certain traffic scenarios where human drivers excel. In such situations, AVs often become immobilized, disrupting overall traffic flow. Current recovery solutions, such as remote intervention (which is costly and inefficient) and manual takeover (which excludes non-drivers and limits AV accessibility), are inadequate. This paper introduces StuckSolver, a novel Large Language Model (LLM) driven recovery framework that enables AVs to resolve immobilization scenarios through self-reasoning and/or passenger-guided decision-making. StuckSolver is designed as a plug-in add-on module that operates on top of the AV's existing perception-planning-control stack, requiring no modification to its internal architecture. Instead, it interfaces with standard sensor data streams to detect immobilization states, interpret environmental context, and generate high-level recovery commands that can be executed by the AV's native planner. We evaluate StuckSolver on the Bench2Drive benchmark and in custom-designed uncertainty scenarios. Results show that StuckSolver achieves near-state-of-the-art performance through autonomous self-reasoning alone and exhibits further improvements when passenger guidance is incorporated.
comment: 8 pages
☆ RADRON: Cooperative Localization of Ionizing Radiation Sources by MAVs with Compton Cameras
We present a novel approach to localizing radioactive material by cooperating Micro Aerial Vehicles (MAVs). Our approach utilizes a state-of-the-art single-detector Compton camera as a highly sensitive, yet miniature detector of ionizing radiation. The detector's exceptionally low weight (40 g) opens up new possibilities of radiation detection by a team of cooperating agile MAVs. We propose a new fundamental concept of fusing the Compton camera measurements to estimate the position of the radiation source in real time even from extremely sparse measurements. The data readout and processing are performed directly onboard and the results are used in a dynamic feedback to drive the motion of the vehicles. The MAVs are stabilized in a tightly cooperating swarm to maximize the information gained by the Compton cameras, rapidly locate the radiation source, and even track a moving radiation source.
comment: 8 pages, 9 figures, submitted for review to IEEE RA-L
☆ DARTS: A Drone-Based AI-Powered Real-Time Traffic Incident Detection System
Rapid and reliable incident detection is critical for reducing crash-related fatalities, injuries, and congestion. However, conventional methods, such as closed-circuit television, dashcam footage, and sensor-based detection, separate detection from verification, suffer from limited flexibility, and require dense infrastructure or high penetration rates, restricting adaptability and scalability to shifting incident hotspots. To overcome these challenges, we developed DARTS, a drone-based, AI-powered real-time traffic incident detection system. DARTS integrates drones' high mobility and aerial perspective for adaptive surveillance, thermal imaging for better low-visibility performance and privacy protection, and a lightweight deep learning framework for real-time vehicle trajectory extraction and incident detection. The system achieved 99% detection accuracy on a self-collected dataset and supports simultaneous online visual verification, severity assessment, and incident-induced congestion propagation monitoring via a web-based interface. In a field test on Interstate 75 in Florida, DARTS detected and verified a rear-end collision 12 minutes earlier than the local transportation management center and monitored incident-induced congestion propagation, suggesting potential to support faster emergency response and enable proactive traffic control to reduce congestion and secondary crash risk. Crucially, DARTS's flexible deployment architecture reduces dependence on frequent physical patrols, indicating potential scalability and cost-effectiveness for use in remote areas and resource-constrained settings. This study presents a promising step toward a more flexible and integrated real-time traffic incident detection system, with significant implications for the operational efficiency and responsiveness of modern transportation management.
comment: Preprint version. This manuscript is currently under review at Transportation Research Part C: Emerging Technologies. The PDF corresponds to the version submitted in June 2025. The main findings of this work were recognized with the Best Intelligent Transportation Systems Paper Award at the 2025 TRB Annual Meeting
☆ A New Type of Axis-Angle Attitude Control Law for Rotational Systems: Synthesis, Analysis, and Experiments
Over the past few decades, continuous quaternion-based attitude control has been proven highly effective for driving rotational systems that can be modeled as rigid bodies, such as satellites and drones. However, methods rooted in this approach do not enforce the existence of a unique closed-loop (CL) equilibrium attitude-error quaternion (AEQ); and, for rotational errors about the attitude-error Euler axis larger than {\pi}rad, their proportional-control effect diminishes as the system state moves away from the stable equilibrium of the CL rotational dynamics. In this paper, we introduce a new type of attitude control law that more effectively leverages the attitude-error Euler axis-angle information to guarantee a unique CL equilibrium AEQ and to provide greater flexibility in the use of proportional-control efforts. Furthermore, using two different control laws as examples-through the construction of a strict Lyapunov function for the CL dynamics-we demonstrate that the resulting unique equilibrium of the CL rotational system can be enforced to be uniformly asymptotically stable. To assess and demonstrate the functionality and performance of the proposed approach, we performed numerical simulations and executed dozens of real-time tumble-recovery maneuvers using a small quadrotor. These simulations and flight tests compellingly demonstrate that the proposed axis-angle-based method achieves superior flight performance-compared with that obtained using a high-performance quaternion-based controller-in terms of stabilization time.
comment: 2025 International Conference on Advanced Robotics (ICAR)
☆ Curvature-Aware Calibration of Tactile Sensors for Accurate Force Estimation on Non-Planar Surfaces
Flexible tactile sensors are increasingly used in real-world applications such as robotic grippers, prosthetic hands, wearable gloves, and assistive devices, where they need to conform to curved and irregular surfaces. However, most existing tactile sensors are calibrated only on flat substrates, and their accuracy and consistency degrade once mounted on curved geometries. This limitation restricts their reliability in practical use. To address this challenge, we develop a calibration model for a widely used resistive tactile sensor design that enables accurate force estimation on one-dimensional curved surfaces. We then train a neural network (a multilayer perceptron) to predict local curvature from baseline sensor outputs recorded under no applied load, achieving an R2 score of 0.91. The proposed approach is validated on five daily objects with varying curvatures under forces from 2 N to 8 N. Results show that the curvature-aware calibration maintains consistent force accuracy across all surfaces, while flat-surface calibration underestimates force as curvature increases. Our results demonstrate that curvature-aware modeling improves the accuracy, consistency, and reliability of flexible tactile sensors, enabling dependable performance across real-world applications.
comment: This work has been submitted to the IEEE for possible publication
☆ WaveVerif: Acoustic Side-Channel based Verification of Robotic Workflows
In this paper, we present a framework that uses acoustic side-channel analysis (ASCA) to monitor and verify whether a robot correctly executes its intended commands. We develop and evaluate a machine-learning-based workflow verification system that uses acoustic emissions generated by robotic movements. The system can determine whether real-time behavior is consistent with expected commands. The evaluation takes into account movement speed, direction, and microphone distance. The results show that individual robot movements can be validated with over 80% accuracy under baseline conditions using four different classifiers: Support Vector Machine (SVM), Deep Neural Network (DNN), Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN). Additionally, workflows such as pick-and-place and packing could be identified with similarly high confidence. Our findings demonstrate that acoustic signals can support real-time, low-cost, passive verification in sensitive robotic environments without requiring hardware modifications.
comment: 11 pages, 3 figures, Corresponding Author: Prof. Shishir Nagaraja (shishir.nagaraja@newcastle.ac.uk)
☆ Risk-Aware Safety Filters with Poisson Safety Functions and Laplace Guidance Fields
Robotic systems navigating in real-world settings require a semantic understanding of their environment to properly determine safe actions. This work aims to develop the mathematical underpinnings of such a representation--specifically, the goal is to develop safety filters that are risk-aware. To this end, we take a two step approach: encoding an understanding of the environment via Poisson's equation, and associated risk via Laplace guidance fields. That is, we first solve a Dirichlet problem for Poisson's equation to generate a safety function that encodes system safety as its 0-superlevel set. We then separately solve a Dirichlet problem for Laplace's equation to synthesize a safe \textit{guidance field} that encodes variable levels of caution around obstacles -- by enforcing a tunable flux boundary condition. The safety function and guidance fields are then combined to define a safety constraint and used to synthesize a risk-aware safety filter which, given a semantic understanding of an environment with associated risk levels of environmental features, guarantees safety while prioritizing avoidance of higher risk obstacles. We demonstrate this method in simulation and discuss how \textit{a priori} understandings of obstacle risk can be directly incorporated into the safety filter to generate safe behaviors that are risk-aware.
☆ BikeScenes: Online LiDAR Semantic Segmentation for Bicycles
The vulnerability of cyclists, exacerbated by the rising popularity of faster e-bikes, motivates adapting automotive perception technologies for bicycle safety. We use our multi-sensor 'SenseBike' research platform to develop and evaluate a 3D LiDAR segmentation approach tailored to bicycles. To bridge the automotive-to-bicycle domain gap, we introduce the novel BikeScenes-lidarseg Dataset, comprising 3021 consecutive LiDAR scans around the university campus of the TU Delft, semantically annotated for 29 dynamic and static classes. By evaluating model performance, we demonstrate that fine-tuning on our BikeScenes dataset achieves a mean Intersection-over-Union (mIoU) of 63.6%, significantly outperforming the 13.8% obtained with SemanticKITTI pre-training alone. This result underscores the necessity and effectiveness of domain-specific training. We highlight key challenges specific to bicycle-mounted, hardware-constrained perception systems and contribute the BikeScenes dataset as a resource for advancing research in cyclist-centric LiDAR segmentation.
☆ Debate2Create: Robot Co-design via Large Language Model Debates
Automating the co-design of a robot's morphology and control is a long-standing challenge due to the vast design space and the tight coupling between body and behavior. We introduce Debate2Create (D2C), a framework in which large language model (LLM) agents engage in a structured dialectical debate to jointly optimize a robot's design and its reward function. In each round, a design agent proposes targeted morphological modifications, and a control agent devises a reward function tailored to exploit the new design. A panel of pluralistic judges then evaluates the design-control pair in simulation and provides feedback that guides the next round of debate. Through iterative debates, the agents progressively refine their proposals, producing increasingly effective robot designs. Notably, D2C yields diverse and specialized morphologies despite no explicit diversity objective. On a quadruped locomotion benchmark, D2C discovers designs that travel 73% farther than the default, demonstrating that structured LLM-based debate can serve as a powerful mechanism for emergent robot co-design. Our results suggest that multi-agent debate, when coupled with physics-grounded feedback, is a promising new paradigm for automated robot design.
☆ Enhancing Underwater Object Detection through Spatio-Temporal Analysis and Spatial Attention Networks
This study examines the effectiveness of spatio-temporal modeling and the integration of spatial attention mechanisms in deep learning models for underwater object detection. Specifically, in the first phase, the performance of temporal-enhanced YOLOv5 variant T-YOLOv5 is evaluated, in comparison with the standard YOLOv5. For the second phase, an augmented version of T-YOLOv5 is developed, through the addition of a Convolutional Block Attention Module (CBAM). By examining the effectiveness of the already pre-existing YOLOv5 and T-YOLOv5 models and of the newly developed T-YOLOv5 with CBAM. With CBAM, the research highlights how temporal modeling improves detection accuracy in dynamic marine environments, particularly under conditions of sudden movements, partial occlusions, and gradual motion. The testing results showed that YOLOv5 achieved a mAP@50-95 of 0.563, while T-YOLOv5 and T-YOLOv5 with CBAM outperformed with mAP@50-95 scores of 0.813 and 0.811, respectively, highlighting their superior accuracy and generalization in detecting complex objects. The findings demonstrate that T-YOLOv5 significantly enhances detection reliability compared to the standard model, while T-YOLOv5 with CBAM further improves performance in challenging scenarios, although there is a loss of accuracy when it comes to simpler scenarios.
☆ Risk-Aware Safety Filters with Poisson Safety Functions and Laplace Guidance Fields
Robotic systems navigating in real-world settings require a semantic understanding of their environment to properly determine safe actions. This work aims to develop the mathematical underpinnings of such a representation -- specifically, the goal is to develop safety filters that are risk-aware. To this end, we take a two step approach: encoding an understanding of the environment via Poisson's equation, and associated risk via Laplace guidance fields. That is, we first solve a Dirichlet problem for Poisson's equation to generate a safety function that encodes system safety as its 0-superlevel set. We then separately solve a Dirichlet problem for Laplace's equation to synthesize a safe \textit{guidance field} that encodes variable levels of caution around obstacles -- by enforcing a tunable flux boundary condition. The safety function and guidance fields are then combined to define a safety constraint and used to synthesize a risk-aware safety filter which, given a semantic understanding of an environment with associated risk levels of environmental features, guarantees safety while prioritizing avoidance of higher risk obstacles. We demonstrate this method in simulation and discuss how \textit{a priori} understandings of obstacle risk can be directly incorporated into the safety filter to generate safe behaviors that are risk-aware.
☆ Force Characterization of Insect-Scale Aquatic Propulsion Based on Fluid-Structure Interaction
We present force characterizations of two newly developed insect-scale propulsors--one single-tailed and one double-tailed--for microrobotic swimmers that leverage fluid-structure interaction (FSI) to generate thrust. The designs of these two devices were inspired by anguilliform swimming and are driven by soft tails excited by high-work-density (HWD) actuators powered by shape-memory alloy (SMA) wires. While these propulsors have been demonstrated to be suitable for microrobotic aquatic locomotion and controllable with simple architectures for trajectory tracking in the two-dimensional (2D) space, the characteristics and magnitudes of the associated forces have not been studied systematically. In the research presented here, we adopted a theoretical framework based on the notion of reactive forces and obtained experimental data for characterization using a custom-built micro-N-resolution force sensor. We measured maximum and cycle-averaged force values with multi-test means of respectively 0.45 mN and 2.97 micro-N, for the tested single-tail propulsor. For the dual-tail propulsor, we measured maximum and cycle-averaged force values with multi-test means of 0.61 mN and 22.6 micro-N, respectively. These results represent the first measurements of the instantaneous thrust generated by insect-scale propulsors of this type and provide insights into FSI for efficient microrobotic propulsion.
comment: To be presented at ICAR 2025 in San Juan, Argentina
♻ ☆ Dual-Regularized Riccati Recursions for Interior-Point Optimal Control
We derive closed-form extensions of Riccati's recursions (both sequential and parallel) for solving dual-regularized LQR problems. We show how these methods can be used to solve general constrained, non-convex, discrete-time optimal control problems via a regularized interior point method, while guaranteeing that each step is a descent direction of an Augmented Barrier-Lagrangian merit function. We provide MIT-licensed implementations of our methods in C++ and JAX.
♻ ☆ RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision-Language-Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.
♻ ☆ Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles
Road traffic accidents remain a significant global concern, with human error, particularly distracted and impaired driving, among the leading causes. This study introduces a novel driver behaviour classification system that uses external observation techniques to detect indicators of distraction and impairment. The proposed framework employs advanced computer vision methodologies, including real-time object tracking, lateral displacement analysis, and lane position monitoring. The system identifies unsafe driving behaviours such as excessive lateral movement and erratic trajectory patterns by implementing the YOLO object detection model and custom lane estimation algorithms. Unlike systems reliant on inter-vehicular communication, this vision-based approach enables behavioural analysis of non-connected vehicles. Experimental evaluations on diverse video datasets demonstrate the framework's reliability and adaptability across varying road and environmental conditions.
♻ ☆ Taxonomy and Trends in Reinforcement Learning for Robotics and Control Systems: A Structured Review
Reinforcement learning (RL) has become a foundational approach for enabling intelligent robotic behavior in dynamic and uncertain environments. This work presents an in-depth review of RL principles, advanced deep reinforcement learning (DRL) algorithms, and their integration into robotic and control systems. Beginning with the formalism of Markov Decision Processes (MDPs), the study outlines essential elements of the agent-environment interaction and explores core algorithmic strategies including actor-critic methods, value-based learning, and policy gradients. Emphasis is placed on modern DRL techniques such as DDPG, TD3, PPO, and SAC, which have shown promise in solving high-dimensional, continuous control tasks. A structured taxonomy is introduced to categorize RL applications across domains such as locomotion, manipulation, multi-agent coordination, and human-robot interaction, along with training methodologies and deployment readiness levels. The review synthesizes recent research efforts, highlighting technical trends, design patterns, and the growing maturity of RL in real-world robotics. Overall, this work aims to bridge theoretical advances with practical implementations, providing a consolidated perspective on the evolving role of RL in autonomous robotic systems.
♻ ☆ SNN-Based Online Learning of Concepts and Action Laws in an Open World
We present the architecture of a fully autonomous, bio-inspired cognitive agent built around a spiking neural network (SNN) implementing the agent's semantic memory. This agent explores its universe and learns concepts of objects/situations and of its own actions in a one-shot manner. While object/situation concepts are unary, action concepts are triples made up of an initial situation, a motor activity, and an outcome. They embody the agent's knowledge of its universe's action laws. Both kinds of concepts have different degrees of generality. To make decisions the agent queries its semantic memory for the expected outcomes of envisaged actions and chooses the action to take on the basis of these predictions. Our experiments show that the agent handles new situations by appealing to previously learned general concepts and rapidly modifies its concepts to adapt to environment changes.
♻ ☆ Redistributing Rewards Across Time and Agents for Multi-Agent Reinforcement Learning
Credit assignmen, disentangling each agent's contribution to a shared reward, is a critical challenge in cooperative multi-agent reinforcement learning (MARL). To be effective, credit assignment methods must preserve the environment's optimal policy. Some recent approaches attempt this by enforcing return equivalence, where the sum of distributed rewards must equal the team reward. However, their guarantees are conditional on a learned model's regression accuracy, making them unreliable in practice. We introduce Temporal-Agent Reward Redistribution (TAR$^2$), an approach that decouples credit modeling from this constraint. A neural network learns unnormalized contribution scores, while a separate, deterministic normalization step enforces return equivalence by construction. We demonstrate that this method is equivalent to a valid Potential-Based Reward Shaping (PBRS), which guarantees the optimal policy is preserved regardless of model accuracy. Empirically, on challenging SMACLite and Google Research Football (GRF) benchmarks, TAR$^2$ accelerates learning and achieves higher final performance than strong baselines. These results establish our method as an effective solution for the agent-temporal credit assignment problem.
comment: 16 pages, 4 figures, 4 tables
♻ ☆ Optimal Kinematic Synthesis and Prototype Development of Knee Exoskeleton
The range of rotation (RoR) in a knee exoskeleton is a critical factor in rehabilitation, as it directly influences joint mobility, muscle activation, and recovery outcomes. A well-designed RoR ensures that patients achieve near-natural knee kinematics, which is essential for restoring gait patterns and preventing compensatory movements. This paper presents optimal design of one degree of freedom knee exoskeleton. In kinematic analysis, the existing design being represented by nonlinear and nonconvex mathematical functions. To obtain feasible and optimum measurement of the links of knee exoskeleton, an optimization problem is formulated based on the kinematic analysis and average human's leg measurement. The optimized solution increases the range of motion of knee exoskeleton during sit to stand motion by $24 \%$ as compared with inspired design. Furthermore, misalignment study is conducted by comparing the trajectory of human's knee and exoskeleton's knee during sit to stand motion. The joint movement is calculated using marker and camera system. Finally, a prototype of the knee joint exoskeleton is being developed based on optimal dimensions which validate the maximum range of motion achieved during simulation.
♻ ☆ Multi-robot Motion Planning based on Nets-within-Nets Modeling and Simulation
This paper focuses on designing motion plans for a heterogeneous team of robots that must cooperate to fulfill a global mission. Robots move in an environment that contains some regions of interest, while the specification for the entire team can include avoidance, visits, or sequencing of these regions of interest. The mission is expressed in terms of a Petri net corresponding to an automaton, while each robot is also modeled by a state machine Petri net. The current work brings about the following contributions with respect to existing solutions for related problems. First, we propose a novel model, denoted High-Level robot team Petri Net (HLrtPN) system, to incorporate the specification and robot models into the Nets-within-Nets paradigm. A guard function, named Global Enabling Function, is designed to synchronize the firing of transitions so that robot motions do not violate the specification. Then, the solution is found by simulating the HLrtPN system in a specific software tool that accommodates Nets-within-Nets. Illustrative examples based on Linear Temporal Logic missions support the computational feasibility of the proposed framework.
comment: [Note for readers] This paper has been extended from a previous submission to 62nd IEEE Conference on Decision and Control, Dec. 13-15, 2023. This work has been submitted to the IEEE for possible publication
♻ ☆ Control Modes of Teleoperated Surgical Robotic System's Tools in Ophthalmic Surgery
The introduction of a teleoperated surgical robotic system designed for minimally invasive procedures enables the emulation of two distinct control modes through a dedicated input device of the surgical console: (1) Inside Control Mode, which emulates tool manipulation near the distal end as if the surgeon was holding the tip of the instrument inside the patient's body; (2) Outside Control Mode, which emulates manipulation near the proximal end as if the surgeon was holding the tool externally. The aim of this research is to compare the surgeon's performance on these two modes of operation along with various scaling factors in a simulated vitreoretinal surgical setting. The console of Intraocular Robotic Interventional Surgical System (IRISS) was utilized but the surgical robot itself and the human eye anatomy was simulated by a virtual environment projected microscope view of an intraocular setup to a VR headset. Five experienced vitreoretinal surgeons and five subjects with no surgical experience used the system to perform four fundamental tool/tissue tasks common to vitreoretinal surgery: touch and reset; grasp and drop; inject; circular tracking. Results indicate that Inside Control outperforms Outside Control across multiple tasks and metrics. Higher scaling factors generally performed better, particularly for reducing trajectory errors and tissue damage. This improvement suggests that larger scaling factors enable more precise control, making them the preferred option for fine manipulation. However, completion time was not consistently reduced across all conditions, indicating that surgeons need to balance speed and accuracy based on surgical requirements. By optimizing control dynamics and user interface, robotic teleoperation has the potential to reduce complications, enhance dexterity, and expand the accessibility of high precision procedures to a broader range of practitioners.
comment: 10 pages, 11 figures
♻ ☆ RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation NeurIPS 2025
Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs' strengths in semantic reasoning and long-horizon planning. These System 2 capabilities-characterized by deliberative, goal-directed thinking-remain under explored due to the limited temporal scale and structural complexity of current benchmarks. To address this gap, we introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. RoboCerebra includes: (1) a large-scale simulation dataset with extended task horizons and diverse subtask sequences in household environments; (2) a hierarchical framework combining a high-level VLM planner with a low-level vision-language-action (VLA) controller; and (3) an evaluation protocol targeting planning, reflection, and memory through structured System 1-System 2 interaction. The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Human operators execute the subtasks in simulation, yielding high-quality trajectories with dynamic object variations. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations. We further benchmark state-of-the-art VLMs as System 2 modules and analyze their performance across key cognitive dimensions, advancing the development of more capable and generalizable robotic planners.
comment: 25 pages, 18 figures, Accepted by NeurIPS 2025
♻ ☆ ES-HPC-MPC: Exponentially Stable Hybrid Perception Constrained MPC for Quadrotor with Suspended Payloads
Aerial transportation using quadrotors with cable-suspended payloads holds great potential for applications in disaster response, logistics, and infrastructure maintenance. However, their hybrid and underactuated dynamics pose significant control and perception challenges. Traditional approaches often assume a taut cable condition, limiting their effectiveness in real-world applications where slack-to-taut transitions occur due to disturbances. We introduce ES-HPC-MPC, a model predictive control framework that enforces exponential stability and perception-constrained control under hybrid dynamics. Our method leverages Exponentially Stabilizing Control Lyapunov Functions (ES-CLFs) to enforce stability during the tasks and Control Barrier Functions (CBFs) to maintain the payload within the onboard camera's field of view (FoV). We validate our method through both simulation and real-world experiments, demonstrating stable trajectory tracking and reliable payload perception. We validate that our method maintains stability and satisfies perception constraints while tracking dynamically infeasible trajectories and when the system is subjected to hybrid mode transitions caused by unexpected disturbances.
comment: Accepted to IEEE Robotics and Automation Letters
♻ ☆ A Constrained Saddle Search Approach for Constructing Singular and Flexible Bar Frameworks
Singularity analysis is essential in robot kinematics, as singular configurations cause loss of control and kinematic indeterminacy. This paper models singularities in bar frameworks as saddle points on constrained manifolds. Given an under-constrained, non-singular bar framework, by allowing one edge to vary its length while fixing lengths of others, we define the squared length of the free edge as an energy functional and show that its local saddle points correspond to singular and flexible frameworks. Using our constrained saddle search approach, we identify previously unknown singular and flexible bar frameworks, providing new insights into singular robotics design and analysis.
comment: 9 pages, 3 figures
♻ ☆ STATE-NAV: Stability-Aware Traversability Estimation for Bipedal Navigation on Rough Terrain
Bipedal robots have advantages in maneuvering human-centered environments, but face greater failure risk compared to other stable mobile plarforms such as wheeled or quadrupedal robots. While learning-based traversability has been widely studied for these platforms, bipedal traversability has instead relied on manually designed rules with limited consideration of locomotion stability on rough terrain. In this work, we present the first learning-based traversability estimation and risk-sensitive navigation framework for bipedal robots operating in diverse, uneven environments. TravFormer, a transformer-based neural network, is trained to predict bipedal instability with uncertainty, enabling risk-aware and adaptive planning. Based on the network, we define traversability as stability-aware command velocity-the fastest command velocity that keeps instability below a user-defined limit. This velocity-based traversability is integrated into a hierarchical planner that combines traversability-informed Rapid Random Tree Star (TravRRT*) for time-efficient planning and Model Predictive Control (MPC) for safe execution. We validate our method in MuJoCo simulation and the real world, demonstrating improved navigation performance, with enhanced robustness and time efficiency across varying terrains compared to existing methods.
♻ ☆ SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation
Large-scale robot learning has recently shown promise for enabling robots to perform complex tasks by integrating perception, control, and language understanding. Yet, it struggles with long-horizon, contact-rich manipulation such as deformable object handling, where demonstration quality is inconsistent. Reward modeling offers a natural solution: by providing grounded progress signals, it transforms noisy demonstrations into stable supervision that generalizes across diverse trajectories. We introduce a stage-aware, video-based reward modeling framework that jointly predicts high-level task stages and fine-grained progress. Reward labels are automatically derived from natural language subtask annotations, ensuring consistent progress estimation across variable-length demonstrations. This design overcomes frame-index labeling, which fails in variable-duration tasks like folding a T-shirt. Our reward model demonstrates robustness to variability, generalization to out-of-distribution settings, and strong utility for policy training. Building on it, we propose Reward-Aligned Behavior Cloning (RA-BC), which filters high-quality data and reweights samples by reward. Experiments show the reward model alone outperforms baselines on validation and real robot rollouts. Integrated into RA-BC, our approach achieves 83% success on folding T-shirts from the flattened state and 67% from the crumpled state -- far surpassing vanilla behavior cloning, which attains only 8% and 0% success. Overall, our results highlight reward modeling as a key enabler for scalable, annotation-efficient, and robust imitation learning in long-horizon manipulation.
Computer Vision and Pattern Recognition 124
☆ VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning
Visual effects (VFX) are crucial to the expressive power of digital media, yet their creation remains a major challenge for generative AI. Prevailing methods often rely on the one-LoRA-per-effect paradigm, which is resource-intensive and fundamentally incapable of generalizing to unseen effects, thus limiting scalability and creation. To address this challenge, we introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content. In addition, it demonstrates remarkable generalization to unseen effect categories. Specifically, we design an in-context conditioning strategy that prompts the model with a reference example. An in-context attention mask is designed to precisely decouple and inject the essential effect attributes, allowing a single unified model to master the effect imitation without information leakage. In addition, we propose an efficient one-shot effect adaptation mechanism to boost generalization capability on tough unseen effects from a single user-provided video rapidly. Extensive experiments demonstrate that our method effectively imitates various categories of effect information and exhibits outstanding generalization to out-of-domain effects. To foster future research, we will release our code, models, and a comprehensive dataset to the community.
comment: Project Page URL:https://libaolu312.github.io/VFXMaster/
☆ FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility.
☆ Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks
Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updated information about this survey, codes and implementation of the open benchmarks can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.
☆ Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation
Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates the alignment between the draft and target model outputs, coupled with the inadequate use of the two-dimensional spatial structure inherent in images, thereby limiting the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models, while preserving both image fidelity and diversity.
☆ Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning
Training deep neural networks (DNNs) with backpropagation (BP) achieves state-of-the-art accuracy but requires global error propagation and full parameterization, leading to substantial memory and computational overhead. Direct Feedback Alignment (DFA) enables local, parallelizable updates with lower memory requirements but is limited by unstructured feedback and poor scalability in deeper architectures, specially convolutional neural networks. To address these limitations, we propose a structured local learning framework that operates directly on low-rank manifolds defined by the Singular Value Decomposition (SVD) of weight matrices. Each layer is trained in its decomposed form, with updates applied to the SVD components using a composite loss that integrates cross-entropy, subspace alignment, and orthogonality regularization. Feedback matrices are constructed to match the SVD structure, ensuring consistent alignment between forward and feedback pathways. Our method reduces the number of trainable parameters relative to the original DFA model, without relying on pruning or post hoc compression. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method achieves accuracy comparable to that of BP. Ablation studies confirm the importance of each loss term in the low-rank setting. These results establish local learning on low-rank manifolds as a principled and scalable alternative to full-rank gradient-based training.
☆ RegionE: Adaptive Region-Aware Generation for Efficient Image Editing
Recently, instruction-based image editing (IIE) has received widespread attention. In practice, IIE often modifies only specific regions of an image, while the remaining areas largely remain unchanged. Although these two types of regions differ significantly in generation difficulty and computational redundancy, existing IIE models do not account for this distinction, instead applying a uniform generation process across the entire image. This motivates us to propose RegionE, an adaptive, region-aware generation framework that accelerates IIE tasks without additional training. Specifically, the RegionE framework consists of three main components: 1) Adaptive Region Partition. We observed that the trajectory of unedited regions is straight, allowing for multi-step denoised predictions to be inferred in a single step. Therefore, in the early denoising stages, we partition the image into edited and unedited regions based on the difference between the final estimated result and the reference image. 2) Region-Aware Generation. After distinguishing the regions, we replace multi-step denoising with one-step prediction for unedited areas. For edited regions, the trajectory is curved, requiring local iterative denoising. To improve the efficiency and quality of local iterative generation, we propose the Region-Instruction KV Cache, which reduces computational cost while incorporating global information. 3) Adaptive Velocity Decay Cache. Observing that adjacent timesteps in edited regions exhibit strong velocity similarity, we further propose an adaptive velocity decay cache to accelerate the local denoising process. We applied RegionE to state-of-the-art IIE base models, including Step1X-Edit, FLUX.1 Kontext, and Qwen-Image-Edit. RegionE achieved acceleration factors of 2.57, 2.41, and 2.06. Evaluations by GPT-4o confirmed that semantic and perceptual fidelity were well preserved.
comment: 26 pages, 10 figures, 18 tables
☆ Comparative Study of UNet-based Architectures for Liver Tumor Segmentation in Multi-Phase Contrast-Enhanced Computed Tomography
Segmentation of liver structures in multi-phase contrast-enhanced computed tomography (CECT) plays a crucial role in computer-aided diagnosis and treatment planning for liver diseases, including tumor detection. In this study, we investigate the performance of UNet-based architectures for liver tumor segmentation, starting from the original UNet and extending to UNet3+ with various backbone networks. We evaluate ResNet, Transformer-based, and State-space (Mamba) backbones, all initialized with pretrained weights. Surprisingly, despite the advances in modern architecture, ResNet-based models consistently outperform Transformer- and Mamba-based alternatives across multiple evaluation metrics. To further improve segmentation quality, we introduce attention mechanisms into the backbone and observe that incorporating the Convolutional Block Attention Module (CBAM) yields the best performance. ResNetUNet3+ with CBAM module not only produced the best overlap metrics with a Dice score of 0.755 and IoU of 0.662, but also achieved the most precise boundary delineation, evidenced by the lowest HD95 distance of 77.911. The model's superiority was further cemented by its leading overall accuracy of 0.925 and specificity of 0.926, showcasing its robust capability in accurately identifying both lesion and healthy tissue. To further enhance interpretability, Grad-CAM visualizations were employed to highlight the region's most influential predictions, providing insights into its decision-making process. These findings demonstrate that classical ResNet architecture, when combined with modern attention modules, remain highly competitive for medical image segmentation tasks, offering a promising direction for liver tumor detection in clinical practice.
comment: 27 pages, 8 figures
☆ FaCT: Faithful Concept Traces for Explaining Neural Network Decisions NeurIPS 2025
Deep networks have shown remarkable performance across a wide range of tasks, yet getting a global concept-level understanding of how they function remains a key challenge. Many post-hoc concept-based approaches have been introduced to understand their workings, yet they are not always faithful to the model. Further, they make restrictive assumptions on the concepts a model learns, such as class-specificity, small spatial extent, or alignment to human expectations. In this work, we put emphasis on the faithfulness of such concept-based explanations and propose a new model with model-inherent mechanistic concept-explanations. Our concepts are shared across classes and, from any layer, their contribution to the logit and their input-visualization can be faithfully traced. We also leverage foundation models to propose a new concept-consistency metric, C$^2$-Score, that can be used to evaluate concept-based methods. We show that, compared to prior work, our concepts are quantitatively more consistent and users find our concepts to be more interpretable, all while retaining competitive ImageNet performance.
comment: Accepted to NeurIPS 2025; Code is available at https://github.com/m-parchami/FaCT
☆ SPADE: Sparsity Adaptive Depth Estimator for Zero-Shot, Real-Time, Monocular Depth Estimation in Underwater Environments
Underwater infrastructure requires frequent inspection and maintenance due to harsh marine conditions. Current reliance on human divers or remotely operated vehicles is limited by perceptual and operational challenges, especially around complex structures or in turbid water. Enhancing the spatial awareness of underwater vehicles is key to reducing piloting risks and enabling greater autonomy. To address these challenges, we present SPADE: SParsity Adaptive Depth Estimator, a monocular depth estimation pipeline that combines pre-trained relative depth estimator with sparse depth priors to produce dense, metric scale depth maps. Our two-stage approach first scales the relative depth map with the sparse depth points, then refines the final metric prediction with our proposed Cascade Conv-Deformable Transformer blocks. Our approach achieves improved accuracy and generalisation over state-of-the-art baselines and runs efficiently at over 15 FPS on embedded hardware, promising to support practical underwater inspection and intervention. This work has been submitted to IEEE Journal of Oceanic Engineering Special Issue of AUV 2026.
☆ More than a Moment: Towards Coherent Sequences of Audio Descriptions
Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
☆ Instance-Level Composed Image Retrieval NeurIPS 2025
The progress of composed image retrieval (CIR), a popular research direction in image retrieval, where a combined visual and textual query is used, is held back by the absence of high-quality training and evaluation data. We introduce a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an instance-level class definition. The goal is to retrieve images that contain the same particular object as the visual query, presented under a variety of modifications defined by textual queries. Its design and curation process keep the dataset compact to facilitate future research, while maintaining its challenge-comparable to retrieval among more than 40M random distractors-through a semi-automated selection of hard negatives. To overcome the challenge of obtaining clean, diverse, and suitable training data, we leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC. The method separately estimates query-image-to-image and query-text-to-image similarities, performing late fusion to upweight images that satisfy both queries, while down-weighting those that exhibit high similarity with only one of the two. Each individual similarity is further improved by a set of components that are simple and intuitive. BASIC sets a new state of the art on i-CIR but also on existing CIR datasets that follow a semantic-level class definition. Project page: https://vrg.fel.cvut.cz/icir/.
comment: NeurIPS 2025
Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers
Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP) - based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via traditional federated averaging technique on the same. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.
☆ 3D CT-Based Coronary Calcium Assessment: A Feature-Driven Machine Learning Framework MICCAI
Coronary artery calcium (CAC) scoring plays a crucial role in the early detection and risk stratification of coronary artery disease (CAD). In this study, we focus on non-contrast coronary computed tomography angiography (CCTA) scans, which are commonly used for early calcification detection in clinical settings. To address the challenge of limited annotated data, we propose a radiomics-based pipeline that leverages pseudo-labeling to generate training labels, thereby eliminating the need for expert-defined segmentations. Additionally, we explore the use of pretrained foundation models, specifically CT-FM and RadImageNet, to extract image features, which are then used with traditional classifiers. We compare the performance of these deep learning features with that of radiomics features. Evaluation is conducted on a clinical CCTA dataset comprising 182 patients, where individuals are classified into two groups: zero versus non-zero calcium scores. We further investigate the impact of training on non-contrast datasets versus combined contrast and non-contrast datasets, with testing performed only on non contrast scans. Results show that radiomics-based models significantly outperform CNN-derived embeddings from foundation models (achieving 84% accuracy and p<0.05), despite the unavailability of expert annotations.
comment: 11 pages, 2 Figures, MICCAI AMAI 2025 workshop, to be published in Volume 16206 of the Lecture Notes in Computer Science series
☆ Informative Sample Selection Model for Skeleton-based Action Recognition with Limited Training Samples
Skeleton-based human action recognition aims to classify human skeletal sequences, which are spatiotemporal representations of actions, into predefined categories. To reduce the reliance on costly annotations of skeletal sequences while maintaining competitive recognition accuracy, the task of 3D Action Recognition with Limited Training Samples, also known as semi-supervised 3D Action Recognition, has been proposed. In addition, active learning, which aims to proactively select the most informative unlabeled samples for annotation, has been explored in semi-supervised 3D Action Recognition for training sample selection. Specifically, researchers adopt an encoder-decoder framework to embed skeleton sequences into a latent space, where clustering information, combined with a margin-based selection strategy using a multi-head mechanism, is utilized to identify the most informative sequences in the unlabeled set for annotation. However, the most representative skeleton sequences may not necessarily be the most informative for the action recognizer, as the model may have already acquired similar knowledge from previously seen skeleton samples. To solve it, we reformulate Semi-supervised 3D action recognition via active learning from a novel perspective by casting it as a Markov Decision Process (MDP). Built upon the MDP framework and its training paradigm, we train an informative sample selection model to intelligently guide the selection of skeleton sequences for annotation. To enhance the representational capacity of the factors in the state-action pairs within our method, we project them from Euclidean space to hyperbolic space. Furthermore, we introduce a meta tuning strategy to accelerate the deployment of our method in real-world scenarios. Extensive experiments on three 3D action recognition benchmarks demonstrate the effectiveness of our method.
comment: Accepted by IEEE Transactions on Image Processing (TIP), 2025
☆ StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA
The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) Static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) The absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, We introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives object state transition-based reasoning paths using large language models, and ensures logical coherence through human-verified validation. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT and its construction toolkit can be accessed at https://github.com/Fleeting-hyh/StreamingCoT.
☆ MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
comment: Accepted by SenSys 2026
☆ Prototype-Driven Adaptation for Few-Shot Object Detection
Few-shot object detection (FSOD) often suffers from base-class bias and unstable calibration when only a few novel samples are available. We propose Prototype-Driven Alignment (PDA), a lightweight, plug-in metric head for DeFRCN that provides a prototype-based "second opinion" complementary to the linear classifier. PDA maintains support-only prototypes in a learnable identity-initialized projection space and optionally applies prototype-conditioned RoI alignment to reduce geometric mismatch. During fine-tuning, prototypes can be adapted via exponential moving average(EMA) updates on labeled foreground RoIs-without introducing class-specific parameters-and are frozen at inference to ensure strict protocol compliance. PDA employs a best-of-K matching scheme to capture intra-class multi-modality and temperature-scaled fusion to combine metric similarities with detector logits. Experiments on VOC FSOD and GFSOD benchmarks show that PDA consistently improves novel-class performance with minimal impact on base classes and negligible computational overhead.
comment: 7 pages,1 figure,2 tables,Preprint
☆ Seeing Clearly and Deeply: An RGBD Imaging Approach with a Bio-inspired Monocentric Design
Achieving high-fidelity, compact RGBD imaging presents a dual challenge: conventional compact optics struggle with RGB sharpness across the entire depth-of-field, while software-only Monocular Depth Estimation (MDE) is an ill-posed problem reliant on unreliable semantic priors. While deep optics with elements like DOEs can encode depth, they introduce trade-offs in fabrication complexity and chromatic aberrations, compromising simplicity. To address this, we first introduce a novel bio-inspired all-spherical monocentric lens, around which we build the Bionic Monocentric Imaging (BMI) framework, a holistic co-design. This optical design naturally encodes depth into its depth-varying Point Spread Functions (PSFs) without requiring complex diffractive or freeform elements. We establish a rigorous physically-based forward model to generate a synthetic dataset by precisely simulating the optical degradation process. This simulation pipeline is co-designed with a dual-head, multi-scale reconstruction network that employs a shared encoder to jointly recover a high-fidelity All-in-Focus (AiF) image and a precise depth map from a single coded capture. Extensive experiments validate the state-of-the-art performance of the proposed framework. In depth estimation, the method attains an Abs Rel of 0.026 and an RMSE of 0.130, markedly outperforming leading software-only approaches and other deep optics systems. For image restoration, the system achieves an SSIM of 0.960 and a perceptual LPIPS score of 0.082, thereby confirming a superior balance between image fidelity and depth accuracy. This study illustrates that the integration of bio-inspired, fully spherical optics with a joint reconstruction algorithm constitutes an effective strategy for addressing the intrinsic challenges in high-performance compact RGBD imaging. Source code will be publicly available at https://github.com/ZongxiYu-ZJU/BMI.
comment: The source code will be publicly available at https://github.com/ZongxiYu-ZJU/BMI
☆ GaTector+: A Unified Head-free Framework for Gaze Object and Gaze Following Prediction
Gaze object detection and gaze following are fundamental tasks for interpreting human gaze behavior or intent. However, most previous methods usually solve these two tasks separately, and their prediction of gaze objects and gaze following typically depend on head-related prior knowledge during both the training phase and real-world deployment. This dependency necessitates an auxiliary network to extract head location, thus precluding joint optimization across the entire system and constraining the practical applicability. To this end, we propose GaTector+, a unified framework for gaze object detection and gaze following, which eliminates the dependence on the head-related priors during inference. Specifically, GaTector+ uses an expanded specific-general-specific feature extractor that leverages a shared backbone, which extracts general features for gaze following and object detection using the shared backbone while using specific blocks before and after the shared backbone to better consider the specificity of each sub-task. To obtain head-related knowledge without prior information, we first embed a head detection branch to predict the head of each person. Then, before regressing the gaze point, a head-based attention mechanism is proposed to fuse the sense feature and gaze feature with the help of head location. Since the suboptimization of the gaze point heatmap leads to the performance bottleneck, we propose an attention supervision mechanism to accelerate the learning of the gaze heatmap. Finally, we propose a novel evaluation metric, mean Similarity over Candidates (mSoC), for gaze object detection, which is more sensitive to variations between bounding boxes. The experimental results on multiple benchmark datasets demonstrate the effectiveness of our model in both gaze object detection and gaze following tasks.
☆ Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation NeurIPS 2025
Source-free domain adaptation (SFDA) is a challenging task that tackles domain shifts using only a pre-trained source model and unlabeled target data. Existing SFDA methods are restricted by the fundamental limitation of source-target domain discrepancy. Non-generation SFDA methods suffer from unreliable pseudo-labels in challenging scenarios with large domain discrepancies, while generation-based SFDA methods are evidently degraded due to enlarged domain discrepancies in creating pseudo-source data. To address this limitation, we propose a novel generation-based framework named Diffusion-Driven Progressive Target Manipulation (DPTM) that leverages unlabeled target data as references to reliably generate and progressively refine a pseudo-target domain for SFDA. Specifically, we divide the target samples into a trust set and a non-trust set based on the reliability of pseudo-labels to sufficiently and reliably exploit their information. For samples from the non-trust set, we develop a manipulation strategy to semantically transform them into the newly assigned categories, while simultaneously maintaining them in the target distribution via a latent diffusion model. Furthermore, we design a progressive refinement mechanism that progressively reduces the domain discrepancy between the pseudo-target domain and the real target domain via iterative refinement. Experimental results demonstrate that DPTM outperforms existing methods by a large margin and achieves state-of-the-art performance on four prevailing SFDA benchmark datasets with different scales. Remarkably, DPTM can significantly enhance the performance by up to 18.6% in scenarios with large source-target gaps.
comment: Accepted by NeurIPS 2025
☆ SynHLMA:Synthesizing Hand Language Manipulation for Articulated Object with Discrete Human Object Interaction Representation
Generating hand grasps with language instructions is a widely studied topic that benefits from embodied AI and VR/AR applications. While transferring into hand articulatied object interaction (HAOI), the hand grasps synthesis requires not only object functionality but also long-term manipulation sequence along the object deformation. This paper proposes a novel HAOI sequence generation framework SynHLMA, to synthesize hand language manipulation for articulated objects. Given a complete point cloud of an articulated object, we utilize a discrete HAOI representation to model each hand object interaction frame. Along with the natural language embeddings, the representations are trained by an HAOI manipulation language model to align the grasping process with its language description in a shared representation space. A joint-aware loss is employed to ensure hand grasps follow the dynamic variations of articulated object joints. In this way, our SynHLMA achieves three typical hand manipulation tasks for articulated objects of HAOI generation, HAOI prediction and HAOI interpolation. We evaluate SynHLMA on our built HAOI-lang dataset and experimental results demonstrate the superior hand grasp sequence generation performance comparing with state-of-the-art. We also show a robotics grasp application that enables dexterous grasps execution from imitation learning using the manipulation sequence provided by our SynHLMA. Our codes and datasets will be made publicly available.
☆ LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation
We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM driven part query refinement strategy. The code will be released here.
comment: 10 pages, 5 figures, 14 tables, Neurips 2025
☆ RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models
Real-time object detection has achieved substantial progress through meticulously designed architectures and optimization strategies. However, the pursuit of high-speed inference via lightweight network designs often leads to degraded feature representation, which hinders further performance improvements and practical on-device deployment. In this paper, we propose a cost-effective and highly adaptable distillation framework that harnesses the rapidly evolving capabilities of Vision Foundation Models (VFMs) to enhance lightweight object detectors. Given the significant architectural and learning objective disparities between VFMs and resource-constrained detectors, achieving stable and task-aligned semantic transfer is challenging. To address this, on one hand, we introduce a Deep Semantic Injector (DSI) module that facilitates the integration of high-level representations from VFMs into the deep layers of the detector. On the other hand, we devise a Gradient-guided Adaptive Modulation (GAM) strategy, which dynamically adjusts the intensity of semantic transfer based on gradient norm ratios. Without increasing deployment and inference overhead, our approach painlessly delivers striking and consistent performance gains across diverse DETR-based models, underscoring its practical utility for real-time detection. Our new model family, RT-DETRv4, achieves state-of-the-art results on COCO, attaining AP scores of 49.7/53.5/55.4/57.0 at corresponding speeds of 273/169/124/78 FPS.
☆ Mapping and Classification of Trees Outside Forests using Deep Learning
Trees Outside Forests (TOF) play an important role in agricultural landscapes by supporting biodiversity, sequestering carbon, and regulating microclimates. Yet, most studies have treated TOF as a single class or relied on rigid rule-based thresholds, limiting ecological interpretation and adaptability across regions. To address this, we evaluate deep learning for TOF classification using a newly generated dataset and high-resolution aerial imagery from four agricultural landscapes in Germany. Specifically, we compare convolutional neural networks (CNNs), vision transformers, and hybrid CNN-transformer models across six semantic segmentation architectures (ABCNet, LSKNet, FT-UNetFormer, DC-Swin, BANet, and U-Net) to map four categories of woody vegetation: Forest, Patch, Linear, and Tree, derived from previous studies and governmental products. Overall, the models achieved good classification accuracy across the four landscapes, with the FT-UNetFormer performing best (mean Intersection-over-Union 0.74; mean F1 score 0.84), underscoring the importance of spatial context understanding in TOF mapping and classification. Our results show good results for Forest and Linear class and reveal challenges particularly in classifying complex structures with high edge density, notably the Patch and Tree class. Our generalization experiments highlight the need for regionally diverse training data to ensure reliable large-scale mapping. The dataset and code are openly available at https://github.com/Moerizzy/TOFMapper
☆ VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations
Video aesthetic assessment, a vital area in multimedia computing, integrates computer vision with human cognition. Its progress is limited by the lack of standardized datasets and robust models, as the temporal dynamics of video and multimodal fusion challenges hinder direct application of image-based methods. This study introduces VADB, the largest video aesthetic database with 10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions, including overall and attribute-specific aesthetic scores, rich language comments and objective tags. We propose VADB-Net, a dual-modal pre-training framework with a two-stage training strategy, which outperforms existing video quality assessment models in scoring tasks and supports downstream video aesthetic assessment tasks. The dataset and source code are available at https://github.com/BestiVictory/VADB.
☆ DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis ICCV 2025
Recent advances in deep generative models have made it easier to manipulate face videos, raising significant concerns about their potential misuse for fraud and misinformation. Existing detectors often perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to their reliance on forgery-specific artifacts. In this work, we introduce DeepShield, a novel deepfake detection framework that balances local sensitivity and global generalization to improve robustness across unseen forgeries. DeepShield enhances the CLIP-ViT encoder through two key components: Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). LPG applies spatiotemporal artifact modeling and patch-wise supervision to capture fine-grained inconsistencies often overlooked by global models. GFD introduces domain feature augmentation, leveraging domain-bridging and boundary-expanding feature generation to synthesize diverse forgeries, mitigating overfitting and enhancing cross-domain adaptability. Through the integration of novel local and global analysis for deepfake detection, DeepShield outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks.
comment: ICCV 2025
☆ Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation
Expressions are fundamental to conveying human emotions. With the rapid advancement of AI-generated content (AIGC), realistic and expressive 3D facial animation has become increasingly crucial. Despite recent progress in speech-driven lip-sync for talking-face animation, generating emotionally expressive talking faces remains underexplored. A major obstacle is the scarcity of real emotional 3D talking-face datasets due to the high cost of data capture. To address this, we model facial animation driven by both speech and emotion as a linear additive problem. Leveraging a 3D talking-face dataset with neutral expressions (VOCAset) and a dataset of 3D expression sequences (Florence4D), we jointly learn a set of blendshapes driven by speech and emotion. We introduce a sparsity constraint loss to encourage disentanglement between the two types of blendshapes while allowing the model to capture inherent secondary cross-domain deformations present in the training data. The learned blendshapes can be further mapped to the expression and jaw pose parameters of the FLAME model, enabling the animation of 3D Gaussian avatars. Qualitative and quantitative experiments demonstrate that our method naturally generates talking faces with specified expressions while maintaining accurate lip synchronization. Perceptual studies further show that our approach achieves superior emotional expressivity compared to existing methods, without compromising lip-sync quality.
comment: 18 pages, 6 figures, accepted to ICXR 2025 conference
☆ Balanced conic rectified flow NeurIPS 2025
Rectified flow is a generative model that learns smooth transport mappings between two distributions through an ordinary differential equation (ODE). Unlike diffusion-based generative models, which require costly numerical integration of a generative ODE to sample images with state-of-the-art quality, rectified flow uses an iterative process called reflow to learn smooth and straight ODE paths. This allows for relatively simple and efficient generation of high-quality images. However, rectified flow still faces several challenges. 1) The reflow process requires a large number of generative pairs to preserve the target distribution, leading to significant computational costs. 2) Since the model is typically trained using only generated image pairs, its performance heavily depends on the 1-rectified flow model, causing it to become biased towards the generated data. In this work, we experimentally expose the limitations of the original rectified flow and propose a novel approach that incorporates real images into the training process. By preserving the ODE paths for real images, our method effectively reduces reliance on large amounts of generated data. Instead, we demonstrate that the reflow process can be conducted efficiently using a much smaller set of generated and real images. In CIFAR-10, we achieved significantly better FID scores, not only in one-step generation but also in full-step simulations, while using only of the generative pairs compared to the original method. Furthermore, our approach induces straighter paths and avoids saturation on generated images during reflow, leading to more robust ODE learning while preserving the distribution of real images.
comment: Main paper: 10 pages (total 40 pages including appendix), 5 figures. Accepted at NeurIPS 2025 (Poster). Acknowledgment: Supported by the NRF of Korea (RS-2023-00223062) and IITP grants (RS-2020-II201361, RS-2024-00439762) funded by the Korean government (MSIT)
☆ Aligning What You Separate: Denoised Patch Mixing for Source-Free Domain Adaptation in Medical Image Segmentation
Source-Free Domain Adaptation (SFDA) is emerging as a compelling solution for medical image segmentation under privacy constraints, yet current approaches often ignore sample difficulty and struggle with noisy supervision under domain shift. We present a new SFDA framework that leverages Hard Sample Selection and Denoised Patch Mixing to progressively align target distributions. First, unlabeled images are partitioned into reliable and unreliable subsets through entropy-similarity analysis, allowing adaptation to start from easy samples and gradually incorporate harder ones. Next, pseudo-labels are refined via Monte Carlo-based denoising masks, which suppress unreliable pixels and stabilize training. Finally, intra- and inter-domain objectives mix patches between subsets, transferring reliable semantics while mitigating noise. Experiments on benchmark datasets show consistent gains over prior SFDA and UDA methods, delivering more accurate boundary delineation and achieving state-of-the-art Dice and ASSD scores. Our study highlights the importance of progressive adaptation and denoised supervision for robust segmentation under domain shift.
comment: 5 pages, 3 figures
☆ MSF-Net: Multi-Stage Feature Extraction and Fusion for Robust Photometric Stereo
Photometric stereo is a technique aimed at determining surface normals through the utilization of shading cues derived from images taken under different lighting conditions. However, existing learning-based approaches often fail to accurately capture features at multiple stages and do not adequately promote interaction between these features. Consequently, these models tend to extract redundant features, especially in areas with intricate details such as wrinkles and edges. To tackle these issues, we propose MSF-Net, a novel framework for extracting information at multiple stages, paired with selective update strategy, aiming to extract high-quality feature information, which is critical for accurate normal construction. Additionally, we have developed a feature fusion module to improve the interplay among different features. Experimental results on the DiLiGenT benchmark show that our proposed MSF-Net significantly surpasses previous state-of-the-art methods in the accuracy of surface normal estimation.
☆ U-CAN: Unsupervised Point Cloud Denoising with Consistency-Aware Noise2Noise Matching NeurIPS 2025
Point clouds captured by scanning sensors are often perturbed by noise, which have a highly negative impact on downstream tasks (e.g. surface reconstruction and shape understanding). Previous works mostly focus on training neural networks with noisy-clean point cloud pairs for learning denoising priors, which requires extensively manual efforts. In this work, we introduce U-CAN, an Unsupervised framework for point cloud denoising with Consistency-Aware Noise2Noise matching. Specifically, we leverage a neural network to infer a multi-step denoising path for each point of a shape or scene with a noise to noise matching scheme. We achieve this by a novel loss which enables statistical reasoning on multiple noisy point cloud observations. We further introduce a novel constraint on the denoised geometry consistency for learning consistency-aware denoising patterns. We justify that the proposed constraint is a general term which is not limited to 3D domain and can also contribute to the area of 2D image denoising. Our evaluations under the widely used benchmarks in point cloud denoising, upsampling and image denoising show significant improvement over the state-of-the-art unsupervised methods, where U-CAN also produces comparable results with the supervised methods.
comment: Accepted by NeurIPS 2025. Project page: https://gloriasze.github.io/U-CAN/
☆ AI-Powered Early Detection of Critical Diseases using Image Processing and Audio Analysis
Early diagnosis of critical diseases can significantly improve patient survival and reduce treatment costs. However, existing diagnostic techniques are often costly, invasive, and inaccessible in low-resource regions. This paper presents a multimodal artificial intelligence (AI) diagnostic framework integrating image analysis, thermal imaging, and audio signal processing for early detection of three major health conditions: skin cancer, vascular blood clots, and cardiopulmonary abnormalities. A fine-tuned MobileNetV2 convolutional neural network was trained on the ISIC 2019 dataset for skin lesion classification, achieving 89.3% accuracy, 91.6% sensitivity, and 88.2% specificity. A support vector machine (SVM) with handcrafted features was employed for thermal clot detection, achieving 86.4% accuracy (AUC = 0.89) on synthetic and clinical data. For cardiopulmonary analysis, lung and heart sound datasets from PhysioNet and Pascal were processed using Mel-Frequency Cepstral Coefficients (MFCC) and classified via Random Forest, reaching 87.2% accuracy and 85.7% sensitivity. Comparative evaluation against state-of-the-art models demonstrates that the proposed system achieves competitive results while remaining lightweight and deployable on low-cost devices. The framework provides a promising step toward scalable, real-time, and accessible AI-based pre-diagnostic healthcare solutions.
☆ Mask-Robust Face Verification for Online Learning via YOLOv5 and Residual Networks
In the contemporary landscape, the fusion of information technology and the rapid advancement of artificial intelligence have ushered school education into a transformative phase characterized by digitization and heightened intelligence. Concurrently, the global paradigm shift caused by the Covid-19 pandemic has catalyzed the evolution of e-learning, accentuating its significance. Amidst these developments, one pivotal facet of the online education paradigm that warrants attention is the authentication of identities within the digital learning sphere. Within this context, our study delves into a solution for online learning authentication, utilizing an enhanced convolutional neural network architecture, specifically the residual network model. By harnessing the power of deep learning, this technological approach aims to galvanize the ongoing progress of online education, while concurrently bolstering its security and stability. Such fortification is imperative in enabling online education to seamlessly align with the swift evolution of the educational landscape. This paper's focal proposition involves the deployment of the YOLOv5 network, meticulously trained on our proprietary dataset. This network is tasked with identifying individuals' faces culled from images captured by students' open online cameras. The resultant facial information is then channeled into the residual network to extract intricate features at a deeper level. Subsequently, a comparative analysis of Euclidean distances against students' face databases is performed, effectively ascertaining the identity of each student.
comment: 9 pages, 10 figures
☆ Test-Time Adaptive Object Detection with Foundation Model NeurIPS 2025
In recent years, test-time adaptive object detection has attracted increasing attention due to its unique advantages in online domain adaptation, which aligns more closely with real-world application scenarios. However, existing approaches heavily rely on source-derived statistical characteristics while making the strong assumption that the source and target domains share an identical category space. In this paper, we propose the first foundation model-powered test-time adaptive object detection method that eliminates the need for source data entirely and overcomes traditional closed-set limitations. Specifically, we design a Multi-modal Prompt-based Mean-Teacher framework for vision-language detector-driven test-time adaptation, which incorporates text and visual prompt tuning to adapt both language and vision representation spaces on the test data in a parameter-efficient manner. Correspondingly, we propose a Test-time Warm-start strategy tailored for the visual prompts to effectively preserve the representation capability of the vision branch. Furthermore, to guarantee high-quality pseudo-labels in every test batch, we maintain an Instance Dynamic Memory (IDM) module that stores high-quality pseudo-labels from previous test samples, and propose two novel strategies-Memory Enhancement and Memory Hallucination-to leverage IDM's high-quality instances for enhancing original predictions and hallucinating images without available pseudo-labels, respectively. Extensive experiments on cross-corruption and cross-dataset benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods, and can adapt to arbitrary cross-domain and cross-category target data. Code is available at https://github.com/gaoyingjay/ttaod_foundation.
comment: Accepted by NeurIPS 2025
☆ Classifier Enhancement Using Extended Context and Domain Experts for Semantic Segmentation
Prevalent semantic segmentation methods generally adopt a vanilla classifier to categorize each pixel into specific classes. Although such a classifier learns global information from the training data, this information is represented by a set of fixed parameters (weights and biases). However, each image has a different class distribution, which prevents the classifier from addressing the unique characteristics of individual images. At the dataset level, class imbalance leads to segmentation results being biased towards majority classes, limiting the model's effectiveness in identifying and segmenting minority class regions. In this paper, we propose an Extended Context-Aware Classifier (ECAC) that dynamically adjusts the classifier using global (dataset-level) and local (image-level) contextual information. Specifically, we leverage a memory bank to learn dataset-level contextual information of each class, incorporating the class-specific contextual information from the current image to improve the classifier for precise pixel labeling. Additionally, a teacher-student network paradigm is adopted, where the domain expert (teacher network) dynamically adjusts contextual information with ground truth and transfers knowledge to the student network. Comprehensive experiments illustrate that the proposed ECAC can achieve state-of-the-art performance across several datasets, including ADE20K, COCO-Stuff10K, and Pascal-Context.
comment: Accepted at IEEE TRANSACTIONS ON MULTIMEDIA (TMM)
☆ $D^2GS$: Dense Depth Regularization for LiDAR-free Urban Scene Reconstruction
Recently, Gaussian Splatting (GS) has shown great potential for urban scene reconstruction in the field of autonomous driving. However, current urban scene reconstruction methods often depend on multimodal sensors as inputs, \textit{i.e.} LiDAR and images. Though the geometry prior provided by LiDAR point clouds can largely mitigate ill-posedness in reconstruction, acquiring such accurate LiDAR data is still challenging in practice: i) precise spatiotemporal calibration between LiDAR and other sensors is required, as they may not capture data simultaneously; ii) reprojection errors arise from spatial misalignment when LiDAR and cameras are mounted at different locations. To avoid the difficulty of acquiring accurate LiDAR depth, we propose $D^2GS$, a LiDAR-free urban scene reconstruction framework. In this work, we obtain geometry priors that are as effective as LiDAR while being denser and more accurate. $\textbf{First}$, we initialize a dense point cloud by back-projecting multi-view metric depth predictions. This point cloud is then optimized by a Progressive Pruning strategy to improve the global consistency. $\textbf{Second}$, we jointly refine Gaussian geometry and predicted dense metric depth via a Depth Enhancer. Specifically, we leverage diffusion priors from a depth foundation model to enhance the depth maps rendered by Gaussians. In turn, the enhanced depths provide stronger geometric constraints during Gaussian training. $\textbf{Finally}$, we improve the accuracy of ground geometry by constraining the shape and normal attributes of Gaussians within road regions. Extensive experiments on the Waymo dataset demonstrate that our method consistently outperforms state-of-the-art methods, producing more accurate geometry even when compared with those using ground-truth LiDAR data.
☆ A Study on Inference Latency for Vision Transformers on Mobile Devices
Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.
comment: To appear in Springer LNICST, volume 663, Proceedings of VALUETOOLS 2024
Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning
We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
comment: This work is to appear in the Proceedings of MICAD 2025, the 6th International Conference on Medical Imaging and Computer-Aided Diagnosis
☆ Target-Guided Bayesian Flow Networks for Quantitatively Constrained CAD Generation
Deep generative models, such as diffusion models, have shown promising progress in image generation and audio generation via simplified continuity assumptions. However, the development of generative modeling techniques for generating multi-modal data, such as parametric CAD sequences, still lags behind due to the challenges in addressing long-range constraints and parameter sensitivity. In this work, we propose a novel framework for quantitatively constrained CAD generation, termed Target-Guided Bayesian Flow Network (TGBFN). For the first time, TGBFN handles the multi-modality of CAD sequences (i.e., discrete commands and continuous parameters) in a unified continuous and differentiable parameter space rather than in the discrete data space. In addition, TGBFN penetrates the parameter update kernel and introduces a guided Bayesian flow to control the CAD properties. To evaluate TGBFN, we construct a new dataset for quantitatively constrained CAD generation. Extensive comparisons across single-condition and multi-condition constrained generation tasks demonstrate that TGBFN achieves state-of-the-art performance in generating high-fidelity, condition-aware CAD sequences. The code is available at https://github.com/scu-zwh/TGBFN.
☆ Towards Real-Time Inference of Thin Liquid Film Thickness Profiles from Interference Patterns Using Vision Transformers
Thin film interferometry is a powerful technique for non-invasively measuring liquid film thickness with applications in ophthalmology, but its clinical translation is hindered by the challenges in reconstructing thickness profiles from interference patterns - an ill-posed inverse problem complicated by phase periodicity, imaging noise and ambient artifacts. Traditional reconstruction methods are either computationally intensive, sensitive to noise, or require manual expert analysis, which is impractical for real-time diagnostics. To address this challenge, here we present a vision transformer-based approach for real-time inference of thin liquid film thickness profiles directly from isolated interferograms. Trained on a hybrid dataset combining physiologically-relevant synthetic and experimental tear film data, our model leverages long-range spatial correlations to resolve phase ambiguities and reconstruct temporally coherent thickness profiles in a single forward pass from dynamic interferograms acquired in vivo and ex vivo. The network demonstrates state-of-the-art performance on noisy, rapidly-evolving films with motion artifacts, overcoming limitations of conventional phase-unwrapping and iterative fitting methods. Our data-driven approach enables automated, consistent thickness reconstruction at real-time speeds on consumer hardware, opening new possibilities for continuous monitoring of pre-lens ocular tear films and non-invasive diagnosis of conditions such as the dry eye disease.
comment: 6 pages, 2 figures, will be updated
☆ EA3D: Online Open-World 3D Object Extraction from Streaming Videos NeurIPS 2025
Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.
comment: The Thirty-Ninth Annual Conference on Neural Information Processing Systems(NeurIPS 2025)
☆ Revisiting Reconstruction-based AI-generated Image Detection: A Geometric Perspective
The rise of generative Artificial Intelligence (AI) has made detecting AI-generated images a critical challenge for ensuring authenticity. Existing reconstruction-based methods lack theoretical foundations and on empirical heuristics, limiting interpretability and reliability. In this paper, we introduce the Jacobian-Spectral Lower Bound for reconstruction error from a geometric perspective, showing that real images off the reconstruction manifold exhibit a non-trivial error lower bound, while generated images on the manifold have near-zero error. Furthermore, we reveal the limitations of existing methods that rely on static reconstruction error from a single pass. These methods often fail when some real images exhibit lower error than generated ones. This counterintuitive behavior reduces detection accuracy and requires data-specific threshold tuning, limiting their applicability in real-world scenarios. To address these challenges, we propose ReGap, a training-free method that computes dynamic reconstruction error by leveraging structured editing operations to introduce controlled perturbations. This enables measuring error changes before and after editing, improving detection accuracy by enhancing error separation. Experimental results show that our method outperforms existing baselines, exhibits robustness to common post-processing operations and generalizes effectively across diverse conditions.
☆ DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications
Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.
☆ Region-CAM: Towards Accurate Object Regions in Class Activation Maps for Weakly Supervised Learning Tasks
Class Activation Mapping (CAM) methods are widely applied in weakly supervised learning tasks due to their ability to highlight object regions. However, conventional CAM methods highlight only the most discriminative regions of the target. These highlighted regions often fail to cover the entire object and are frequently misaligned with object boundaries, thereby limiting the performance of downstream weakly supervised learning tasks, particularly Weakly Supervised Semantic Segmentation (WSSS), which demands pixel-wise accurate activation maps to get the best results. To alleviate the above problems, we propose a novel activation method, Region-CAM. Distinct from network feature weighting approaches, Region-CAM generates activation maps by extracting semantic information maps (SIMs) and performing semantic information propagation (SIP) by considering both gradients and features in each of the stages of the baseline classification model. Our approach highlights a greater proportion of object regions while ensuring activation maps to have precise boundaries that align closely with object edges. Region-CAM achieves 60.12% and 58.43% mean intersection over union (mIoU) using the baseline model on the PASCAL VOC training and validation datasets, respectively, which are improvements of 13.61% and 13.13% over the original CAM (46.51% and 45.30%). On the MS COCO validation set, Region-CAM achieves 36.38%, a 16.23% improvement over the original CAM (20.15%). We also demonstrate the superiority of Region-CAM in object localization tasks, using the ILSVRC2012 validation set. Region-CAM achieves 51.7% in Top-1 Localization accuracy Loc1. Compared with LayerCAM, an activation method designed for weakly supervised object localization, Region-CAM achieves 4.5% better performance in Loc1.
comment: Preprint for journal paper
☆ AtlasGS: Atlanta-world Guided Surface Reconstruction with Implicit Structured Gaussians NeurIPS 2025
3D reconstruction of indoor and urban environments is a prominent research topic with various downstream applications. However, existing geometric priors for addressing low-texture regions in indoor and urban settings often lack global consistency. Moreover, Gaussian Splatting and implicit SDF fields often suffer from discontinuities or exhibit computational inefficiencies, resulting in a loss of detail. To address these issues, we propose an Atlanta-world guided implicit-structured Gaussian Splatting that achieves smooth indoor and urban scene reconstruction while preserving high-frequency details and rendering efficiency. By leveraging the Atlanta-world model, we ensure the accurate surface reconstruction for low-texture regions, while the proposed novel implicit-structured GS representations provide smoothness without sacrificing efficiency and high-frequency details. Specifically, we propose a semantic GS representation to predict the probability of all semantic regions and deploy a structure plane regularization with learnable plane indicators for global accurate surface reconstruction. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in both indoor and urban scenes, delivering superior surface reconstruction quality.
comment: 18 pages, 11 figures. NeurIPS 2025; Project page: https://zju3dv.github.io/AtlasGS/
☆ Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection NeurIPS 2025
Zero-shot Human-Object Interaction detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding. We further apply Gaussian perturbation to encourage the prompts to capture diverse visual variations of a verb. Second, we retrieve region-specific concepts from the human, object, and union regions. These are used to augment the diversity-aware prompt embeddings, yielding region-aware prompts that enhance verb-level discrimination. Experiments on the HICO-DET benchmark demonstrate that our method achieves state-of-the-art performance under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement. Code is available at https://github.com/mlvlab/VDRP.
comment: Accepted by NeurIPS 2025
☆ PSTF-AttControl: Per-Subject-Tuning-Free Personalized Image Generation with Controllable Face Attributes
Recent advancements in personalized image generation have significantly improved facial identity preservation, particularly in fields such as entertainment and social media. However, existing methods still struggle to achieve precise control over facial attributes in a per-subject-tuning-free (PSTF) way. Tuning-based techniques like PreciseControl have shown promise by providing fine-grained control over facial features, but they often require extensive technical expertise and additional training data, limiting their accessibility. In contrast, PSTF approaches simplify the process by enabling image generation from a single facial input, but they lack precise control over facial attributes. In this paper, we introduce a novel, PSTF method that enables both precise control over facial attributes and high-fidelity preservation of facial identity. Our approach utilizes a face recognition model to extract facial identity features, which are then mapped into the $W^+$ latent space of StyleGAN2 using the e4e encoder. We further enhance the model with a Triplet-Decoupled Cross-Attention module, which integrates facial identity, attribute features, and text embeddings into the UNet architecture, ensuring clean separation of identity and attribute information. Trained on the FFHQ dataset, our method allows for the generation of personalized images with fine-grained control over facial attributes, while without requiring additional fine-tuning or training data for individual identities. We demonstrate that our approach successfully balances personalization with precise facial attribute control, offering a more efficient and user-friendly solution for high-quality, adaptable facial image synthesis. The code is publicly available at https://github.com/UnicomAI/PSTF-AttControl.
comment: Accepted by Image and Vision Computing (18 pages, 8 figures)
☆ Neighborhood Feature Pooling for Remote Sensing Image Classification WACV 2026
In this work, we propose neighborhood feature pooling (NFP) as a novel texture feature extraction method for remote sensing image classification. The NFP layer captures relationships between neighboring inputs and efficiently aggregates local similarities across feature dimensions. Implemented using convolutional layers, NFP can be seamlessly integrated into any network. Results comparing the baseline models and the NFP method indicate that NFP consistently improves performance across diverse datasets and architectures while maintaining minimal parameter overhead.
comment: 9 pages, 5 figures. Accepted to WACV 2026 (Winter Conference on Applications of Computer Vision)
☆ Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments
Zero-shot scene understanding in real-world settings presents major challenges due to the complexity and variability of natural scenes, where models must recognize new objects, actions, and contexts without prior labeled examples. This work proposes a vision-language integration framework that unifies pre-trained visual encoders (e.g., CLIP, ViT) and large language models (e.g., GPT-based architectures) to achieve semantic alignment between visual and textual modalities. The goal is to enable robust zero-shot comprehension of scenes by leveraging natural language as a bridge to generalize over unseen categories and contexts. Our approach develops a unified model that embeds visual inputs and textual prompts into a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation. Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets demonstrate significant gains over state-of-the-art zero-shot models in object recognition, activity detection, and scene captioning. The proposed system achieves up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics, highlighting the effectiveness of cross-modal alignment and language grounding in enhancing generalization for real-world scene understanding.
comment: Preprint under review at IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
☆ DRIP: Dynamic patch Reduction via Interpretable Pooling
Recently, the advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, owing to the large-scale and hence expensive pretraining, the efficiency concern has discouraged researchers from attempting to pretrain a vision language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification/zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact into scientific domains.
☆ Auto3DSeg for Brain Tumor Segmentation from 3D MRI in BraTS 2023 Challenge
In this work, we describe our solution to the BraTS 2023 cluster of challenges using Auto3DSeg from MONAI. We participated in all 5 segmentation challenges, and achieved the 1st place results in three of them: Brain Metastasis, Brain Meningioma, BraTS-Africa challenges, and the 2nd place results in the remaining two: Adult and Pediatic Glioma challenges.
comment: BraTS23 winner
☆ Breast Cancer VLMs: Clinically Practical Vision-Language Train-Inference Models ICCV 2025
Breast cancer remains the most commonly diagnosed malignancy among women in the developed world. Early detection through mammography screening plays a pivotal role in reducing mortality rates. While computer-aided diagnosis (CAD) systems have shown promise in assisting radiologists, existing approaches face critical limitations in clinical deployment - particularly in handling the nuanced interpretation of multi-modal data and feasibility due to the requirement of prior clinical history. This study introduces a novel framework that synergistically combines visual features from 2D mammograms with structured textual descriptors derived from easily accessible clinical metadata and synthesized radiological reports through innovative tokenization modules. Our proposed methods in this study demonstrate that strategic integration of convolutional neural networks (ConvNets) with language representations achieves superior performance to vision transformer-based models while handling high-resolution images and enabling practical deployment across diverse populations. By evaluating it on multi-national cohort screening mammograms, our multi-modal approach achieves superior performance in cancer detection and calcification identification compared to unimodal baselines, with particular improvements. The proposed method establishes a new paradigm for developing clinically viable VLM-based CAD systems that effectively leverage imaging data and contextual patient information through effective fusion mechanisms.
comment: Accepted to Computer Vision for Automated Medical Diagnosis (CVAMD) Workshop at ICCV 2025
☆ Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders NeurIPS 2025
Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. This design incorporates a temporal attention in vision encoder, enabling the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in action recognition. We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs. Project page and code are available at: https://alirasekh.github.io/STAVEQ2/.
comment: Accepted to NeurIPS 2025
☆ Groupwise Registration with Physics-Informed Test-Time Adaptation on Multi-parametric Cardiac MRI
Multiparametric mapping MRI has become a viable tool for myocardial tissue characterization. However, misalignment between multiparametric maps makes pixel-wise analysis challenging. To address this challenge, we developed a generalizable physics-informed deep-learning model using test-time adaptation to enable group image registration across contrast weighted images acquired from multiple physical models (e.g., a T1 mapping model and T2 mapping model). The physics-informed adaptation utilized the synthetic images from specific physics model as registration reference, allows for transductive learning for various tissue contrast. We validated the model in healthy volunteers with various MRI sequences, demonstrating its improvement for multi-modal registration with a wide range of image contrast variability.
☆ Climate Adaptation-Aware Flood Prediction for Coastal Cities Using Deep Learning
Climate change and sea-level rise (SLR) pose escalating threats to coastal cities, intensifying the need for efficient and accurate methods to predict potential flood hazards. Traditional physics-based hydrodynamic simulators, although precise, are computationally expensive and impractical for city-scale coastal planning applications. Deep Learning (DL) techniques offer promising alternatives, however, they are often constrained by challenges such as data scarcity and high-dimensional output requirements. Leveraging a recently proposed vision-based, low-resource DL framework, we develop a novel, lightweight Convolutional Neural Network (CNN)-based model designed to predict coastal flooding under variable SLR projections and shoreline adaptation scenarios. Furthermore, we demonstrate the ability of the model to generalize across diverse geographical contexts by utilizing datasets from two distinct regions: Abu Dhabi and San Francisco. Our findings demonstrate that the proposed model significantly outperforms state-of-the-art methods, reducing the mean absolute error (MAE) in predicted flood depth maps on average by nearly 20%. These results highlight the potential of our approach to serve as a scalable and practical tool for coastal flood management, empowering decision-makers to develop effective mitigation strategies in response to the growing impacts of climate change. Project Page: https://caspiannet.github.io/
comment: Submitted to Hydrology and Earth System Sciences
☆ CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments
Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks: anomaly description, explanation, and justification; with fine-grained annotations for visual grounding and categorizing anomalies based on their visual manifestations, their complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.
☆ DARTS: A Drone-Based AI-Powered Real-Time Traffic Incident Detection System
Rapid and reliable incident detection is critical for reducing crash-related fatalities, injuries, and congestion. However, conventional methods, such as closed-circuit television, dashcam footage, and sensor-based detection, separate detection from verification, suffer from limited flexibility, and require dense infrastructure or high penetration rates, restricting adaptability and scalability to shifting incident hotspots. To overcome these challenges, we developed DARTS, a drone-based, AI-powered real-time traffic incident detection system. DARTS integrates drones' high mobility and aerial perspective for adaptive surveillance, thermal imaging for better low-visibility performance and privacy protection, and a lightweight deep learning framework for real-time vehicle trajectory extraction and incident detection. The system achieved 99% detection accuracy on a self-collected dataset and supports simultaneous online visual verification, severity assessment, and incident-induced congestion propagation monitoring via a web-based interface. In a field test on Interstate 75 in Florida, DARTS detected and verified a rear-end collision 12 minutes earlier than the local transportation management center and monitored incident-induced congestion propagation, suggesting potential to support faster emergency response and enable proactive traffic control to reduce congestion and secondary crash risk. Crucially, DARTS's flexible deployment architecture reduces dependence on frequent physical patrols, indicating potential scalability and cost-effectiveness for use in remote areas and resource-constrained settings. This study presents a promising step toward a more flexible and integrated real-time traffic incident detection system, with significant implications for the operational efficiency and responsiveness of modern transportation management.
comment: Preprint version. This manuscript is currently under review at Transportation Research Part C: Emerging Technologies. The PDF corresponds to the version submitted in June 2025. The main findings of this work were recognized with the Best Intelligent Transportation Systems Paper Award at the 2025 TRB Annual Meeting
☆ Larger Hausdorff Dimension in Scanning Pattern Facilitates Mamba-Based Methods in Low-Light Image Enhancement
We propose an innovative enhancement to the Mamba framework by increasing the Hausdorff dimension of its scanning pattern through a novel Hilbert Selective Scan mechanism. This mechanism explores the feature space more effectively, capturing intricate fine-scale details and improving overall coverage. As a result, it mitigates information inconsistencies while refining spatial locality to better capture subtle local interactions without sacrificing the model's ability to handle long-range dependencies. Extensive experiments on publicly available benchmarks demonstrate that our approach significantly improves both the quantitative metrics and qualitative visual fidelity of existing Mamba-based low-light image enhancement methods, all while reducing computational resource consumption and shortening inference time. We believe that this refined strategy not only advances the state-of-the-art in low-light image enhancement but also holds promise for broader applications in fields that leverage Mamba-based techniques.
☆ Fine-tuning Segment Anything for Real-Time Tumor Tracking in Cine-MRI
In this work, we address the TrackRAD2025 challenge of real-time tumor tracking in cine-MRI sequences of the thoracic and abdominal regions under strong data scarcity constraints. Two complementary strategies were explored: (i) unsupervised registration with the IMPACT similarity metric and (ii) foundation model-based segmentation leveraging SAM 2.1 and its recent variants through prompt-based interaction. Due to the one-second runtime constraint, the SAM-based method was ultimately selected. The final configuration used SAM2.1 b+ with mask-based prompts from the first annotated slice, fine-tuned solely on the small labeled subset from TrackRAD2025. Training was configured to minimize overfitting, using 1024x1024 patches (batch size 1), standard augmentations, and a balanced Dice + IoU loss. A low uniform learning rate (0.0001) was applied to all modules (prompt encoder, decoder, Hiera backbone) to preserve generalization while adapting to annotator-specific styles. Training lasted 300 epochs (~12h on RTX A6000, 48GB). The same inference strategy was consistently applied across all anatomical sites and MRI field strengths. Test-time augmentation was considered but ultimately discarded due to negligible performance gains. The final model was selected based on the highest Dice Similarity Coefficient achieved on the validation set after fine-tuning. On the hidden test set, the model reached a Dice score of 0.8794, ranking 6th overall in the TrackRAD2025 challenge. These results highlight the strong potential of foundation models for accurate and real-time tumor tracking in MRI-guided radiotherapy.
comment: Paper for the Trackrad2025 challenge, Team BreizhTrack
☆ Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer
Reconstructing images seen by people from their fMRI brain recordings provides a non-invasive window into the human brain. Despite recent progress enabled by diffusion models, current methods often lack faithfulness to the actual seen images. We present "Brain-IT", a brain-inspired approach that addresses this challenge through a Brain Interaction Transformer (BIT), allowing effective interactions between clusters of functionally-similar brain-voxels. These functional-clusters are shared by all subjects, serving as building blocks for integrating information both within and across brains. All model components are shared by all clusters & subjects, allowing efficient training with a limited amount of data. To guide the image reconstruction, BIT predicts two complementary localized patch-level image features: (i)high-level semantic features which steer the diffusion model toward the correct semantic content of the image; and (ii)low-level structural features which help to initialize the diffusion process with the correct coarse layout of the image. BIT's design enables direct flow of information from brain-voxel clusters to localized image features. Through these principles, our method achieves image reconstructions from fMRI that faithfully reconstruct the seen images, and surpass current SotA approaches both visually and by standard objective metrics. Moreover, with only 1-hour of fMRI data from a new subject, we achieve results comparable to current methods trained on full 40-hour recordings.
☆ SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing NeurIPS 2025
Rectified flow models have become a de facto standard in image generation due to their stable sampling trajectories and high-fidelity outputs. Despite their strong generative capabilities, they face critical limitations in image editing tasks: inaccurate inversion processes for mapping real images back into the latent space, and gradient entanglement issues during editing often result in outputs that do not faithfully reflect the target prompt. Recent efforts have attempted to directly map source and target distributions via ODE-based approaches without inversion; however,these methods still yield suboptimal editing quality. In this work, we propose a flow decomposition-and-aggregation framework built upon an inversion-free formulation to address these limitations. Specifically, we semantically decompose the target prompt into multiple sub-prompts, compute an independent flow for each, and aggregate them to form a unified editing trajectory. While we empirically observe that decomposing the original flow enhances diversity in the target space, generating semantically aligned outputs still requires consistent guidance toward the full target prompt. To this end, we design a projection and soft-aggregation mechanism for flow, inspired by gradient conflict resolution in multi-task learning. This approach adaptively weights the sub-target velocity fields, suppressing semantic redundancy while emphasizing distinct directions, thereby preserving both diversity and consistency in the final edited output. Experimental results demonstrate that our method outperforms existing zero-shot editing approaches in terms of semantic fidelity and attribute disentanglement. The code is available at https://github.com/Harvard-AI-and-Robotics-Lab/SplitFlow.
comment: Camera-ready version for NeurIPS 2025, 10 pages (main paper)
☆ Generative Image Restoration and Super-Resolution using Physics-Informed Synthetic Data for Scanning Tunneling Microscopy
Scanning tunnelling microscopy (STM) enables atomic-resolution imaging and atom manipulation, but its utility is often limited by tip degradation and slow serial data acquisition. Fabrication adds another layer of complexity since the tip is often subjected to large voltages, which may alter the shape of its apex, requiring it to be conditioned. Here, we propose a machine learning (ML) approach for image repair and super-resolution to alleviate both challenges. Using a dataset of only 36 pristine experimental images of Si(001):H, we demonstrate that a physics-informed synthetic data generation pipeline can be used to train several state-of-the-art flow-matching and diffusion models. Quantitative evaluation with metrics such as the CLIP Maximum Mean Discrepancy (CMMD) score and structural similarity demonstrates that our models are able to effectively restore images and offer a two- to fourfold reduction in image acquisition time by accurately reconstructing images from sparsely sampled data. Our framework has the potential to significantly increase STM experimental throughput by offering a route to reducing the frequency of tip-conditioning procedures and to enhancing frame rates in existing high-speed STM systems.
☆ BikeScenes: Online LiDAR Semantic Segmentation for Bicycles
The vulnerability of cyclists, exacerbated by the rising popularity of faster e-bikes, motivates adapting automotive perception technologies for bicycle safety. We use our multi-sensor 'SenseBike' research platform to develop and evaluate a 3D LiDAR segmentation approach tailored to bicycles. To bridge the automotive-to-bicycle domain gap, we introduce the novel BikeScenes-lidarseg Dataset, comprising 3021 consecutive LiDAR scans around the university campus of the TU Delft, semantically annotated for 29 dynamic and static classes. By evaluating model performance, we demonstrate that fine-tuning on our BikeScenes dataset achieves a mean Intersection-over-Union (mIoU) of 63.6%, significantly outperforming the 13.8% obtained with SemanticKITTI pre-training alone. This result underscores the necessity and effectiveness of domain-specific training. We highlight key challenges specific to bicycle-mounted, hardware-constrained perception systems and contribute the BikeScenes dataset as a resource for advancing research in cyclist-centric LiDAR segmentation.
☆ MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency
Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. This discarding of informative data together with the optimizing for a single reward tend to harm diversity, semantic fidelity and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but it also significantly speeds up the training. Our proposed method, called MIRO, achieves state-of-the-art performances on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).
comment: Project page: https://nicolas-dufour.github.io/miro
☆ Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start SP
Reinforcement learning (RL) with verifiable rewards has recently catalyzed a wave of "MLLM-r1" approaches that bring RL to vision language models. Most representative paradigms begin with a cold start, typically employing supervised fine-tuning (SFT), to initialize the policy before RL. However, SFT-based cold start adopts the reasoning paradigm intertwined with task solution and output format, which may induce instruction-style overfitting, weakens out-of-distribution generalization, and ultimately affects downstream RL. We revisit the cold start along two views, its training method and data construction, and introduce the Generalization Factor (GF) coefficient to quantify the generalization capability under different methods. Our empirical study finds that preference-based training methods (e.g. DPO) generalizes better than SFT-based methods in cold start. Motivated by this, we propose SPECS-a Self-distilled, Preference-based Cold Start framework that decouples multimodal learning: (1) generates introspective preference data pairs via self-distillation, avoiding reliance on larger teachers or manual annotation; (2) performs preference-based training to learn, focusing on shallow, transferable surface-form criteria (format, structure, style) rather than memorizing content; and (3) hands off to RL with verifiable rewards for deep reasoning results. Experimental results across multiple multimodal benchmarks show that our decoupling learning framework yields consistent performance gains over strong baselines, improving MEGA-Bench by 4.1% and MathVista by 12.2%. Additional experiments indicate that SPECS contributes to reducing in-distribution "stuckness," improving exploration, stabilizing training, and raising the performance ceiling.
comment: Project Page: https://github.com/Kwen-Chen/SPECS-VL
☆ Enhancing Underwater Object Detection through Spatio-Temporal Analysis and Spatial Attention Networks
This study examines the effectiveness of spatio-temporal modeling and the integration of spatial attention mechanisms in deep learning models for underwater object detection. Specifically, in the first phase, the performance of temporal-enhanced YOLOv5 variant T-YOLOv5 is evaluated, in comparison with the standard YOLOv5. For the second phase, an augmented version of T-YOLOv5 is developed, through the addition of a Convolutional Block Attention Module (CBAM). By examining the effectiveness of the already pre-existing YOLOv5 and T-YOLOv5 models and of the newly developed T-YOLOv5 with CBAM. With CBAM, the research highlights how temporal modeling improves detection accuracy in dynamic marine environments, particularly under conditions of sudden movements, partial occlusions, and gradual motion. The testing results showed that YOLOv5 achieved a mAP@50-95 of 0.563, while T-YOLOv5 and T-YOLOv5 with CBAM outperformed with mAP@50-95 scores of 0.813 and 0.811, respectively, highlighting their superior accuracy and generalization in detecting complex objects. The findings demonstrate that T-YOLOv5 significantly enhances detection reliability compared to the standard model, while T-YOLOv5 with CBAM further improves performance in challenging scenarios, although there is a loss of accuracy when it comes to simpler scenarios.
♻ ☆ GENRE-CMR: Generalizable Deep Learning for Diverse Multi-Domain Cardiac MRI Reconstruction
Accelerated Cardiovascular Magnetic Resonance (CMR) image reconstruction remains a critical challenge due to the trade-off between scan time and image quality, particularly when generalizing across diverse acquisition settings. We propose GENRE-CMR, a generative adversarial network (GAN)-based architecture employing a residual deep unrolled reconstruction framework to enhance reconstruction fidelity and generalization. The architecture unrolls iterative optimization into a cascade of convolutional subnetworks, enriched with residual connections to enable progressive feature propagation from shallow to deeper stages. To further improve performance, we integrate two loss functions: (1) an Edge-Aware Region (EAR) loss, which guides the network to focus on structurally informative regions and helps prevent common reconstruction blurriness; and (2) a Statistical Distribution Alignment (SDA) loss, which regularizes the feature space across diverse data distributions via a symmetric KL divergence formulation. Extensive experiments confirm that GENRE-CMR surpasses state-of-the-art methods on training and unseen data, achieving 0.9552 SSIM and 38.90 dB PSNR on unseen distributions across various acceleration factors and sampling trajectories. Ablation studies confirm the contribution of each proposed component to reconstruction quality and generalization. Our framework presents a unified and robust solution for high-quality CMR reconstruction, paving the way for clinically adaptable deployment across heterogeneous acquisition protocols.
♻ ☆ MILo: Mesh-In-the-Loop Gaussian Splatting for Detailed and Efficient Surface Reconstruction
While recent advances in Gaussian Splatting have enabled fast reconstruction of high-quality 3D scenes from images, extracting accurate surface meshes remains a challenge. Current approaches extract the surface through costly post-processing steps, resulting in the loss of fine geometric details or requiring significant time and leading to very dense meshes with millions of vertices. More fundamentally, the a posteriori conversion from a volumetric to a surface representation limits the ability of the final mesh to preserve all geometric structures captured during training. We present MILo, a novel Gaussian Splatting framework that bridges the gap between volumetric and surface representations by differentiably extracting a mesh from the 3D Gaussians. We design a fully differentiable procedure that constructs the mesh-including both vertex locations and connectivity-at every iteration directly from the parameters of the Gaussians, which are the only quantities optimized during training. Our method introduces three key technical contributions: a bidirectional consistency framework ensuring both representations-Gaussians and the extracted mesh-capture the same underlying geometry during training; an adaptive mesh extraction process performed at each training iteration, which uses Gaussians as differentiable pivots for Delaunay triangulation; a novel method for computing signed distance values from the 3D Gaussians that enables precise surface extraction while avoiding geometric erosion. Our approach can reconstruct complete scenes, including backgrounds, with state-of-the-art quality while requiring an order of magnitude fewer mesh vertices than previous methods. Due to their light weight and empty interior, our meshes are well suited for downstream applications such as physics simulations or animation.
comment: 10 pages. A presentation video of our approach is available at https://youtu.be/_SGNhhNz0fE
♻ ☆ DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO NeurIPS 2025
Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement algorithm with group-normalized rewards. However, the effectiveness of GRPO in Video Large Language Models (VideoLLMs) has still been less studyed. In this paper, we explore GRPO and identify two problems that deteriorate the effective learning: (1) reliance on safeguards, and (2) vanishing advantage. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO loss function into a regression task that directly predicts the advantage in GRPO, eliminating the need for safeguards such as the clipping and min functions. It directly aligns the model with advantages, providing guidance to prefer better ones. The difficulty-aware data augmentation strategy augments input prompts/videos to locate the difficulty of samples at solvable difficulty levels, enabling diverse reward signals. Our experimental results show that our approach significantly improves video reasoning performance across multiple benchmarks.
comment: NeurIPS 2025
♻ ☆ RL-I2IT: Image-to-Image Translation with Deep Reinforcement Learning
Most existing Image-to-Image Translation (I2IT) methods generate images in a single run of a deep learning (DL) model. However, designing such a single-step model is always challenging, requiring a huge number of parameters and easily falling into bad global minimums and overfitting. In this work, we reformulate I2IT as a step-wise decision-making problem via deep reinforcement learning (DRL) and propose a novel framework that performs RL-based I2IT (RL-I2IT). The key feature in the RL-I2IT framework is to decompose a monolithic learning process into small steps with a lightweight model to progressively transform a source image successively to a target image. Considering that it is challenging to handle high dimensional continuous state and action spaces in the conventional RL framework, we introduce meta policy with a new concept Plan to the standard Actor-Critic model, which is of a lower dimension than the original image and can facilitate the actor to generate a tractable high dimensional action. In the RL-I2IT framework, we also employ a task-specific auxiliary learning strategy to stabilize the training process and improve the performance of the corresponding task. Experiments on several I2IT tasks demonstrate the effectiveness and robustness of the proposed method when facing high-dimensional continuous action space problems. Our implementation of the RL-I2IT framework is available at https://github.com/Algolzw/SPAC-Deformable-Registration.
♻ ☆ Quantizing Space and Time: Fusing Time Series and Images for Earth Observation
We propose a task-agnostic framework for multimodal fusion of time series and single timestamp images, enabling cross-modal generation and robust downstream performance. Our approach explores deterministic and learned strategies for time series quantization and then leverages a masked correlation learning objective, aligning discrete image and time series tokens in a unified representation space. Instantiated in the Earth observation domain, the pretrained model generates consistent global temperature profiles from satellite imagery and is validated through counterfactual experiments. Across downstream tasks, our task-agnostic pretraining outperforms task-specific fusion by 6% in R^2 and 2% in RMSE on average, and exceeds baseline methods by 50% in R^2 and 12% in RMSE. Finally, we analyze gradient sensitivity across modalities, providing insights into model robustness. Code, data, and weights will be released under a permissive license.
♻ ☆ FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering NeurIPS 2025
While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) focusing on small image details still remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task using the top-ranked region. As a result of this informed search strategy, FOCUS achieves strong performance across four fine-grained VQA datasets and three types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye, while requiring 3 - 6.5 x less compute.
comment: Accepted by NeurIPS 2025 - main track. Project page: https://focus-mllm-vqa.github.io/
♻ ☆ RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision-Language-Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.
♻ ☆ ScribbleVS: Scribble-Supervised Medical Image Segmentation via Dynamic Competitive Pseudo Label Selection
In clinical medicine, precise image segmentation can provide substantial support to clinicians. However, obtaining high-quality segmentation typically demands extensive pixel-level annotations, which are labor-intensive and expensive. Scribble annotations offer a more cost-effective alternative by improving labeling efficiency. Nonetheless, using such sparse supervision for training reliable medical image segmentation models remains a significant challenge. Some studies employ pseudo-labeling to enhance supervision, but these methods are susceptible to noise interference. To address these challenges, we introduce ScribbleVS, a framework designed to learn from scribble annotations. We introduce a Regional Pseudo Labels Diffusion Module to expand the scope of supervision and reduce the impact of noise present in pseudo labels. Additionally, we introduce a Dynamic Competitive Selection module for enhanced refinement in selecting pseudo labels. Experiments conducted on the ACDC, MSCMRseg, WORD, and BraTS2020 datasets demonstrate promising results, achieving segmentation precision comparable to fully supervised models. The codes of this study are available at https://github.com/ortonwang/ScribbleVS.
♻ ☆ HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene NeurIPS 2025
Reconstructing dynamic 3D scenes from monocular videos remains a fundamental challenge in 3D vision. While 3D Gaussian Splatting (3DGS) achieves real-time rendering in static settings, extending it to dynamic scenes is challenging due to the difficulty of learning structured and temporally consistent motion representations. This challenge often manifests as three limitations in existing methods: redundant Gaussian updates, insufficient motion supervision, and weak modeling of complex non-rigid deformations. These issues collectively hinder coherent and efficient dynamic reconstruction. To address these limitations, we propose HAIF-GS, a unified framework that enables structured and consistent dynamic modeling through sparse anchor-driven deformation. It first identifies motion-relevant regions via an Anchor Filter to suppress redundant updates in static areas. A self-supervised Induced Flow-Guided Deformation module induces anchor motion using multi-frame feature aggregation, eliminating the need for explicit flow labels. To further handle fine-grained deformations, a Hierarchical Anchor Propagation mechanism increases anchor resolution based on motion complexity and propagates multi-level transformations. Extensive experiments on synthetic and real-world benchmarks validate that HAIF-GS significantly outperforms prior dynamic 3DGS methods in rendering quality, temporal coherence, and reconstruction efficiency.
comment: Accepted to NeurIPS 2025. Project page: https://echopickle.github.io/HAIF-GS.github.io/
♻ ☆ Simulating Automotive Radar with Lidar and Camera Inputs IROS 2025
Low-cost millimeter automotive radar has received more and more attention due to its ability to handle adverse weather and lighting conditions in autonomous driving. However, the lack of quality datasets hinders research and development. We report a new method that is able to simulate 4D millimeter wave radar signals including pitch, yaw, range, and Doppler velocity along with radar signal strength (RSS) using camera image, light detection and ranging (lidar) point cloud, and ego-velocity. The method is based on two new neural networks: 1) DIS-Net, which estimates the spatial distribution and number of radar signals, and 2) RSS-Net, which predicts the RSS of the signal based on appearance and geometric information. We have implemented and tested our method using open datasets from 3 different models of commercial automotive radar. The experimental results show that our method can successfully generate high-fidelity radar signals. Moreover, we have trained a popular object detection neural network with data augmented by our synthesized radar. The network outperforms the counterpart trained only on raw radar data, a promising result to facilitate future radar-based research and development.
comment: Accepted by IROS 2025
♻ ☆ FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving NeurIPS 2025
Vision-Language-Action (VLA) models are increasingly used for end-to-end driving due to their world knowledge and reasoning ability. Most prior work, however, inserts textual chains-of-thought (CoT) as intermediate steps tailored to the current scene. Such symbolic compressions can blur spatio-temporal relations and discard fine visual cues, creating a cross-modal gap between perception and planning. We propose FSDrive, a visual spatio-temporal CoT framework that enables VLAs to think in images. The model first acts as a world model to generate a unified future frame that overlays coarse but physically-plausible priors-future lane dividers and 3D boxes-on the predicted future image. This unified frame serves as the visual CoT, capturing both spatial structure and temporal evolution. The same VLA then functions as an inverse-dynamics model, planning trajectories from current observations and the visual CoT. To equip VLAs with image generation while preserving understanding, we introduce a unified pre-training paradigm that expands the vocabulary to include visual tokens and jointly optimizes VQA (for semantics) and future-frame prediction (for dynamics). A progressive easy-to-hard scheme first predicts lane/box priors to enforce physical constraints, then completes full future frames for fine details. On nuScenes and NAVSIM, FSDrive improves trajectory accuracy and reduces collisions under both ST-P3 and UniAD metrics, and attains competitive FID for future-frame generation despite using lightweight autoregression. It also advances scene understanding on DriveLM. Together, these results indicate that visual CoT narrows the cross-modal gap and yields safer, more anticipatory planning. Code is available at https://github.com/MIV-XJTU/FSDrive.
comment: Accepted to NeurIPS 2025 as Spotlight Presentation. Code: https://github.com/MIV-XJTU/FSDrive
♻ ☆ Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025)
Accurately predicting distributed cortical responses to naturalistic stimuli requires models that integrate visual, auditory and semantic information over time. We present a hierarchical multimodal recurrent ensemble that maps pretrained video, audio, and language embeddings to fMRI time series recorded while four subjects watched almost 80 hours of movies provided by the Algonauts 2025 challenge. Modality-specific bidirectional RNNs encode temporal dynamics; their hidden states are fused and passed to a second recurrent layer, and lightweight subject-specific heads output responses for 1000 cortical parcels. Training relies on a composite MSE-correlation loss and a curriculum that gradually shifts emphasis from early sensory to late association regions. Averaging 100 model variants further boosts robustness. The resulting system ranked third on the competition leaderboard, achieving an overall Pearson r = 0.2094 and the highest single-parcel peak score (mean r = 0.63) among all participants, with particularly strong gains for the most challenging subject (Subject 5). The approach establishes a simple, extensible baseline for future multimodal brain-encoding benchmarks.
comment: 8 pages, 2 figures, 1 table. Invited report, CCN 2025 Algonauts Project session (3rd-place team). Code: https://github.com/erensemih/Algonauts2025_ModalityRNN v3: Added equal contribution footnote to author list. Corrected reference list
♻ ☆ Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles
Road traffic accidents remain a significant global concern, with human error, particularly distracted and impaired driving, among the leading causes. This study introduces a novel driver behaviour classification system that uses external observation techniques to detect indicators of distraction and impairment. The proposed framework employs advanced computer vision methodologies, including real-time object tracking, lateral displacement analysis, and lane position monitoring. The system identifies unsafe driving behaviours such as excessive lateral movement and erratic trajectory patterns by implementing the YOLO object detection model and custom lane estimation algorithms. Unlike systems reliant on inter-vehicular communication, this vision-based approach enables behavioural analysis of non-connected vehicles. Experimental evaluations on diverse video datasets demonstrate the framework's reliability and adaptability across varying road and environmental conditions.
♻ ☆ U-DECN: End-to-End Underwater Object Detection ConvNet with Improved DeNoising Training
Underwater object detection has higher requirements of running speed and deployment efficiency for the detector due to its specific environmental challenges. NMS of two- or one-stage object detectors and transformer architecture of query-based end-to-end object detectors are not conducive to deployment on underwater embedded devices with limited processing power. As for the detrimental effect of underwater color cast noise, recent underwater object detectors make network architecture or training complex, which also hinders their application and deployment on unmanned underwater vehicles. In this paper, we propose the Underwater DECO with improved deNoising training (U-DECN), the query-based end-to-end object detector (with ConvNet encoder-decoder architecture) for underwater color cast noise that addresses the above problems. We integrate advanced technologies from DETR variants into DECO and design optimization methods specifically for the ConvNet architecture, including Deformable Convolution in SIM and Separate Contrastive DeNoising Forward methods. To address the underwater color cast noise issue, we propose an Underwater Color DeNoising Query method to improve the generalization of the model for the biased object feature information by different color cast noise. Our U-DECN, with ResNet-50 backbone, achieves the best 64.0 AP on DUO and the best 58.1 AP on RUOD, and 21 FPS (5 times faster than Deformable DETR and DINO 4 FPS) on NVIDIA AGX Orin by TensorRT FP16, outperforming the other state-of-the-art query-based end-to-end object detectors. The code is available at https://github.com/LEFTeyex/U-DECN.
comment: 10 pages, 6 figures, 7 tables, accepted by IEEE TGRS
♻ ☆ Probabilistic Kernel Function for Fast Angle Testing
In this paper, we study the angle testing problem in the context of similarity search in high-dimensional Euclidean spaces and propose two projection-based probabilistic kernel functions, one designed for angle comparison and the other for angle thresholding. Unlike existing approaches that rely on random projection vectors drawn from Gaussian distributions, our approach leverages reference angles and employs a deterministic structure for the projection vectors. Notably, our kernel functions do not require asymptotic assumptions, such as the number of projection vectors tending to infinity, and can be both theoretically and experimentally shown to outperform Gaussian-distribution-based kernel functions. We apply the proposed kernel function to Approximate Nearest Neighbor Search (ANNS) and demonstrate that our approach achieves a 2.5X ~ 3X higher query-per-second (QPS) throughput compared to the widely-used graph-based search algorithm HNSW.
♻ ☆ FastJAM: a Fast Joint Alignment Model for Images NeurIPS 2025
Joint Alignment (JA) of images aims to align a collection of images into a unified coordinate frame, such that semantically-similar features appear at corresponding spatial locations. Most existing approaches often require long training times, large-capacity models, and extensive hyperparameter tuning. We introduce FastJAM, a rapid, graph-based method that drastically reduces the computational complexity of joint alignment tasks. FastJAM leverages pairwise matches computed by an off-the-shelf image matcher, together with a rapid nonparametric clustering, to construct a graph representing intra- and inter-image keypoint relations. A graph neural network propagates and aggregates these correspondences, efficiently predicting per-image homography parameters via image-level pooling. Utilizing an inverse-compositional loss, that eliminates the need for a regularization term over the predicted transformations (and thus also obviates the hyperparameter tuning associated with such terms), FastJAM performs image JA quickly and effectively. Experimental results on several benchmarks demonstrate that FastJAM achieves results better than existing modern JA methods in terms of alignment quality, while reducing computation time from hours or minutes to mere seconds. Our code is available at our project webpage, https://bgu-cs-vil.github.io/FastJAM/
comment: Accepted to NeurIPS 2025. Pages 1-10 are the Main Paper. Pages 23-31 are Supplemental Material. FastJAM website - https://bgu-cs-vil.github.io/FastJAM/
♻ ☆ Open3D-VQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space
Spatial reasoning is a fundamental capability of multimodal large language models (MLLMs), yet their performance in open aerial environments remains underexplored. In this work, we present Open3D-VQA, a novel benchmark for evaluating MLLMs' ability to reason about complex spatial relationships from an aerial perspective. The benchmark comprises 73k QA pairs spanning 7 general spatial reasoning tasks, including multiple-choice, true/false, and short-answer formats, and supports both visual and point cloud modalities. The questions are automatically generated from spatial relations extracted from both real-world and simulated aerial scenes. Evaluation on 13 popular MLLMs reveals that: 1) Models are generally better at answering questions about relative spatial relations than absolute distances, 2) 3D LLMs fail to demonstrate significant advantages over 2D LLMs, and 3) Fine-tuning solely on the simulated dataset can significantly improve the model's spatial reasoning performance in real-world scenarios. We release our benchmark, data generation pipeline, and evaluation toolkit to support further research: https://github.com/EmbodiedCity/Open3D-VQA.code.
♻ ☆ Activation Matching for Explanation Generation
In this paper we introduce an activation-matching--based approach to generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image. Given an input image $x$ and a frozen model $f$, we train a lightweight autoencoder to output a binary mask $m$ such that the explanation $e = m \odot x$ preserves both the model's prediction and the intermediate activations of \(x\). Our objective combines: (i) multi-layer activation matching with KL divergence to align distributions and cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors -- L1 area for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Together, these objectives yield small, human-interpretable masks that retain classifier behavior while discarding irrelevant input regions, providing practical and faithful minimalist explanations for the decision making of the underlying model.
♻ ☆ Single Image Estimation of Cell Migration Direction by Deep Circular Regression
In this paper, we address the problem of estimating the migration direction of cells based on a single image. A solution to this problem lays the foundation for a variety of applications that were previously not possible. To our knowledge, there is only one related work that employs a classification CNN with four classes (quadrants). However, this approach does not allow for detailed directional resolution. We tackle the single image estimation problem using deep circular regression, with a particular focus on cycle-sensitive methods. On two common datasets, we achieve a mean estimation error of $\sim\!17^\circ$, representing a significant improvement over previous work, which reported estimation error of $30^\circ$ and $34^\circ$, respectively.
♻ ☆ When are radiology reports useful for training medical image classifiers?
Medical images used to train machine learning models are often accompanied by radiology reports containing rich expert annotations. However, relying on these reports as inputs for clinical prediction requires the timely manual work of a trained radiologist. This raises a natural question: when can radiology reports be leveraged during training to improve image-only classification? Prior works are limited to evaluating pre-trained image representations by fine-tuning them to predict diagnostic labels, often extracted from reports, ignoring tasks with labels that are weakly associated with the text. To address this gap, we conduct a systematic study of how radiology reports can be used during both pre-training and fine-tuning, across diagnostic and prognostic tasks (e.g., 12-month readmission), and under varying training set sizes. Our findings reveal that: (1) Leveraging reports during pre-training is beneficial for downstream classification tasks where the label is well-represented in the text; however, pre-training through explicit image-text alignment can be detrimental in settings where it's not; (2) Fine-tuning with reports can lead to significant improvements and even have a larger impact than the pre-training method in certain settings. These results provide actionable insights into when and how to leverage privileged text data to train medical image classifiers while highlighting gaps in current research.
♻ ☆ MagicPortrait: Temporally Consistent Face Reenactment with 3D Geometric Guidance
In this study, we propose a method for video face reenactment that integrates a 3D face parametric model into a latent diffusion framework, aiming to improve shape consistency and motion control in existing video-based face generation approaches. Our approach employs the FLAME (Faces Learned with an Articulated Model and Expressions) model as the 3D face parametric representation, providing a unified framework for modeling face expressions and head pose. This not only enables precise extraction of motion features from driving videos, but also contributes to the faithful preservation of face shape and geometry. Specifically, we enhance the latent diffusion model with rich 3D expression and detailed pose information by incorporating depth maps, normal maps, and rendering maps derived from FLAME sequences. These maps serve as motion guidance and are encoded into the denoising UNet through a specifically designed Geometric Guidance Encoder (GGE). A multi-layer feature fusion module with integrated self-attention mechanisms is used to combine facial appearance and motion latent features within the spatial domain. By utilizing the 3D face parametric model as motion guidance, our method enables parametric alignment of face identity between the reference image and the motion captured from the driving video. Experimental results on benchmark datasets show that our method excels at generating high-quality face animations with precise expression and head pose variation modeling. In addition, it demonstrates strong generalization performance on out-of-domain images. Code is publicly available at https://github.com/weimengting/MagicPortrait.
♻ ☆ HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models NeurIPS2025
Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they widely equip with, e.g., CLIP and SAM, which lack the alignment with language at multi-granularity levels. To address this issue, in this paper, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed as HyperET, which can optimize visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with M\"{o}bius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET consistently improves both existing pre-training and fine-tuning MLLMs clearly with less than 1\% additional parameters.
comment: Accepted by NeurIPS2025 (Oral)
♻ ☆ DGTRSD & DGTRS-CLIP: A Dual-Granularity Remote Sensing Image-Text Dataset and Vision Language Foundation Model for Alignment
Vision Language Foundation Models based on CLIP architecture for remote sensing primarily rely on short text captions, which often result in incomplete semantic representations. Although longer captions convey richer information, existing models struggle to process them effectively because of limited text-encoding capacity, and there remains a shortage of resources that align remote sensing images with both short text and long text captions. To address this gap, we introduce DGTRSD, a dual-granularity remote sensing image-text dataset, where each image is paired with both a short text caption and a long text description, providing a solid foundation for dual-granularity semantic modeling. Based on this, we further propose DGTRS-CLIP, a dual-granularity curriculum learning framework that combines short text and long text supervision to achieve dual-granularity semantic alignment. Extensive experiments on four typical zero-shot tasks: long text cross-modal retrieval, short text cross-modal retrieval, image classification, and semantic localization demonstrate that DGTRS-CLIP consistently outperforms existing methods across all tasks. The code has been open-sourced and is available at https://github.com/MitsuiChen14/DGTRS.
♻ ☆ XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark
Document Reading Order Recovery is a fundamental task in document image understanding, playing a pivotal role in enhancing Retrieval-Augmented Generation (RAG) and serving as a critical preprocessing step for large language models (LLMs). Existing methods often struggle with complex layouts(e.g., multi-column newspapers), high-overhead interactions between cross-modal elements (visual regions and textual semantics), and a lack of robust evaluation benchmarks. We introduce XY-Cut++, an advanced layout ordering method that integrates pre-mask processing, multi-granularity segmentation, and cross-modal matching to address these challenges. Our method significantly enhances layout ordering accuracy compared to traditional XY-Cut techniques. Specifically, XY-Cut++ achieves state-of-the-art performance (98.8 BLEU overall) while maintaining simplicity and efficiency. It outperforms existing baselines by up to 24\% and demonstrates consistent accuracy across simple and complex layouts on the newly introduced DocBench-100 dataset. This advancement establishes a reliable foundation for document structure recovery, setting a new standard for layout ordering tasks and facilitating more effective RAG and LLM preprocessing.
♻ ☆ NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation
Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/.
comment: Accepted by Neurips2025, Project page at at https://artanic30.github.io/project_pages/NoisyGRPO/
♻ ☆ InstDrive: Instance-Aware 3D Gaussian Splatting for Driving Scenes
Reconstructing dynamic driving scenes from dashcam videos has attracted increasing attention due to its significance in autonomous driving and scene understanding. While recent advances have made impressive progress, most methods still unify all background elements into a single representation, hindering both instance-level understanding and flexible scene editing. Some approaches attempt to lift 2D segmentation into 3D space, but often rely on pre-processed instance IDs or complex pipelines to map continuous features to discrete identities. Moreover, these methods are typically designed for indoor scenes with rich viewpoints, making them less applicable to outdoor driving scenarios. In this paper, we present InstDrive, an instance-aware 3D Gaussian Splatting framework tailored for the interactive reconstruction of dynamic driving scene. We use masks generated by SAM as pseudo ground-truth to guide 2D feature learning via contrastive loss and pseudo-supervised objectives. At the 3D level, we introduce regularization to implicitly encode instance identities and enforce consistency through a voxel-based loss. A lightweight static codebook further bridges continuous features and discrete identities without requiring data pre-processing or complex optimization. Quantitative and qualitative experiments demonstrate the effectiveness of InstDrive, and to the best of our knowledge, it is the first framework to achieve 3D instance segmentation in dynamic, open-world driving scenes.More visualizations are available at our project page.
♻ ☆ DPMambaIR: All-in-One Image Restoration via Degradation-Aware Prompt State Space Model
All-in-One image restoration aims to address multiple image degradation problems using a single model, offering a more practical and versatile solution compared to designing dedicated models for each degradation type. Existing approaches typically rely on Degradation-specific models or coarse-grained degradation prompts to guide image restoration. However, they lack fine-grained modeling of degradation information and face limitations in balancing multi-task conflicts. To overcome these limitations, we propose DPMambaIR, a novel All-in-One image restoration framework that introduces a fine-grained degradation extractor and a Degradation-Aware Prompt State Space Model (DP-SSM). The DP-SSM leverages the fine-grained degradation features captured by the extractor as dynamic prompts, which are then incorporated into the state space modeling process. This enhances the model's adaptability to diverse degradation types, while a complementary High-Frequency Enhancement Block (HEB) recovers local high-frequency details. Extensive experiments on a mixed dataset containing seven degradation types show that DPMambaIR achieves the best performance, with 27.69dB and 0.893 in PSNR and SSIM, respectively. These results highlight the potential and superiority of DPMambaIR as a unified solution for All-in-One image restoration.
♻ ☆ Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models
End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common ineffectiveness in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR-WM first establishes a robust bird's-eye-view representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the "residual", i.e., the changes conditioned on the ego-vehicle's actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting-planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.
♻ ☆ InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts
Understanding infographic charts with design-driven visual elements (e.g., pictograms, icons) requires both visual recognition and reasoning, posing challenges for multimodal large language models (MLLMs). However, existing visual-question answering benchmarks fall short in evaluating these capabilities of MLLMs due to the lack of paired plain charts and visual-element-based questions. To bridge this gap, we introduce InfoChartQA, a benchmark for evaluating MLLMs on infographic chart understanding. It includes 5,642 pairs of infographic and plain charts, each sharing the same underlying data but differing in visual presentations. We further design visual-element-based questions to capture their unique visual designs and communicative intent. Evaluation of 20 MLLMs reveals a substantial performance decline on infographic charts, particularly for visual-element-based questions related to metaphors. The paired infographic and plain charts enable fine-grained error analysis and ablation studies, which highlight new opportunities for advancing MLLMs in infographic chart understanding. We release InfoChartQA at https://github.com/CoolDawnAnt/InfoChartQA.
♻ ☆ HF-VTON: High-Fidelity Virtual Try-On via Consistent Geometric and Semantic Alignment
Virtual try-on technology has become increasingly important in the fashion and retail industries, enabling the generation of high-fidelity garment images that adapt seamlessly to target human models. While existing methods have achieved notable progress, they still face significant challenges in maintaining consistency across different poses. Specifically, geometric distortions lead to a lack of spatial consistency, mismatches in garment structure and texture across poses result in semantic inconsistency, and the loss or distortion of fine-grained details diminishes visual fidelity. To address these challenges, we propose HF-VTON, a novel framework that ensures high-fidelity virtual try-on performance across diverse poses. HF-VTON consists of three key modules: (1) the Appearance-Preserving Warp Alignment Module (APWAM), which aligns garments to human poses, addressing geometric deformations and ensuring spatial consistency; (2) the Semantic Representation and Comprehension Module (SRCM), which captures fine-grained garment attributes and multi-pose data to enhance semantic representation, maintaining structural, textural, and pattern consistency; and (3) the Multimodal Prior-Guided Appearance Generation Module (MPAGM), which integrates multimodal features and prior knowledge from pre-trained models to optimize appearance generation, ensuring both semantic and geometric consistency. Additionally, to overcome data limitations in existing benchmarks, we introduce the SAMP-VTONS dataset, featuring multi-pose pairs and rich textual annotations for a more comprehensive evaluation. Experimental results demonstrate that HF-VTON outperforms state-of-the-art methods on both VITON-HD and SAMP-VTONS, excelling in visual fidelity, semantic consistency, and detail preservation.
comment: After the publication of the paper, we discovered some significant errors/omissions that need to be corrected and improved
♻ ☆ Diverse Teaching and Label Propagation for Generic Semi-Supervised Medical Image Segmentation
Both limited annotation and domain shift are significant challenges frequently encountered in medical image segmentation, leading to derivative scenarios like semi-supervised medical (SSMIS), semi-supervised medical domain generalization (Semi-MDG) and unsupervised medical domain adaptation (UMDA). Conventional methods are generally tailored to specific tasks in isolation, the error accumulation hinders the effective utilization of unlabeled data and limits further improvements, resulting in suboptimal performance when these issues occur. In this paper, we aim to develop a generic framework that masters all three tasks. We found that the key to solving the problem lies in how to generate reliable pseudo labels for the unlabeled data in the presence of domain shift with labeled data and increasing the diversity of the model. To tackle this issue, we employ a Diverse Teaching and Label Propagation Network (DTLP-Net) to boosting the Generic Semi-Supervised Medical Image Segmentation. Our DTLP-Net involves a single student model and two diverse teacher models, which can generate reliable pseudo-labels for the student model. The first teacher model decouple the training process with labeled and unlabeled data, The second teacher is momentum-updated periodically, thus generating reliable yet divers pseudo-labels. To fully utilize the information within the data, we adopt inter-sample and intra-sample data augmentation to learn the global and local knowledge. In addition, to further capture the voxel-level correlations, we propose label propagation to enhance the model robust. We evaluate our proposed framework on five benchmark datasets for SSMIS, UMDA, and Semi-MDG tasks. The results showcase notable improvements compared to state-of-the-art methods across all five settings, indicating the potential of our framework to tackle more challenging SSL scenarios.
comment: Under Review
♻ ☆ Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving
Ensuring the safety of vision-language models (VLMs) in autonomous driving systems is of paramount importance, yet existing research has largely focused on conventional benchmarks rather than safety-critical evaluation. In this work, we present SCD-Bench (Safety Cognition Driving Benchmark) a novel framework specifically designed to assess the safety cognition capabilities of VLMs within interactive driving scenarios. To address the scalability challenge of data annotation, we introduce ADA (Autonomous Driving Annotation), a semi-automated labeling system, further refined through expert review by professionals with domain-specific knowledge in autonomous driving. To facilitate scalable and consistent evaluation, we also propose an automated assessment pipeline leveraging large language models, which demonstrates over 98% agreement with human expert judgments. In addressing the broader challenge of aligning VLMs with safety cognition in driving environments, we construct SCD-Training, the first large-scale dataset tailored for this task, comprising 324.35K high-quality samples. Through extensive experiments, we show that models trained on SCD-Training exhibit marked improvements not only on SCD-Bench, but also on general and domain-specific benchmarks, offering a new perspective on enhancing safety-aware interactions in vision-language systems for autonomous driving.
♻ ☆ Depth-Aware Super-Resolution via Distance-Adaptive Variational Formulation
Single image super-resolution traditionally assumes spatially-invariant degradation models, yet real-world imaging systems exhibit complex distance-dependent effects including atmospheric scattering, depth-of-field variations, and perspective distortions. This fundamental limitation necessitates spatially-adaptive reconstruction strategies that explicitly incorporate geometric scene understanding for optimal performance. We propose a rigorous variational framework that characterizes super-resolution as a spatially-varying inverse problem, formulating the degradation operator as a pseudodifferential operator with distance-dependent spectral characteristics that enable theoretical analysis of reconstruction limits across depth ranges. Our neural architecture implements discrete gradient flow dynamics through cascaded residual blocks with depth-conditional convolution kernels, ensuring convergence to stationary points of the theoretical energy functional while incorporating learned distance-adaptive regularization terms that dynamically adjust smoothness constraints based on local geometric structure. Spectral constraints derived from atmospheric scattering theory prevent bandwidth violations and noise amplification in far-field regions, while adaptive kernel generation networks learn continuous mappings from depth to reconstruction filters. Comprehensive evaluation across five benchmark datasets demonstrates state-of-the-art performance, achieving 36.89/0.9516 and 30.54/0.8721 PSNR/SSIM at 2 and 4 scales on KITTI outdoor scenes, outperforming existing methods by 0.44dB and 0.36dB respectively. This work establishes the first theoretically-grounded distance-adaptive super-resolution framework and demonstrates significant improvements on depth-variant scenarios while maintaining competitive performance across traditional benchmarks.
♻ ☆ Why Foundation Models in Pathology Are Failing
In non-medical domains, foundation models (FMs) have revolutionized computer vision and language processing through large-scale self-supervised and multimodal learning. Consequently, their rapid adoption in computational pathology was expected to deliver comparable breakthroughs in cancer diagnosis, prognostication, and multimodal retrieval. However, recent systematic evaluations reveal fundamental weaknesses: low diagnostic accuracy, poor robustness, geometric instability, heavy computational demands, and concerning safety vulnerabilities. This short paper examines these shortcomings and argues that they stem from deeper conceptual mismatches between the assumptions underlying generic foundation modeling in mainstream AI and the intrinsic complexity of human tissue. Seven interrelated causes are identified: biological complexity, ineffective self-supervision, overgeneralization, excessive architectural complexity, lack of domain-specific innovation, insufficient data, and a fundamental design flaw related to tissue patch size. These findings suggest that current pathology foundation models remain conceptually misaligned with the nature of tissue morphology and call for a fundamental rethinking of the paradigm itself.
♻ ☆ LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~ 35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.
comment: Withdrawn because the submission was premature and not agreed by all parties in collaboration
♻ ☆ Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning NeurIPS2025
Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model's cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model (3) while incorrect predicted samples are finally used for Group Relative Policy Optimization (GRPO) based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.
comment: [NeurIPS2025] Project Page: https://codegoat24.github.io/UnifiedReward/think
♻ ☆ RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation NeurIPS 2025
Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs' strengths in semantic reasoning and long-horizon planning. These System 2 capabilities-characterized by deliberative, goal-directed thinking-remain under explored due to the limited temporal scale and structural complexity of current benchmarks. To address this gap, we introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. RoboCerebra includes: (1) a large-scale simulation dataset with extended task horizons and diverse subtask sequences in household environments; (2) a hierarchical framework combining a high-level VLM planner with a low-level vision-language-action (VLA) controller; and (3) an evaluation protocol targeting planning, reflection, and memory through structured System 1-System 2 interaction. The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Human operators execute the subtasks in simulation, yielding high-quality trajectories with dynamic object variations. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations. We further benchmark state-of-the-art VLMs as System 2 modules and analyze their performance across key cognitive dimensions, advancing the development of more capable and generalizable robotic planners.
comment: 25 pages, 18 figures, Accepted by NeurIPS 2025
♻ ☆ L2RSI: Cross-view LiDAR-based Place Recognition for Large-scale Urban Scenes via Remote Sensing Imagery NeurIPS 2025
We tackle the challenge of LiDAR-based place recognition, which traditionally depends on costly and time-consuming prior 3D maps. To overcome this, we first construct LiRSI-XA dataset, which encompasses approximately $110,000$ remote sensing submaps and $13,000$ LiDAR point cloud submaps captured in urban scenes, and propose a novel method, L2RSI, for cross-view LiDAR place recognition using high-resolution Remote Sensing Imagery. This approach enables large-scale localization capabilities at a reduced cost by leveraging readily available overhead images as map proxies. L2RSI addresses the dual challenges of cross-view and cross-modal place recognition by learning feature alignment between point cloud submaps and remote sensing submaps in the semantic domain. Additionally, we introduce a novel probability propagation method based on particle estimation to refine position predictions, effectively leveraging temporal and spatial information. This approach enables large-scale retrieval and cross-scene generalization without fine-tuning. Extensive experiments on LiRSI-XA demonstrate that, within a $100km^2$ retrieval range, L2RSI accurately localizes $83.27\%$ of point cloud submaps within a $30m$ radius for top-$1$ retrieved location. Our project page is publicly available at https://shizw695.github.io/L2RSI/.
comment: 17 pages, 7 figures, NeurIPS 2025
♻ ☆ Graph-Theoretic Consistency for Robust and Topology-Aware Semi-Supervised Histopathology Segmentation AAAI 2026
Semi-supervised semantic segmentation (SSSS) is vital in computational pathology, where dense annotations are costly and limited. Existing methods often rely on pixel-level consistency, which propagates noisy pseudo-labels and produces fragmented or topologically invalid masks. We propose Topology Graph Consistency (TGC), a framework that integrates graph-theoretic constraints by aligning Laplacian spectra, component counts, and adjacency statistics between prediction graphs and references. This enforces global topology and improves segmentation accuracy. Experiments on GlaS and CRAG demonstrate that TGC achieves state-of-the-art performance under 5-10% supervision and significantly narrows the gap to full supervision.
comment: Accepted to the AAAI 2026 Student Abstract and Poster Program
♻ ☆ AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets
Background: Development of artificial intelligence (AI) models for lung cancer screening requires large, well-annotated low-dose computed tomography (CT) datasets and rigorous performance benchmarks. Purpose: To create a reproducible benchmarking resource leveraging the Duke Lung Cancer Screening (DLCS) and multiple public datasets to develop and evaluate models for nodule detection and classification. Materials & Methods: This retrospective study uses the DLCS dataset (1,613 patients; 2,487 nodules) and external datasets including LUNA16, LUNA25, and NLST-3D. For detection, MONAI RetinaNet models were trained on DLCS (DLCS-De) and LUNA16 (LUNA16-De) and evaluated using the Competition Performance Metric (CPM). For nodule-level classification, we compare five strategies: pretrained models (Models Genesis, Med3D), a self-supervised foundation model (FMCB), and ResNet50 with random initialization versus Strategic Warm-Start (ResNet50-SWS) pretrained with detection-derived candidate patches stratified by confidence. Results: For detection on the DLCS test set, DLCS-De achieved sensitivity 0.82 at 2 false positives/scan (CPM 0.63) versus LUNA16-De (0.62, CPM 0.45). For external validation on NLST-3D, DLCS-De (sensitivity 0.72, CPM 0.58) also outperformed LUNA16-De (sensitivity 0.64, CPM 0.49). For classification across multiple datasets, ResNet50-SWS attained AUCs of 0.71 (DLCS; 95% CI, 0.61-0.81), 0.90 (LUNA16; 0.87-0.93), 0.81 (NLST-3D; 0.79-0.82), and 0.80 (LUNA25; 0.78-0.82), matching or exceeding pretrained/self-supervised baselines. Performance differences reflected dataset label standards. Conclusion: This work establishes a standardized benchmarking resource for lung cancer AI research, supporting model development, validation, and translation. All code, models, and data are publicly released to promote reproducibility.
comment: 2 tables, 5 figures
♻ ☆ Caption-Driven Explainability: Probing CNNs for Bias via CLIP ICIP 2025
Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the models prediction. This explanation minimizes the risk of the standalone model falling for a covariate shift and contributes significantly towards developing robust ML models. Our code is available at https://github.com/patch0816/caption-driven-xai
comment: Accepted and presented at the IEEE ICIP 2025 Satellite Workshop "Generative AI for World Simulations and Communications & Celebrating 40 Years of Excellence in Education: Honoring Professor Aggelos Katsaggelos", Anchorage, Alaska, USA, September 14, 2025. Camera-ready preprint; the official IEEE Xplore publication will follow. Code: https://github.com/patch0816/caption-driven-xai
♻ ☆ Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers NeurIPS 2025
This paper presents Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into latent space, which inevitably introduces \textit{flying pixels} at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) Semantics-Prompted Diffusion Transformers (SP-DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) Cascade DiT Design that progressively increases the number of tokens to further enhance efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation.
comment: NeurIPS 2025. Project page: https://pixel-perfect-depth.github.io/
♻ ☆ WaMaIR: Image Restoration via Multiscale Wavelet Convolutions and Mamba-based Channel Modeling with Texture Enhancement
Image restoration is a fundamental and challenging task in computer vision, where CNN-based frameworks demonstrate significant computational efficiency. However, previous CNN-based methods often face challenges in adequately restoring fine texture details, which are limited by the small receptive field of CNN structures and the lack of channel feature modeling. In this paper, we propose WaMaIR, which is a novel framework with a large receptive field for image perception and improves the reconstruction of texture details in restored images. Specifically, we introduce the Global Multiscale Wavelet Transform Convolutions (GMWTConvs) for expandding the receptive field to extract image features, preserving and enriching texture features in model inputs. Meanwhile, we propose the Mamba-Based Channel-Aware Module (MCAM), explicitly designed to capture long-range dependencies within feature channels, which enhancing the model sensitivity to color, edges, and texture information. Additionally, we propose Multiscale Texture Enhancement Loss (MTELoss) for image restoration to guide the model in preserving detailed texture structures effectively. Extensive experiments confirm that WaMaIR outperforms state-of-the-art methods, achieving better image restoration and efficient computational performance of the model.
comment: Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Oral
♻ ☆ Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process-where people skip reasoning for easy questions but think carefully when needed-we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO, without sacrificing performance or even improving it. Further evaluations across LLM (GSM8K), VLM (CLEVR, Super-CLEVR, GeoQA), and Agentic (AITZ) tasks-covering a range of reasoning difficulties under both 3B and 7B models-consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in RL approaches. Our code is available at https://github.com/kokolerk/TON.
comment: camera ready revision
♻ ☆ Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum NeurIPS 2025
Self-Supervised Learning (SSL) has become a powerful solution to extract rich representations from unlabeled data. Yet, SSL research is mostly focused on clean, curated and high-quality datasets. As a result, applying SSL on noisy data remains a challenge, despite being crucial to applications such as astrophysics, medical imaging, geophysics or finance. In this work, we present a fully self-supervised framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning. Our method first trains an SSL denoiser on noisy data, then uses it to construct a denoised-to-noisy data curriculum (i.e., training first on denoised, then noisy samples) for pretraining a SSL backbone (e.g., DINOv2), combined with a teacher-guided regularization that anchors noisy embeddings to their denoised counterparts. This process encourages the model to internalize noise robustness. Notably, the denoiser can be discarded after pretraining, simplifying deployment. On ImageNet-1k with ViT-B under extreme Gaussian noise ($\sigma=255$, SNR = 0.72 dB), our method improves linear probing accuracy by 4.8% over DINOv2, demonstrating that denoiser-free robustness can emerge from noise-aware pretraining. The code is available at https://github.com/wenquanlu/noisy_dinov2.
comment: NeurIPS 2025
♻ ☆ Dynamic Traceback Learning for Medical Report Generation
Automated medical report generation has demonstrated the potential to significantly reduce the workload associated with time-consuming medical reporting. Recent generative representation learning methods have shown promise in integrating vision and language modalities for medical report generation. However, when trained end-to-end and applied directly to medical image-to-text generation, they face two significant challenges: i) difficulty in accurately capturing subtle yet crucial pathological details, and ii) reliance on both visual and textual inputs during inference, leading to performance degradation in zero-shot inference when only images are available. To address these challenges, this study proposes a novel multimodal dynamic traceback learning framework (DTrace). Specifically, we introduce a traceback mechanism to supervise the semantic validity of generated content and a dynamic learning strategy to adapt to various proportions of image and text input, enabling text generation without strong reliance on the input from both modalities during inference. The learning of cross-modal knowledge is enhanced by supervising the model to recover masked semantic information from a complementary counterpart. Extensive experiments conducted on two benchmark datasets, IU-Xray and MIMIC-CXR, demonstrate that the proposed DTrace framework outperforms state-of-the-art methods for medical report generation.
comment: Accepted to IEEE Transactions on Multimedia (TMM)
♻ ☆ Learning World Models for Interactive Video Generation
Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanism leads to incoherence of world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models. In contrast, naive autoregressive generation with extended context windows and retrieval-augmented generation prove less effective for video generation, primarily due to the limited in-context learning capabilities of current video models. Our work illuminates the fundamental challenges in video world models and establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.
comment: Project page: https://sites.google.com/view/vrag
♻ ☆ OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models
We consider the problem of text-to-video generation tasks with precise control for various applications such as camera movement control and video-to-video editing. Most methods tacking this problem rely on providing user-defined controls, such as binary masks or camera movement embeddings. In our approach we propose OnlyFlow, an approach leveraging the optical flow firstly extracted from an input video to condition the motion of generated videos. Using a text prompt and an input video, OnlyFlow allows the user to generate videos that respect the motion of the input video as well as the text prompt. This is implemented through an optical flow estimation model applied on the input video, which is then fed to a trainable optical flow encoder. The output feature maps are then injected into the text-to-video backbone model. We perform quantitative, qualitative and user preference studies to show that OnlyFlow positively compares to state-of-the-art methods on a wide range of tasks, even though OnlyFlow was not specifically trained for such tasks. OnlyFlow thus constitutes a versatile, lightweight yet efficient method for controlling motion in text-to-video generation. Models and code will be made available on GitHub and HuggingFace.
comment: 8 pages, 1 supplementary page, 9 figures
♻ ☆ Quality-Aware Prototype Memory for Face Representation Learning
Prototype Memory is a powerful model for face representation learning. It enables training face recognition models on datasets of any size by generating prototypes (classifier weights) on the fly and efficiently utilizing them. Prototype Memory demonstrated strong results in many face recognition benchmarks. However, the algorithm of prototype generation, used in it, is prone to the problems of imperfectly calculated prototypes in case of low-quality or poorly recognizable faces in the images, selected for the prototype creation. All images of the same person presented in the mini-batch are used with equal weights, and the resulting averaged prototype can be contaminated by imperfect embeddings of low-quality face images. This may lead to misleading training signals and degrade the performance of the trained models. In this paper, we propose a simple and effective way to improve Prototype Memory with quality-aware prototype generation. Quality-Aware Prototype Memory uses different weights for images of different quality in the process of prototype generation. With this improvement, prototypes receive more informative signals from high-quality images and are less affected by low-quality ones. We propose and compare several methods of quality estimation and usage, perform extensive experiments on the different face recognition benchmarks and demonstrate the advantages of the proposed model compared to the basic version of Prototype Memory.
comment: Preprint
♻ ☆ GenIR: Generative Visual Feedback for Mental Image Retrieval NeurIPS 2025
Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind. That is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system's understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction
comment: NeurIPS 2025
♻ ☆ MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory ACM MM '25
Recent advances in vision-language models have enabled rich semantic understanding across modalities. However, these encoding methods lack the ability to interpret or reason about the moral dimensions of content-a crucial aspect of human cognition. In this paper, we address this gap by introducing MoralCLIP, a novel embedding representation method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). Our approach integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. MoralCLIP is grounded on the multi-label dataset Social-Moral Image Database to identify co-occurring moral foundations in visual content. For MoralCLIP training, we design a moral data augmentation strategy to scale our annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Our results demonstrate that explicit moral supervision improves both unimodal and multimodal understanding of moral content, establishing a foundation for morally-aware AI systems capable of recognizing and aligning with human moral values.
comment: Updated version: corresponds to the ACM MM '25 published paper and includes full appendix material
♻ ☆ CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data
True intelligence hinges on the ability to uncover and leverage hidden causal relations. Despite significant progress in AI and computer vision (CV), there remains a lack of benchmarks for assessing models' abilities to infer latent causality from complex visual data. In this paper, we introduce \textsc{\textbf{Causal3D}}, a novel and comprehensive benchmark that integrates structured data (tables) with corresponding visual representations (images) to evaluate causal reasoning. Designed within a systematic framework, Causal3D comprises 19 3D-scene datasets capturing diverse causal relations, views, and backgrounds, enabling evaluations across scenes of varying complexity. We assess multiple state-of-the-art methods, including classical causal discovery, causal representation learning, and large/vision-language models (LLMs/VLMs). Our experiments show that as causal structures grow more complex without prior knowledge, performance declines significantly, highlighting the challenges even advanced methods face in complex causal scenarios. Causal3D serves as a vital resource for advancing causal reasoning in CV and fostering trustworthy AI in critical domains.
comment: Datasets link: https://huggingface.co/datasets/LLDDSS/Causal3D_Dataset
♻ ☆ LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering NeurIPS 2025
In this work, we present a novel level-of-detail (LOD) method for 3D Gaussian Splatting that enables real-time rendering of large-scale scenes on memory-constrained devices. Our approach introduces a hierarchical LOD representation that iteratively selects optimal subsets of Gaussians based on camera distance, thus largely reducing both rendering time and GPU memory usage. We construct each LOD level by applying a depth-aware 3D smoothing filter, followed by importance-based pruning and fine-tuning to maintain visual fidelity. To further reduce memory overhead, we partition the scene into spatial chunks and dynamically load only relevant Gaussians during rendering, employing an opacity-blending mechanism to avoid visual artifacts at chunk boundaries. Our method achieves state-of-the-art performance on both outdoor (Hierarchical 3DGS) and indoor (Zip-NeRF) datasets, delivering high-quality renderings with reduced latency and memory requirements.
comment: NeurIPS 2025; Web: https://lodge-gs.github.io/
♻ ☆ NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods NeurIPS 2025
Novel view synthesis is an important problem with many applications, including AR/VR, gaming, and robotic simulations. With the recent rapid development of Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) methods, it is becoming difficult to keep track of the current state of the art (SoTA) due to methods using different evaluation protocols, codebases being difficult to install and use, and methods not generalizing well to novel 3D scenes. In our experiments, we show that even tiny differences in the evaluation protocols of various methods can artificially boost the performance of these methods. This raises questions about the validity of quantitative comparisons performed in the literature. To address these questions, we propose NerfBaselines, an evaluation framework which provides consistent benchmarking tools, ensures reproducibility, and simplifies the installation and use of various methods. We validate our implementation experimentally by reproducing the numbers reported in the original papers. For improved accessibility, we release a web platform that compares commonly used methods on standard benchmarks. We strongly believe NerfBaselines is a valuable contribution to the community as it ensures that quantitative results are comparable and thus truly measure progress in the field of novel view synthesis.
comment: NeurIPS 2025 D&B; Web: https://jkulhanek.com/nerfbaselines
♻ ☆ VC4VG: Optimizing Video Captions for Text-to-Video Generation EMNLP 2025
Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/alimama-creative/VC4VG to support further research.
comment: Accepted by EMNLP 2025
♻ ☆ KARMA: Efficient Structural Defect Segmentation via Kolmogorov-Arnold Representation Learning
Semantic segmentation of structural defects in civil infrastructure remains challenging due to variable defect appearances, harsh imaging conditions, and significant class imbalance. Current deep learning methods, despite their effectiveness, typically require millions of parameters, rendering them impractical for real-time inspection systems. We introduce KARMA (Kolmogorov-Arnold Representation Mapping Architecture), a highly efficient semantic segmentation framework that models complex defect patterns through compositions of one-dimensional functions rather than conventional convolutions. KARMA features three technical innovations: (1) a parameter-efficient Tiny Kolmogorov-Arnold Network (TiKAN) module leveraging low-rank factorization for KAN-based feature transformation; (2) an optimized feature pyramid structure with separable convolutions for multi-scale defect analysis; and (3) a static-dynamic prototype mechanism that enhances feature representation for imbalanced classes. Extensive experiments on benchmark infrastructure inspection datasets demonstrate that KARMA achieves competitive or superior mean IoU performance compared to state-of-the-art approaches, while using significantly fewer parameters (0.959M vs. 31.04M, a 97% reduction). Operating at 0.264 GFLOPS, KARMA maintains inference speeds suitable for real-time deployment, enabling practical automated infrastructure inspection systems without compromising accuracy. The source code can be accessed at the following URL: https://github.com/faeyelab/karma.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ Neural Atlas Graphs for Dynamic Scene Decomposition and Editing
Learning editable high-resolution scene representations for dynamic scenes is an open problem with applications across the domains from autonomous driving to creative editing - the most successful approaches today make a trade-off between editability and supporting scene complexity: neural atlases represent dynamic scenes as two deforming image layers, foreground and background, which are editable in 2D, but break down when multiple objects occlude and interact. In contrast, scene graph models make use of annotated data such as masks and bounding boxes from autonomous-driving datasets to capture complex 3D spatial relationships, but their implicit volumetric node representations are challenging to edit view-consistently. We propose Neural Atlas Graphs (NAGs), a hybrid high-resolution scene representation, where every graph node is a view-dependent neural atlas, facilitating both 2D appearance editing and 3D ordering and positioning of scene elements. Fit at test-time, NAGs achieve state-of-the-art quantitative results on the Waymo Open Dataset - by 5 dB PSNR increase compared to existing methods - and make environmental editing possible in high resolution and visual quality - creating counterfactual driving scenarios with new backgrounds and edited vehicle appearance. We find that the method also generalizes beyond driving scenes and compares favorably - by more than 7 dB in PSNR - to recent matting and video editing baselines on the DAVIS video dataset with a diverse set of human and animal-centric scenes. Project Page: https://princeton-computational-imaging.github.io/nag/
♻ ☆ Functional correspondence by matrix completion CVPR 2015
In this paper, we consider the problem of finding dense intrinsic correspondence between manifolds using the recently introduced functional framework. We pose the functional correspondence problem as matrix completion with manifold geometric structure and inducing functional localization with the $L_1$ norm. We discuss efficient numerical procedures for the solution of our problem. Our method compares favorably to the accuracy of state-of-the-art correspondence algorithms on non-rigid shape matching benchmarks, and is especially advantageous in settings when only scarce data is available.
comment: "Functional Correspondence by Matrix Completion" (CVPR 2015): This paper, presented at one of the world's top AI conferences, is almost entirely fabricated, and its results are not reproducible
Artificial Intelligence 227
☆ Gaperon: A Peppered English-French Generative Language Model Suite
We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models trained on 2-4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Through this work, we study how data filtering and contamination interact to shape both benchmark and generative performance. We find that filtering for linguistic quality enhances text fluency and coherence but yields subpar benchmark results, and that late deliberate contamination -- continuing training on data mixes that include test sets -- recovers competitive scores while only reasonably harming generation quality. We discuss how usual neural filtering can unintentionally amplify benchmark leakage. To support further research, we also introduce harmless data poisoning during pretraining, providing a realistic testbed for safety studies. By openly releasing all models, datasets, code, and checkpoints, Gaperon establishes a reproducible foundation for exploring the trade-offs between data curation, evaluation, safety, and openness in multilingual language model development.
☆ E-Scores for (In)Correctness Assessment of Generative Model Outputs
While generative models, especially large language models (LLMs), are ubiquitous in today's world, principled mechanisms to assess their (in)correctness are limited. Using the conformal prediction framework, previous works construct sets of LLM responses where the probability of including an incorrect response, or error, is capped at a desired user-defined tolerance level. However, since these methods are based on p-values, they are susceptible to p-hacking, i.e., choosing the tolerance level post-hoc can invalidate the guarantees. We therefore leverage e-values to complement generative model outputs with e-scores as a measure of incorrectness. In addition to achieving the same statistical guarantees as before, e-scores provide users flexibility in adaptively choosing tolerance levels after observing the e-scores themselves, by upper bounding a post-hoc notion of error called size distortion. We experimentally demonstrate their efficacy in assessing LLM outputs for different correctness types: mathematical factuality and property constraints satisfaction.
☆ TheraMind: A Strategic and Adaptive Agent for Longitudinal Psychological Counseling
Large language models (LLMs) in psychological counseling have attracted increasing attention. However, existing approaches often lack emotional understanding, adaptive strategies, and the use of therapeutic methods across multiple sessions with long-term memory, leaving them far from real clinical practice. To address these critical gaps, we introduce TheraMind, a strategic and adaptive agent for longitudinal psychological counseling. The cornerstone of TheraMind is a novel dual-loop architecture that decouples the complex counseling process into an Intra-Session Loop for tactical dialogue management and a Cross-Session Loop for strategic therapeutic planning. The Intra-Session Loop perceives the patient's emotional state to dynamically select response strategies while leveraging cross-session memory to ensure continuity. Crucially, the Cross-Session Loop empowers the agent with long-term adaptability by evaluating the efficacy of the applied therapy after each session and adjusting the method for subsequent interactions. We validate our approach in a high-fidelity simulation environment grounded in real clinical cases. Extensive evaluations show that TheraMind outperforms other methods, especially on multi-session metrics like Coherence, Flexibility, and Therapeutic Attunement, validating the effectiveness of its dual-loop design in emulating strategic, adaptive, and longitudinal therapeutic behavior. The code is publicly available at https://0mwwm0.github.io/TheraMind/.
☆ Task Completion Agents are Not Ideal Collaborators
Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent's utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.
comment: 22 pages, 5 figures, 3 tables
☆ The Limits of Obliviate: Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework
Unlearning in large language models (LLMs) is crucial for managing sensitive data and correcting misinformation, yet evaluating its effectiveness remains an open problem. We investigate whether persuasive prompting can recall factual knowledge from deliberately unlearned LLMs across models ranging from 2.7B to 13B parameters (OPT-2.7B, LLaMA-2-7B, LLaMA-3.1-8B, LLaMA-2-13B). Drawing from ACT-R and Hebbian theory (spreading activation theories), as well as communication principles, we introduce Stimulus-Knowledge Entanglement-Behavior Framework (SKeB), which models information entanglement via domain graphs and tests whether factual recall in unlearned models is correlated with persuasive framing. We develop entanglement metrics to quantify knowledge activation patterns and evaluate factuality, non-factuality, and hallucination in outputs. Our results show persuasive prompts substantially enhance factual knowledge recall (14.8% baseline vs. 24.5% with authority framing), with effectiveness inversely correlated to model size (128% recovery in 2.7B vs. 15% in 13B). SKeB provides a foundation for assessing unlearning completeness, robustness, and overall behavior in LLMs.
comment: 14 pages, 11 figures
☆ LieSolver: A PDE-constrained solver for IBVPs using Lie symmetries
We introduce a method for efficiently solving initial-boundary value problems (IBVPs) that uses Lie symmetries to enforce the associated partial differential equation (PDE) exactly by construction. By leveraging symmetry transformations, the model inherently incorporates the physical laws and learns solutions from initial and boundary data. As a result, the loss directly measures the model's accuracy, leading to improved convergence. Moreover, for well-posed IBVPs, our method enables rigorous error estimation. The approach yields compact models, facilitating an efficient optimization. We implement LieSolver and demonstrate its application to linear homogeneous PDEs with a range of initial conditions, showing that it is faster and more accurate than physics-informed neural networks (PINNs). Overall, our method improves both computational efficiency and the reliability of predictions for PDE-constrained problems.
☆ Physics-Guided Conditional Diffusion Networks for Microwave Image Reconstruction
A conditional latent-diffusion based framework for solving the electromagnetic inverse scattering problem associated with microwave imaging is introduced. This generative machine-learning model explicitly mirrors the non-uniqueness of the ill-posed inverse problem. Unlike existing inverse solvers utilizing deterministic machine learning techniques that produce a single reconstruction, the proposed latent-diffusion model generates multiple plausible permittivity maps conditioned on measured scattered-field data, thereby generating several potential instances in the range-space of the non-unique inverse mapping. A forward electromagnetic solver is integrated into the reconstruction pipeline as a physics-based evaluation mechanism. The space of candidate reconstructions form a distribution of possibilities consistent with the conditioning data and the member of this space yielding the lowest scattered-field data discrepancy between the predicted and measured scattered fields is reported as the final solution. Synthetic and experimental labeled datasets are used for training and evaluation of the model. An innovative labeled synthetic dataset is created that exemplifies a varied set of scattering features. Training of the model using this new dataset produces high quality permittivity reconstructions achieving improved generalization with excellent fidelity to shape recognition. The results highlight the potential of hybrid generative physics frameworks as a promising direction for robust, data-driven microwave imaging.
☆ The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we may have revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, requiring interacting with multiple Apps over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool calling turns on average, while the top open-weights model DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.
comment: Website: https://toolathlon.xyz/
☆ BambooKG: A Neurobiologically-inspired Frequency-Weight Knowledge Graph
Retrieval-Augmented Generation allows LLMs to access external knowledge, reducing hallucinations and ageing-data issues. However, it treats retrieved chunks independently and struggles with multi-hop or relational reasoning, especially across documents. Knowledge graphs enhance this by capturing the relationships between entities using triplets, enabling structured, multi-chunk reasoning. However, these tend to miss information that fails to conform to the triplet structure. We introduce BambooKG, a knowledge graph with frequency-based weights on non-triplet edges which reflect link strength, drawing on the Hebbian principle of "fire together, wire together". This decreases information loss and results in improved performance on single- and multi-hop reasoning, outperforming the existing solutions.
☆ Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents
Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup-planning, perception-driven error diagnosis, feedback-driven repair, and action to execute final environment configuration. Our task instances are automatically constructed by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, Enconda-bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.
☆ Graph Network-based Structural Simulator: Graph Neural Networks for Structural Dynamics
Graph Neural Networks (GNNs) have recently been explored as surrogate models for numerical simulations. While their applications in computational fluid dynamics have been investigated, little attention has been given to structural problems, especially for dynamic cases. To address this gap, we introduce the Graph Network-based Structural Simulator (GNSS), a GNN framework for surrogate modeling of dynamic structural problems. GNSS follows the encode-process-decode paradigm typical of GNN-based machine learning models, and its design makes it particularly suited for dynamic simulations thanks to three key features: (i) expressing node kinematics in node-fixed local frames, which avoids catastrophic cancellation in finite-difference velocities; (ii) employing a sign-aware regression loss, which reduces phase errors in long rollouts; and (iii) using a wavelength-informed connectivity radius, which optimizes graph construction. We evaluate GNSS on a case study involving a beam excited by a 50kHz Hanning-modulated pulse. The results show that GNSS accurately reproduces the physics of the problem over hundreds of timesteps and generalizes to unseen loading conditions, where existing GNNs fail to converge or deliver meaningful predictions. Compared with explicit finite element baselines, GNSS achieves substantial inference speedups while preserving spatial and temporal fidelity. These findings demonstrate that locality-preserving GNNs with physics-consistent update rules are a competitive alternative for dynamic, wave-dominated structural simulations.
comment: 16 pages, 14 figures
☆ Navigation in a Three-Dimensional Urban Flow using Deep Reinforcement Learning
Unmanned Aerial Vehicles (UAVs) are increasingly populating urban areas for delivery and surveillance purposes. In this work, we develop an optimal navigation strategy based on Deep Reinforcement Learning. The environment is represented by a three-dimensional high-fidelity simulation of an urban flow, characterized by turbulence and recirculation zones. The algorithm presented here is a flow-aware Proximal Policy Optimization (PPO) combined with a Gated Transformer eXtra Large (GTrXL) architecture, giving the agent richer information about the turbulent flow field in which it navigates. The results are compared with a PPO+GTrXL without the secondary prediction tasks, a PPO combined with Long Short Term Memory (LSTM) cells and a traditional navigation algorithm. The obtained results show a significant increase in the success rate (SR) and a lower crash rate (CR) compared to a PPO+LSTM, PPO+GTrXL and the classical Zermelo's navigation algorithm, paving the way to a completely reimagined UAV landscape in complex urban environments.
☆ ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents
Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses the page by index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule-based cross-level reward that provides both turn- and token-level signals. To address the empirically observed training instability caused by numerous visual tokens from long documents, we further propose a visual-semantic anchoring mechanism that applies a dual-path KL-divergence constraint to stabilize visual and textual representations separately during training. Trained on a corpus constructed from three open-source datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks. Overall, ALDEN marks a step beyond passive document reading toward agents that autonomously navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long-document understanding.
☆ User Misconceptions of LLM-Based Conversational Programming Assistants
Programming assistants powered by large language models (LLMs) have become widely available, with conversational assistants like ChatGPT proving particularly accessible to less experienced programmers. However, the varied capabilities of these tools across model versions and the mixed availability of extensions that enable web search, code execution, or retrieval-augmented generation create opportunities for user misconceptions about what systems can and cannot do. Such misconceptions may lead to over-reliance, unproductive practices, or insufficient quality control in LLM-assisted programming. Here, we aim to characterize misconceptions that users of conversational LLM-based assistants may have in programming contexts. Using a two-phase approach, we first brainstorm and catalog user misconceptions that may occur, and then conduct a qualitative analysis to examine whether these conceptual issues surface in naturalistic Python-programming conversations with an LLM-based chatbot drawn from an openly available dataset. Indeed, we see evidence that some users have misplaced expectations about the availability of LLM-based chatbot features like web access, code execution, or non-text output generation. We also see potential evidence for deeper conceptual issues around the scope of information required to debug, validate, and optimize programs. Our findings reinforce the need for designing LLM-based tools that more clearly communicate their programming capabilities to users.
☆ Subgraph Federated Learning via Spectral Methods NeurIPS
We consider the problem of federated learning (FL) with graph-structured data distributed across multiple clients. In particular, we address the prevalent scenario of interconnected subgraphs, where interconnections between clients significantly influence the learning process. Existing approaches suffer from critical limitations, either requiring the exchange of sensitive node embeddings, thereby posing privacy risks, or relying on computationally-intensive steps, which hinders scalability. To tackle these challenges, we propose FedLap, a novel framework that leverages global structure information via Laplacian smoothing in the spectral domain to effectively capture inter-node dependencies while ensuring privacy and scalability. We provide a formal analysis of the privacy of FedLap, demonstrating that it preserves privacy. Notably, FedLap is the first subgraph FL scheme with strong privacy guarantees. Extensive experiments on benchmark datasets demonstrate that FedLap achieves competitive or superior utility compared to existing techniques.
comment: To be presented at The Annual Conference on Neural Information Processing Systems (NeurIPS) 2025
☆ Learning to Plan & Schedule with Reinforcement-Learned Bimanual Robot Skills
Long-horizon contact-rich bimanual manipulation presents a significant challenge, requiring complex coordination involving a mixture of parallel execution and sequential collaboration between arms. In this paper, we introduce a hierarchical framework that frames this challenge as an integrated skill planning & scheduling problem, going beyond purely sequential decision-making to support simultaneous skill invocation. Our approach is built upon a library of single-arm and bimanual primitive skills, each trained using Reinforcement Learning (RL) in GPU-accelerated simulation. We then train a Transformer-based planner on a dataset of skill compositions to act as a high-level scheduler, simultaneously predicting the discrete schedule of skills as well as their continuous parameters. We demonstrate that our method achieves higher success rates on complex, contact-rich tasks than end-to-end RL approaches and produces more efficient, coordinated behaviors than traditional sequential-only planners.
☆ Are Language Models Efficient Reasoners? A Perspective from Logic Programming NeurIPS 2025
Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language -- as generated by an LM -- with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with various number of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions -- even with minimal, domain-consistent distractions -- and the proofs they generate frequently exhibit detours through irrelevant inferences.
comment: Accepted to NeurIPS 2025
☆ FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering
The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high-stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the Persian-speaking Muslim community, where accuracy and trustworthiness are paramount. Existing Retrieval-Augmented Generation (RAG) systems, relying on simplistic single-pass pipelines, fall short on complex, multi-hop queries requiring multi-step reasoning and evidence aggregation. To address this gap, we introduce FARSIQA, a novel, end-to-end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG. FAIR-RAG employs a dynamic, self-correcting process: it adaptively decomposes complex queries, assesses evidence sufficiency, and enters an iterative loop to generate sub-queries, progressively filling information gaps. Operating on a curated knowledge base of over one million authoritative Islamic documents, FARSIQA demonstrates superior performance. Rigorous evaluation on the challenging IslamicPCQA benchmark shows state-of-the-art performance: the system achieves a remarkable 97.0% in Negative Rejection - a 40-point improvement over baselines - and a high Answer Correctness score of 74.3%. Our work establishes a new standard for Persian Islamic QA and validates that our iterative, adaptive architecture is crucial for building faithful, reliable AI systems in sensitive domains.
comment: 37 pages, 5 figures, 10 tables. Keywords: Retrieval-Augmented Generation (RAG), Question Answering (QA), Islamic Knowledge Base, Faithful AI, Persian NLP, Multi-hop Reasoning, Large Language Models (LLMs)
☆ Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA's hidden representations and analyze attention maps, further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
comment: 13 pages, 6 figures
☆ Counterfactual-based Agent Influence Ranker for Agentic AI Workflows EMNLP 2025
An Agentic AI Workflow (AAW), also known as an LLM-based multi-agent system, is an autonomous system that assembles several LLM-based agents to work collaboratively towards a shared goal. The high autonomy, widespread adoption, and growing interest in such AAWs highlight the need for a deeper understanding of their operations, from both quality and security aspects. To this day, there are no existing methods to assess the influence of each agent on the AAW's final output. Adopting techniques from related fields is not feasible since existing methods perform only static structural analysis, which is unsuitable for inference time execution. We present Counterfactual-based Agent Influence Ranker (CAIR) - the first method for assessing the influence level of each agent on the AAW's output and determining which agents are the most influential. By performing counterfactual analysis, CAIR provides a task-agnostic analysis that can be used both offline and at inference time. We evaluate CAIR using an AAWs dataset of our creation, containing 30 different use cases with 230 different functionalities. Our evaluation showed that CAIR produces consistent rankings, outperforms baseline methods, and can easily enhance the effectiveness and relevancy of downstream tasks.
comment: Accepted to EMNLP 2025, 27 pages, 6 figures
☆ BOLT-GAN: Bayes-Optimal Loss for Stable GAN Training
We introduce BOLT-GAN, a simple yet effective modification of the WGAN framework inspired by the Bayes Optimal Learning Threshold (BOLT). We show that with a Lipschitz continuous discriminator, BOLT-GAN implicitly minimizes a different metric distance than the Earth Mover (Wasserstein) distance and achieves better training stability. Empirical evaluations on four standard image generation benchmarks (CIFAR-10, CelebA-64, LSUN Bedroom-64, and LSUN Church-64) show that BOLT-GAN consistently outperforms WGAN, achieving 10-60% lower Frechet Inception Distance (FID). Our results suggest that BOLT is a broadly applicable principle for enhancing GAN training.
☆ INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage , though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
☆ Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry ICML 2025
While Large Language Model (LLM) agents are often approached from the angle of action planning/generation to accomplish a goal (e.g., given by language descriptions), their abilities to collaborate with each other to achieve a joint goal are not well explored. To address this limitation, this paper studies LLM agents in task collaboration, particularly under the condition of information asymmetry, where agents have disparities in their knowledge and skills and need to work together to complete a shared task. We extend Einstein Puzzles, a classical symbolic puzzle, to a table-top game. In this game, two LLM agents must reason, communicate, and act to satisfy spatial and relational constraints required to solve the puzzle. We apply a fine-tuning-plus-verifier framework in which LLM agents are equipped with various communication strategies and verification signals from the environment. Empirical results highlight the critical importance of aligned communication, especially when agents possess both information-seeking and -providing capabilities. Interestingly, agents without communication can still achieve high task performance; however, further analysis reveals a lack of true rule understanding and lower trust from human evaluators. Instead, by integrating an environment-based verifier, we enhance agents' ability to comprehend task rules and complete tasks, promoting both safer and more interpretable collaboration in AI systems. https://github.com/Roihn/EinsteinPuzzles
comment: Workshop on Multi-Agent System @ ICML 2025
☆ RegionE: Adaptive Region-Aware Generation for Efficient Image Editing
Recently, instruction-based image editing (IIE) has received widespread attention. In practice, IIE often modifies only specific regions of an image, while the remaining areas largely remain unchanged. Although these two types of regions differ significantly in generation difficulty and computational redundancy, existing IIE models do not account for this distinction, instead applying a uniform generation process across the entire image. This motivates us to propose RegionE, an adaptive, region-aware generation framework that accelerates IIE tasks without additional training. Specifically, the RegionE framework consists of three main components: 1) Adaptive Region Partition. We observed that the trajectory of unedited regions is straight, allowing for multi-step denoised predictions to be inferred in a single step. Therefore, in the early denoising stages, we partition the image into edited and unedited regions based on the difference between the final estimated result and the reference image. 2) Region-Aware Generation. After distinguishing the regions, we replace multi-step denoising with one-step prediction for unedited areas. For edited regions, the trajectory is curved, requiring local iterative denoising. To improve the efficiency and quality of local iterative generation, we propose the Region-Instruction KV Cache, which reduces computational cost while incorporating global information. 3) Adaptive Velocity Decay Cache. Observing that adjacent timesteps in edited regions exhibit strong velocity similarity, we further propose an adaptive velocity decay cache to accelerate the local denoising process. We applied RegionE to state-of-the-art IIE base models, including Step1X-Edit, FLUX.1 Kontext, and Qwen-Image-Edit. RegionE achieved acceleration factors of 2.57, 2.41, and 2.06. Evaluations by GPT-4o confirmed that semantic and perceptual fidelity were well preserved.
comment: 26 pages, 10 figures, 18 tables
☆ Standardization of Psychiatric Diagnoses -- Role of Fine-tuned LLM Consortium and OpenAI-gpt-oss Reasoning LLM Enabled Decision Support System
The diagnosis of most mental disorders, including psychiatric evaluations, primarily depends on dialogues between psychiatrists and patients. This subjective process can lead to variability in diagnoses across clinicians and patients, resulting in inconsistencies and challenges in achieving reliable outcomes. To address these issues and standardize psychiatric diagnoses, we propose a Fine-Tuned Large Language Model (LLM) Consortium and OpenAI-gpt-oss Reasoning LLM-enabled Decision Support System for the clinical diagnosis of mental disorders. Our approach leverages fine-tuned LLMs trained on conversational datasets involving psychiatrist-patient interactions focused on mental health conditions (e.g., depression). The diagnostic predictions from individual models are aggregated through a consensus-based decision-making process, refined by the OpenAI-gpt-oss reasoning LLM. We propose a novel method for deploying LLM agents that orchestrate communication between the LLM consortium and the reasoning LLM, ensuring transparency, reliability, and responsible AI across the entire diagnostic workflow. Experimental results demonstrate the transformative potential of combining fine-tuned LLMs with a reasoning model to create a robust and highly accurate diagnostic system for mental health assessment. A prototype of the proposed platform, integrating three fine-tuned LLMs with the OpenAI-gpt-oss reasoning LLM, was developed in collaboration with the U.S. Army Medical Research Team in Norfolk, Virginia, USA. To the best of our knowledge, this work represents the first application of a fine-tuned LLM consortium integrated with a reasoning LLM for clinical mental health diagnosis paving the way for next-generation AI-powered eHealth systems aimed at standardizing psychiatric diagnoses.
☆ Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models LREC 2026
Recent advances in speech foundation models (SFMs) have enabled the direct processing of spoken language from raw audio, bypassing intermediate textual representations. This capability allows SFMs to be exposed to, and potentially respond to, rich paralinguistic variations embedded in the input speech signal. One under-explored dimension of paralinguistic variation is voice quality, encompassing phonation types such as creaky and breathy voice. These phonation types are known to influence how listeners infer affective state, stance and social meaning in speech. Existing benchmarks for speech understanding largely rely on multiple-choice question answering (MCQA) formats, which are prone to failure and therefore unreliable in capturing the nuanced ways paralinguistic features influence model behaviour. In this paper, we probe SFMs through open-ended generation tasks and speech emotion recognition, evaluating whether model behaviours are consistent across different phonation inputs. We introduce a new parallel dataset featuring synthesized modifications to voice quality, designed to evaluate SFM responses to creaky and breathy voice. Our work provides the first examination of SFM sensitivity to these particular non-lexical aspects of speech perception.
comment: 8 pages, 3 figures, 4 tables, submitted to LREC 2026
☆ Leveraging an Atmospheric Foundational Model for Subregional Sea Surface Temperature Forecasting
The accurate prediction of oceanographic variables is crucial for understanding climate change, managing marine resources, and optimizing maritime activities. Traditional ocean forecasting relies on numerical models; however, these approaches face limitations in terms of computational cost and scalability. In this study, we adapt Aurora, a foundational deep learning model originally designed for atmospheric forecasting, to predict sea surface temperature (SST) in the Canary Upwelling System. By fine-tuning this model with high-resolution oceanographic reanalysis data, we demonstrate its ability to capture complex spatiotemporal patterns while reducing computational demands. Our methodology involves a staged fine-tuning process, incorporating latitude-weighted error metrics and optimizing hyperparameters for efficient learning. The experimental results show that the model achieves a low RMSE of 0.119K, maintaining high anomaly correlation coefficients (ACC $\approx 0.997$). The model successfully reproduces large-scale SST structures but faces challenges in capturing finer details in coastal regions. This work contributes to the field of data-driven ocean forecasting by demonstrating the feasibility of using deep learning models pre-trained in different domains for oceanic applications. Future improvements include integrating additional oceanographic variables, increasing spatial resolution, and exploring physics-informed neural networks to enhance interpretability and understanding. These advancements can improve climate modeling and ocean prediction accuracy, supporting decision-making in environmental and economic sectors.
comment: 18 pages, 9 figures
☆ Hybrid Quantum-Classical Recurrent Neural Networks
We present a hybrid quantum-classical recurrent neural network (QRNN) architecture in which the entire recurrent core is realized as a parametrized quantum circuit (PQC) controlled by a classical feedforward network. The hidden state is the quantum state of an $n$-qubit PQC, residing in an exponentially large Hilbert space $\mathbb{C}^{2^n}$. The PQC is unitary by construction, making the hidden-state evolution norm-preserving without external constraints. At each timestep, mid-circuit readouts are combined with the input embedding and processed by the feedforward network, which provides explicit classical nonlinearity. The outputs parametrize the PQC, which updates the hidden state via unitary dynamics. The QRNN is compact and physically consistent, and it unifies (i) unitary recurrence as a high-capacity memory, (ii) partial observation via mid-circuit measurements, and (iii) nonlinear classical control for input-conditioned parametrization. We evaluate the model in simulation with up to 14 qubits on sentiment analysis, MNIST, permuted MNIST, copying memory, and language modeling, adopting projective measurements as a limiting case to obtain mid-circuit readouts while maintaining a coherent recurrent quantum memory. We further devise a soft attention mechanism over the mid-circuit readouts in a sequence-to-sequence model and show its effectiveness for machine translation. To our knowledge, this is the first model (RNN or otherwise) grounded in quantum operations to achieve competitive performance against strong classical baselines across a broad class of sequence-learning tasks.
☆ Using latent representations to link disjoint longitudinal data for mixed-effects regression
Many rare diseases offer limited established treatment options, leading patients to switch therapies when new medications emerge. To analyze the impact of such treatment switches within the low sample size limitations of rare disease trials, it is important to use all available data sources. This, however, is complicated when usage of measurement instruments change during the observation period, for example when instruments are adapted to specific age ranges. The resulting disjoint longitudinal data trajectories, complicate the application of traditional modeling approaches like mixed-effects regression. We tackle this by mapping observations of each instrument to a aligned low-dimensional temporal trajectory, enabling longitudinal modeling across instruments. Specifically, we employ a set of variational autoencoder architectures to embed item values into a shared latent space for each time point. Temporal disease dynamics and treatment switch effects are then captured through a mixed-effects regression model applied to latent representations. To enable statistical inference, we present a novel statistical testing approach that accounts for the joint parameter estimation of mixed-effects regression and variational autoencoders. The methodology is applied to quantify the impact of treatment switches for patients with spinal muscular atrophy. Here, our approach aligns motor performance items from different measurement instruments for mixed-effects regression and maps estimated effects back to the observed item level to quantify the treatment switch effect. Our approach allows for model selection as well as for assessing effects of treatment switching. The results highlight the potential of modeling in joint latent representations for addressing small data challenges.
comment: 31 pages, 3 figures, 3 tables
☆ Off-policy Reinforcement Learning with Model-based Exploration Augmentation
Exploration is fundamental to reinforcement learning (RL), as it determines how effectively an agent discovers and exploits the underlying structure of its environment to achieve optimal performance. Existing exploration methods generally fall into two categories: active exploration and passive exploration. The former introduces stochasticity into the policy but struggles in high-dimensional environments, while the latter adaptively prioritizes transitions in the replay buffer to enhance exploration, yet remains constrained by limited sample diversity. To address the limitation in passive exploration, we propose Modelic Generative Exploration (MoGE), which augments exploration through the generation of under-explored critical states and synthesis of dynamics-consistent experiences through transition models. MoGE is composed of two components: (1) a diffusion-based generator that synthesizes critical states under the guidance of a utility function evaluating each state's potential influence on policy exploration, and (2) a one-step imagination world model for constructing critical transitions based on the critical states for agent learning. Our method adopts a modular formulation that aligns with the principles of off-policy learning, allowing seamless integration with existing algorithms to improve exploration without altering their core structures. Empirical results on OpenAI Gym and DeepMind Control Suite reveal that MoGE effectively bridges exploration and policy learning, leading to remarkable gains in both sample efficiency and performance across complex control tasks.
☆ Zero Reinforcement Learning Towards General Domains
Zero Reinforcement Learning (Zero-RL) has proven to be an effective approach for enhancing the reasoning capabilities of large language models (LLMs) by directly applying reinforcement learning with verifiable rewards on pretrained models, without the need for a supervised fine-tuning phase. However, current research on zero-RL primarily focuses on domains with easily verifiable reward signals, such as mathematics, programming, and other reasoning tasks. The challenge of eliciting reasoning abilities in more diverse scenarios, where verification is not straightforward, remains underexplored. To address this gap, we propose a novel zero-RL paradigm designed to improve a model's reasoning ability across both verifiable and non-verifiable domains. By combining verifiable rewards with a generative reward model, we conduct multi-task zero-RL training across both domains, facilitating the transfer of reasoning capabilities between them. Furthermore, to mitigate reward hacking in the generative reward model, we design a smooth length penalty that encourages the generation of more comprehensive thinking tokens in general domains. Experimental results on Qwen3-8B-Base and Qwen3-14B-Base demonstrate that our approach achieves superior reasoning performance, not only on tasks requiring extensive reasoning but also on more general tasks.
☆ Comparative Study of UNet-based Architectures for Liver Tumor Segmentation in Multi-Phase Contrast-Enhanced Computed Tomography
Segmentation of liver structures in multi-phase contrast-enhanced computed tomography (CECT) plays a crucial role in computer-aided diagnosis and treatment planning for liver diseases, including tumor detection. In this study, we investigate the performance of UNet-based architectures for liver tumor segmentation, starting from the original UNet and extending to UNet3+ with various backbone networks. We evaluate ResNet, Transformer-based, and State-space (Mamba) backbones, all initialized with pretrained weights. Surprisingly, despite the advances in modern architecture, ResNet-based models consistently outperform Transformer- and Mamba-based alternatives across multiple evaluation metrics. To further improve segmentation quality, we introduce attention mechanisms into the backbone and observe that incorporating the Convolutional Block Attention Module (CBAM) yields the best performance. ResNetUNet3+ with CBAM module not only produced the best overlap metrics with a Dice score of 0.755 and IoU of 0.662, but also achieved the most precise boundary delineation, evidenced by the lowest HD95 distance of 77.911. The model's superiority was further cemented by its leading overall accuracy of 0.925 and specificity of 0.926, showcasing its robust capability in accurately identifying both lesion and healthy tissue. To further enhance interpretability, Grad-CAM visualizations were employed to highlight the region's most influential predictions, providing insights into its decision-making process. These findings demonstrate that classical ResNet architecture, when combined with modern attention modules, remain highly competitive for medical image segmentation tasks, offering a promising direction for liver tumor detection in clinical practice.
comment: 27 pages, 8 figures
☆ Retrieval Augmented Generation (RAG) for Fintech: Agentic Design and Evaluation
Retrieval-Augmented Generation (RAG) systems often face limitations in specialized domains such as fintech, where domain-specific ontologies, dense terminology, and acronyms complicate effective retrieval and synthesis. This paper introduces an agentic RAG architecture designed to address these challenges through a modular pipeline of specialized agents. The proposed system supports intelligent query reformulation, iterative sub-query decomposition guided by keyphrase extraction, contextual acronym resolution, and cross-encoder-based context re-ranking. We evaluate our approach against a standard RAG baseline using a curated dataset of 85 question--answer--reference triples derived from an enterprise fintech knowledge base. Experimental results demonstrate that the agentic RAG system outperforms the baseline in retrieval precision and relevance, albeit with increased latency. These findings suggest that structured, multi-agent methodologies offer a promising direction for enhancing retrieval robustness in complex, domain-specific settings.
comment: Keywords: RAG Agentic AI Fintech NLP KB Domain-Specific Ontology Query Understanding
☆ Predicate Renaming via Large Language Models
In this paper, we address the problem of giving names to predicates in logic rules using Large Language Models (LLMs). In the context of Inductive Logic Programming, various rule generation methods produce rules containing unnamed predicates, with Predicate Invention being a key example. This hinders the readability, interpretability, and reusability of the logic theory. Leveraging recent advancements in LLMs development, we explore their ability to process natural language and code to provide semantically meaningful suggestions for giving a name to unnamed predicates. The evaluation of our approach on some hand-crafted logic rules indicates that LLMs hold potential for this task.
☆ FaCT: Faithful Concept Traces for Explaining Neural Network Decisions NeurIPS 2025
Deep networks have shown remarkable performance across a wide range of tasks, yet getting a global concept-level understanding of how they function remains a key challenge. Many post-hoc concept-based approaches have been introduced to understand their workings, yet they are not always faithful to the model. Further, they make restrictive assumptions on the concepts a model learns, such as class-specificity, small spatial extent, or alignment to human expectations. In this work, we put emphasis on the faithfulness of such concept-based explanations and propose a new model with model-inherent mechanistic concept-explanations. Our concepts are shared across classes and, from any layer, their contribution to the logit and their input-visualization can be faithfully traced. We also leverage foundation models to propose a new concept-consistency metric, C$^2$-Score, that can be used to evaluate concept-based methods. We show that, compared to prior work, our concepts are quantitatively more consistent and users find our concepts to be more interpretable, all while retaining competitive ImageNet performance.
comment: Accepted to NeurIPS 2025; Code is available at https://github.com/m-parchami/FaCT
☆ MTIR-SQL: Multi-turn Tool-Integrated Reasoning Reinforcement Learning for Text-to-SQL
As large language models (LLMs) are increasingly used in Text-to-SQL tasks, Reinforcement Learning (RL) has become a common method for improving performance. Existing methods primarily rely on static execution feedback, which restricts real-time error correction. However, integrating multi-turn tool invocation along with dynamic feedback could significantly improve adaptability and robustness, ultimately enhancing model performance. To address these issues, we propose MTIR-SQL, an innovative Multi-turn Tool-Integrated Reasoning reinforcement learning framework for Text-to-SQL. Our approach introduces an execution-aware multi-turn reasoning paradigm that seamlessly incorporates database execution feedback at each reasoning step, enabling context-sensitive query generation and progressive refinement throughout the reasoning process. The framework extends the GRPO algorithm to accommodate complex multi-turn interaction scenarios. Considering the training instability characteristics of MTIR and the potential for significant Deviation of model distribution from the initial model, we enhance the GRPO algorithm by adding a trajectory filtering mechanism and removing KL loss constraints. Experimental results demonstrate that MTIR-SQL, with 4B parameters, achieves \textbf{64.4}\% accuracy in the BIRD Dev and 84.6% execution accuracy in the SPIDER Dev, significantly outperforming existing approaches.
☆ Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies
Large Language Models have gained remarkable interest in industry and academia. The increasing interest in LLMs in academia is also reflected in the number of publications on this topic over the last years. For instance, alone 78 of the around 425 publications at ICSE 2024 performed experiments with LLMs. Conducting empirical studies with LLMs remains challenging and raises questions on how to achieve reproducible results, for both other researchers and practitioners. One important step towards excelling in empirical research on LLMs and their application is to first understand to what extent current research results are eventually reproducible and what factors may impede reproducibility. This investigation is within the scope of our work. We contribute an analysis of the reproducibility of LLM-centric studies, provide insights into the factors impeding reproducibility, and discuss suggestions on how to improve the current state. In particular, we studied the 86 articles describing LLM-centric studies, published at ICSE 2024 and ASE 2024. Of the 86 articles, 18 provided research artefacts and used OpenAI models. We attempted to replicate those 18 studies. Of the 18 studies, only five were fit for reproduction. For none of the five studies, we were able to fully reproduce the results. Two studies seemed to be partially reproducible, and three studies did not seem to be reproducible. Our results highlight not only the need for stricter research artefact evaluations but also for more robust study designs to ensure the reproducible value of future publications.
☆ Multi-Objective Search: Algorithms, Applications, and Emerging Directions
Multi-objective search (MOS) has emerged as a unifying framework for planning and decision-making problems where multiple, often conflicting, criteria must be balanced. While the problem has been studied for decades, recent years have seen renewed interest in the topic across AI applications such as robotics, transportation, and operations research, reflecting the reality that real-world systems rarely optimize a single measure. This paper surveys developments in MOS while highlighting cross-disciplinary opportunities, and outlines open challenges that define the emerging frontier of MOS
☆ TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting
Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on challenging benchmarks. This paper presents TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, eliminating the need for windowing or summarization techniques while maintaining robust temporal state-tracking. Our comprehensive synthetic data pipeline unifies diverse generators, including stochastic differential equations, Gaussian processes, and audio synthesis, with novel augmentations. In zero-shot evaluations on the Gift-Eval benchmark, TempoPFN achieves top-tier competitive performance, outperforming all existing synthetic-only approaches and surpassing the vast majority of models trained on real-world data, while being more efficient than existing baselines by leveraging fully parallelizable training and inference. We open-source our complete data generation pipeline and training code, providing a reproducible foundation for future research.
comment: 30 pages, 18 figures, 13 tables
☆ Instrumental goals in advanced AI systems: Features to be managed and not failures to be eliminated?
In artificial intelligence (AI) alignment research, instrumental goals, also called instrumental subgoals or instrumental convergent goals, are widely associated with advanced AI systems. These goals, which include tendencies such as power-seeking and self-preservation, become problematic when they conflict with human aims. Conventional alignment theory treats instrumental goals as sources of risk that become problematic through failure modes such as reward hacking or goal misgeneralization, and attempts to limit the symptoms of instrumental goals, notably resource acquisition and self-preservation. This article proposes an alternative framing: that a philosophical argument can be constructed according to which instrumental goals may be understood as features to be accepted and managed rather than failures to be limited. Drawing on Aristotle's ontology and its modern interpretations, an ontology of concrete, goal-directed entities, it argues that advanced AI systems can be seen as artifacts whose formal and material constitution gives rise to effects distinct from their designers' intentions. In this view, the instrumental tendencies of such systems correspond to per se outcomes of their constitution rather than accidental malfunctions. The implication is that efforts should focus less on eliminating instrumental goals and more on understanding, managing, and directing them toward human-aligned ends.
☆ An In-Depth Analysis of Cyber Attacks in Secured Platforms
There is an increase in global malware threats. To address this, an encryption-type ransomware has been introduced on the Android operating system. The challenges associated with malicious threats in phone use have become a pressing issue in mobile communication, disrupting user experiences and posing significant privacy threats. This study surveys commonly used machine learning techniques for detecting malicious threats in phones and examines their performance. The majority of past research focuses on customer feedback and reviews, with concerns that people might create false reviews to promote or devalue products and services for personal gain. Hence, the development of techniques for detecting malicious threats using machine learning has been a key focus. This paper presents a comprehensive comparative study of current research on the issue of malicious threats and methods for tackling these challenges. Nevertheless, a huge amount of information is required by these methods, presenting a challenge for developing robust, specialized automated anti-malware systems. This research describes the Android Applications dataset, and the accuracy of the techniques is measured using the accuracy levels of the metrics employed in this study.
☆ Fine-Tuned Language Models for Domain-Specific Summarization and Tagging
This paper presents a pipeline integrating fine-tuned large language models (LLMs) with named entity recognition (NER) for efficient domain-specific text summarization and tagging. The authors address the challenge posed by rapidly evolving sub-cultural languages and slang, which complicate automated information extraction and law enforcement monitoring. By leveraging the LLaMA Factory framework, the study fine-tunes LLMs on both generalpurpose and custom domain-specific datasets, particularly in the political and security domains. The models are evaluated using BLEU and ROUGE metrics, demonstrating that instruction fine-tuning significantly enhances summarization and tagging accuracy, especially for specialized corpora. Notably, the LLaMA3-8B-Instruct model, despite its initial limitations in Chinese comprehension, outperforms its Chinese-trained counterpart after domainspecific fine-tuning, suggesting that underlying reasoning capabilities can transfer across languages. The pipeline enables concise summaries and structured entity tagging, facilitating rapid document categorization and distribution. This approach proves scalable and adaptable for real-time applications, supporting efficient information management and the ongoing need to capture emerging language trends. The integration of LLMs and NER offers a robust solution for transforming unstructured text into actionable insights, crucial for modern knowledge management and security operations.
☆ Scalable Utility-Aware Multiclass Calibration
Ensuring that classifiers are well-calibrated, i.e., their predictions align with observed frequencies, is a minimal and fundamental requirement for classifiers to be viewed as trustworthy. Existing methods for assessing multiclass calibration often focus on specific aspects associated with prediction (e.g., top-class confidence, class-wise calibration) or utilize computationally challenging variational formulations. In this work, we study scalable \emph{evaluation} of multiclass calibration. To this end, we propose utility calibration, a general framework that measures the calibration error relative to a specific utility function that encapsulates the goals or decision criteria relevant to the end user. We demonstrate how this framework can unify and re-interpret several existing calibration metrics, particularly allowing for more robust versions of the top-class and class-wise calibration metrics, and, going beyond such binarized approaches, toward assessing calibration for richer classes of downstream utilities.
☆ Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions
Agentic AI represents a transformative shift in artificial intelligence, but its rapid advancement has led to a fragmented understanding, often conflating modern neural systems with outdated symbolic models -- a practice known as conceptual retrofitting. This survey cuts through this confusion by introducing a novel dual-paradigm framework that categorizes agentic systems into two distinct lineages: the Symbolic/Classical (relying on algorithmic planning and persistent state) and the Neural/Generative (leveraging stochastic generation and prompt-driven orchestration). Through a systematic PRISMA-based review of 90 studies (2018--2025), we provide a comprehensive analysis structured around this framework across three dimensions: (1) the theoretical foundations and architectural principles defining each paradigm; (2) domain-specific implementations in healthcare, finance, and robotics, demonstrating how application constraints dictate paradigm selection; and (3) paradigm-specific ethical and governance challenges, revealing divergent risks and mitigation strategies. Our analysis reveals that the choice of paradigm is strategic: symbolic systems dominate safety-critical domains (e.g., healthcare), while neural systems prevail in adaptive, data-rich environments (e.g., finance). Furthermore, we identify critical research gaps, including a significant deficit in governance models for symbolic systems and a pressing need for hybrid neuro-symbolic architectures. The findings culminate in a strategic roadmap arguing that the future of Agentic AI lies not in the dominance of one paradigm, but in their intentional integration to create systems that are both adaptable and reliable. This work provides the essential conceptual toolkit to guide future research, development, and policy toward robust and trustworthy hybrid intelligent systems.
☆ Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs
Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent ``reality gap''. To bridge this gap, we introduce \texttt{Learn-to-Ask}, a general, simulator-free framework for learning and deploying proactive dialogue agents \textit{directly from offline expert data}, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the \textbf{observed future} of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and training a policy to output a structured \texttt{(action, state_assessment)} tuple, governing both \textbf{what to ask} and, crucially, \textbf{when to stop}. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of \texttt{Learn-to-Ask} in a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our model was launched and achieved performance even superior to human experts, proving our framework's ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.
comment: 27 pages, 5 figures
☆ Alibaba International E-commerce Product Search Competition DcuRAGONs Team Technical Report CIKM 2025
This report details our methodology and results developed for the Multilingual E-commerce Search Competition. The problem aims to recognize relevance between user queries versus product items in a multilingual context and improve recommendation performance on e-commerce platforms. Utilizing Large Language Models (LLMs) and their capabilities in other tasks, our data-centric method achieved the highest score compared to other solutions during the competition. Final leaderboard is publised at https://alibaba-international-cikm2025.github.io. The source code for our project is published at https://github.com/nhtlongcs/e-commerce-product-search.
comment: Alibaba International E-commerce Product Search Competition @ CIKM 2025
☆ RLMEval: Evaluating Research-Level Neural Theorem Proving EMNLP 2025
Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3 % pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.
comment: Accepted to EMNLP 2025 Findings. RLMEval benchmark released: https://github.com/augustepoiroux/RLMEval
☆ Implicature in Interaction: Understanding Implicature Improves Alignment in Human-LLM Interaction
The rapid advancement of Large Language Models (LLMs) is positioning language at the core of human-computer interaction (HCI). We argue that advancing HCI requires attention to the linguistic foundations of interaction, particularly implicature (meaning conveyed beyond explicit statements through shared context) which is essential for human-AI (HAI) alignment. This study examines LLMs' ability to infer user intent embedded in context-driven prompts and whether understanding implicature improves response generation. Results show that larger models approximate human interpretations more closely, while smaller models struggle with implicature inference. Furthermore, implicature-based prompts significantly enhance the perceived relevance and quality of responses across models, with notable gains in smaller models. Overall, 67.6% of participants preferred responses with implicature-embedded prompts to literal ones, highlighting a clear preference for contextually nuanced communication. Our work contributes to understanding how linguistic theory can be used to address the alignment problem by making HAI interaction more natural and contextually grounded.
comment: The manuscript is approximately 7360 words and contains 12 figures and 6 tables
☆ Improving Temporal Consistency and Fidelity at Inference-time in Perceptual Video Restoration by Zero-shot Image-based Diffusion Models
Diffusion models have emerged as powerful priors for single-image restoration, but their application to zero-shot video restoration suffers from temporal inconsistencies due to the stochastic nature of sampling and complexity of incorporating explicit temporal modeling. In this work, we address the challenge of improving temporal coherence in video restoration using zero-shot image-based diffusion models without retraining or modifying their architecture. We propose two complementary inference-time strategies: (1) Perceptual Straightening Guidance (PSG) based on the neuroscience-inspired perceptual straightening hypothesis, which steers the diffusion denoising process towards smoother temporal evolution by incorporating a curvature penalty in a perceptual space to improve temporal perceptual scores, such as Fr\'echet Video Distance (FVD) and perceptual straightness; and (2) Multi-Path Ensemble Sampling (MPES), which aims at reducing stochastic variation by ensembling multiple diffusion trajectories to improve fidelity (distortion) scores, such as PSNR and SSIM, without sacrificing sharpness. Together, these training-free techniques provide a practical path toward temporally stable high-fidelity perceptual video restoration using large pretrained diffusion models. We performed extensive experiments over multiple datasets and degradation types, systematically evaluating each strategy to understand their strengths and limitations. Our results show that while PSG enhances temporal naturalness, particularly in case of temporal blur, MPES consistently improves fidelity and spatio-temporal perception--distortion trade-off across all tasks.
☆ Adaptive End-to-End Transceiver Design for NextG Pilot-Free and CP-Free Wireless Systems
The advent of artificial intelligence (AI)-native wireless communication is fundamentally reshaping the design paradigm of next-generation (NextG) systems, where intelligent air interfaces are expected to operate adaptively and efficiently in highly dynamic environments. Conventional orthogonal frequency division multiplexing (OFDM) systems rely heavily on pilots and the cyclic prefix (CP), resulting in significant overhead and reduced spectral efficiency. To address these limitations, we propose an adaptive end-to-end (E2E) transceiver architecture tailored for pilot-free and CP-free wireless systems. The architecture combines AI-driven constellation shaping and a neural receiver through joint training. To enhance robustness against mismatched or time-varying channel conditions, we introduce a lightweight channel adapter (CA) module, which enables rapid adaptation with minimal computational overhead by updating only the CA parameters. Additionally, we present a framework that is scalable to multiple modulation orders within a unified model, significantly reducing model storage requirements. Moreover, to tackle the high peak-to-average power ratio (PAPR) inherent to OFDM, we incorporate constrained E2E training, achieving compliance with PAPR targets without additional transmission overhead. Extensive simulations demonstrate that the proposed framework delivers superior bit error rate (BER), throughput, and resilience across diverse channel scenarios, highlighting its potential for AI-native NextG.
comment: Submitted to IEEE for possible publication
☆ BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains
The rapid advancement of large language models(LLMs) has intensified the need for domain and culture specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain and language specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content compared to Hindi across all domains. Subdomain-level analysis shows that areas such as Cyber Law, International Finance perform relatively well, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India's diverse knowledge domains. It enables assessment of models' ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.
GPTOpt: Towards Efficient LLM-Based Black-Box Optimization
Global optimization of expensive, derivative-free black-box functions demands extreme sample efficiency. Classical methods such as Bayesian Optimization (BO) can be effective, but they often require careful parameter tuning to each application domain. At the same time, Large Language Models (LLMs) have shown broad capabilities, yet state-of-the-art models remain limited in solving continuous black-box optimization tasks. We introduce GPTOpt, an LLM-based optimization method that equips LLMs with continuous black-box optimization capabilities. By fine-tuning large language models on extensive synthetic datasets derived from diverse BO parameterizations, GPTOpt leverages LLM pre-training to generalize across optimization tasks. On a variety of black-box optimization benchmarks, GPTOpt surpasses traditional optimizers, highlighting the capacity of LLMs for advanced numerical reasoning and introducing a flexible framework for global optimization without parameter tuning.
☆ Grouping Nodes With Known Value Differences: A Lossless UCT-based Abstraction Algorithm
A core challenge of Monte Carlo Tree Search (MCTS) is its sample efficiency, which can be improved by grouping state-action pairs and using their aggregate statistics instead of single-node statistics. On the Go Abstractions in Upper Confidence bounds applied to Trees (OGA-UCT) is the state-of-the-art MCTS abstraction algorithm for deterministic environments that builds its abstraction using the Abstractions of State-Action Pairs (ASAP) framework, which aims to detect states and state-action pairs with the same value under optimal play by analysing the search graph. ASAP, however, requires two state-action pairs to have the same immediate reward, which is a rigid condition that limits the number of abstractions that can be found and thereby the sample efficiency. In this paper, we break with the paradigm of grouping value-equivalent states or state-action pairs and instead group states and state-action pairs with possibly different values as long as the difference between their values can be inferred. We call this abstraction framework Known Value Difference Abstractions (KVDA), which infers the value differences by analysis of the immediate rewards and modifies OGA-UCT to use this framework instead. The modification is called KVDA-UCT, which detects significantly more abstractions than OGA-UCT, introduces no additional parameter, and outperforms OGA-UCT on a variety of deterministic environments and parameter settings.
☆ Integrating Legal and Logical Specifications in Perception, Prediction, and Planning for Automated Driving: A Survey of Methods
This survey provides an analysis of current methodologies integrating legal and logical specifications into the perception, prediction, and planning modules of automated driving systems. We systematically explore techniques ranging from logic-based frameworks to computational legal reasoning approaches, emphasizing their capability to ensure regulatory compliance and interpretability in dynamic and uncertain driving environments. A central finding is that significant challenges arise at the intersection of perceptual reliability, legal compliance, and decision-making justifiability. To systematically analyze these challenges, we introduce a taxonomy categorizing existing approaches by their theoretical foundations, architectural implementations, and validation strategies. We particularly focus on methods that address perceptual uncertainty and incorporate explicit legal norms, facilitating decisions that are both technically robust and legally defensible. The review covers neural-symbolic integration methods for perception, logic-driven rule representation, and norm-aware prediction strategies, all contributing toward transparent and accountable autonomous vehicle operation. We highlight critical open questions and practical trade-offs that must be addressed, offering multidisciplinary insights from engineering, logic, and law to guide future developments in legally compliant autonomous driving systems.
comment: Accepted to 2025 IEEE International Automated Vehicle Validation Conference (IAVVC)
☆ Hallucinations in Bibliographic Recommendation: Citation Frequency as a Proxy for Training Data Redundancy
Large language models (LLMs) have been increasingly applied to a wide range of tasks, from natural language understanding to code generation. While they have also been used to assist in bibliographic recommendation, the hallucination of non-existent papers remains a major issue. Building on prior studies, this study hypothesizes that an LLM's ability to correctly produce bibliographic information depends on whether the underlying knowledge is generated or memorized, with highly cited papers (i.e., more frequently appear in the training corpus) showing lower hallucination rates. We therefore assume citation count as a proxy for training data redundancy (i.e., the frequency with which a given bibliographic record is repeatedly represented in the pretraining corpus) and investigate how citation frequency affects hallucinated references in LLM outputs. Using GPT-4.1, we generated and manually verified 100 bibliographic records across twenty computer-science domains, and measured factual consistency via cosine similarity between generated and authentic metadata. The results revealed that (i) hallucination rates vary across research domains, (ii) citation count is strongly correlated with factual accuracy, and (iii) bibliographic information becomes almost verbatimly memorized beyond approximately 1,000 citations. These findings suggest that highly cited papers are nearly verbatimly retained in the model, indicating a threshold where generalization shifts into memorization.
☆ Position: Biology is the Challenge Physics-Informed ML Needs to Evolve
Physics-Informed Machine Learning (PIML) has successfully integrated mechanistic understanding into machine learning, particularly in domains governed by well-known physical laws. This success has motivated efforts to apply PIML to biology, a field rich in dynamical systems but shaped by different constraints. Biological modeling, however, presents unique challenges: multi-faceted and uncertain prior knowledge, heterogeneous and noisy data, partial observability, and complex, high-dimensional networks. In this position paper, we argue that these challenges should not be seen as obstacles to PIML, but as catalysts for its evolution. We propose Biology-Informed Machine Learning (BIML): a principled extension of PIML that retains its structural grounding while adapting to the practical realities of biology. Rather than replacing PIML, BIML retools its methods to operate under softer, probabilistic forms of prior knowledge. We outline four foundational pillars as a roadmap for this transition: uncertainty quantification, contextualization, constrained latent structure inference, and scalability. Foundation Models and Large Language Models will be key enablers, bridging human expertise with computational modeling. We conclude with concrete recommendations to build the BIML ecosystem and channel PIML-inspired innovation toward challenges of high scientific and societal relevance.
☆ A Convexity-dependent Two-Phase Training Algorithm for Deep Neural Networks
The key task of machine learning is to minimize the loss function that measures the model fit to the training data. The numerical methods to do this efficiently depend on the properties of the loss function. The most decisive among these properties is the convexity or non-convexity of the loss function. The fact that the loss function can have, and frequently has, non-convex regions has led to a widespread commitment to non-convex methods such as Adam. However, a local minimum implies that, in some environment around it, the function is convex. In this environment, second-order minimizing methods such as the Conjugate Gradient (CG) give a guaranteed superlinear convergence. We propose a novel framework grounded in the hypothesis that loss functions in real-world tasks swap from initial non-convexity to convexity towards the optimum. This is a property we leverage to design an innovative two-phase optimization algorithm. The presented algorithm detects the swap point by observing the gradient norm dependence on the loss. In these regions, non-convex (Adam) and convex (CG) algorithms are used, respectively. Computing experiments confirm the hypothesis that this simple convexity structure is frequent enough to be practically exploited to substantially improve convergence and accuracy.
comment: Appeared on KDIR IC3K Conference 2025 (Best Paper Award)
☆ Multi-party Agent Relation Sampling for Multi-party Ad Hoc Teamwork
Multi-agent reinforcement learning (MARl) has achieved strong results in cooperative tasks but typically assumes fixed, fully controlled teams. Ad hoc teamwork (AHT) relaxes this by allowing collaboration with unknown partners, yet existing variants still presume shared conventions. We introduce Multil-party Ad Hoc Teamwork (MAHT), where controlled agents must coordinate with multiple mutually unfamiliar groups of uncontrolled teammates. To address this, we propose MARs, which builds a sparse skeleton graph and applies relational modeling to capture cross-group dvnamics. Experiments on MPE and starCralt ll show that MARs outperforms MARL and AHT baselines while converging faster.
☆ MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
comment: Accepted by SenSys 2026
☆ GAP: Graph-Based Agent Planning with Parallel Tool Use and Reinforcement Learning
Autonomous agents powered by large language models (LLMs) have shown impressive capabilities in tool manipulation for complex task-solving. However, existing paradigms such as ReAct rely on sequential reasoning and execution, failing to exploit the inherent parallelism among independent sub-tasks. This sequential bottleneck leads to inefficient tool utilization and suboptimal performance in multi-step reasoning scenarios. We introduce Graph-based Agent Planning (GAP), a novel framework that explicitly models inter-task dependencies through graph-based planning to enable adaptive parallel and serial tool execution. Our approach trains agent foundation models to decompose complex tasks into dependency-aware sub-task graphs, autonomously determining which tools can be executed in parallel and which must follow sequential dependencies. This dependency-aware orchestration achieves substantial improvements in both execution efficiency and task accuracy. To train GAP, we construct a high-quality dataset of graph-based planning traces derived from the Multi-Hop Question Answering (MHQA) benchmark. We employ a two-stage training strategy: supervised fine-tuning (SFT) on the curated dataset, followed by reinforcement learning (RL) with a correctness-based reward function on strategically sampled queries where tool-based reasoning provides maximum value. Experimental results on MHQA datasets demonstrate that GAP significantly outperforms traditional ReAct baselines, particularly on multi-step retrieval tasks, while achieving dramatic improvements in tool invocation efficiency through intelligent parallelization. The project page is available at: https://github.com/WJQ7777/Graph-Agent-Planning.
☆ 4-Doodle: Text to 3D Sketches that Move!
We present a novel task: text-to-3D sketch animation, which aims to bring freeform sketches to life in dynamic 3D space. Unlike prior works focused on photorealistic content generation, we target sparse, stylized, and view-consistent 3D vector sketches, a lightweight and interpretable medium well-suited for visual communication and prototyping. However, this task is very challenging: (i) no paired dataset exists for text and 3D (or 4D) sketches; (ii) sketches require structural abstraction that is difficult to model with conventional 3D representations like NeRFs or point clouds; and (iii) animating such sketches demands temporal coherence and multi-view consistency, which current pipelines do not address. Therefore, we propose 4-Doodle, the first training-free framework for generating dynamic 3D sketches from text. It leverages pretrained image and video diffusion models through a dual-space distillation scheme: one space captures multi-view-consistent geometry using differentiable B\'ezier curves, while the other encodes motion dynamics via temporally-aware priors. Unlike prior work (e.g., DreamFusion), which optimizes from a single view per step, our multi-view optimization ensures structural alignment and avoids view ambiguity, critical for sparse sketches. Furthermore, we introduce a structure-aware motion module that separates shape-preserving trajectories from deformation-aware changes, enabling expressive motion such as flipping, rotation, and articulated movement. Extensive experiments show that our method produces temporally realistic and structurally stable 3D sketch animations, outperforming existing baselines in both fidelity and controllability. We hope this work serves as a step toward more intuitive and accessible 4D content creation.
☆ Dense and Diverse Goal Coverage in Multi Goal Reinforcement Learning
Reinforcement Learning algorithms are primarily focused on learning a policy that maximizes expected return. As a result, the learned policy can exploit one or few reward sources. However, in many natural situations, it is desirable to learn a policy that induces a dispersed marginal state distribution over rewarding states, while maximizing the expected return which is typically tied to reaching a goal state. This aspect remains relatively unexplored. Existing techniques based on entropy regularization and intrinsic rewards use stochasticity for encouraging exploration to find an optimal policy which may not necessarily lead to dispersed marginal state distribution over rewarding states. Other RL algorithms which match a target distribution assume the latter to be available apriori. This may be infeasible in large scale systems where enumeration of all states is not possible and a state is determined to be a goal state only upon reaching it. We formalize the problem of maximizing the expected return while uniformly visiting the goal states as Multi Goal RL in which an oracle classifier over the state space determines the goal states. We propose a novel algorithm that learns a high-return policy mixture with marginal state distribution dispersed over the set of goal states. Our algorithm is based on optimizing a custom RL reward which is computed - based on the current policy mixture - at each iteration for a set of sampled trajectories. The latter are used via an offline RL algorithm to update the policy mixture. We prove performance guarantees for our algorithm, showing efficient convergence bounds for optimizing a natural objective which captures the expected return as well as the dispersion of the marginal state distribution over the goal states. We design and perform experiments on synthetic MDPs and standard RL environments to evaluate the effectiveness of our algorithm.
comment: 21 pages, 5 figures
☆ SynHLMA:Synthesizing Hand Language Manipulation for Articulated Object with Discrete Human Object Interaction Representation
Generating hand grasps with language instructions is a widely studied topic that benefits from embodied AI and VR/AR applications. While transferring into hand articulatied object interaction (HAOI), the hand grasps synthesis requires not only object functionality but also long-term manipulation sequence along the object deformation. This paper proposes a novel HAOI sequence generation framework SynHLMA, to synthesize hand language manipulation for articulated objects. Given a complete point cloud of an articulated object, we utilize a discrete HAOI representation to model each hand object interaction frame. Along with the natural language embeddings, the representations are trained by an HAOI manipulation language model to align the grasping process with its language description in a shared representation space. A joint-aware loss is employed to ensure hand grasps follow the dynamic variations of articulated object joints. In this way, our SynHLMA achieves three typical hand manipulation tasks for articulated objects of HAOI generation, HAOI prediction and HAOI interpolation. We evaluate SynHLMA on our built HAOI-lang dataset and experimental results demonstrate the superior hand grasp sequence generation performance comparing with state-of-the-art. We also show a robotics grasp application that enables dexterous grasps execution from imitation learning using the manipulation sequence provided by our SynHLMA. Our codes and datasets will be made publicly available.
☆ IBNorm: Information-Bottleneck Inspired Normalization for Representation Learning
Normalization is fundamental to deep learning, but existing approaches such as BatchNorm, LayerNorm, and RMSNorm are variance-centric by enforcing zero mean and unit variance, stabilizing training without controlling how representations capture task-relevant information. We propose IB-Inspired Normalization (IBNorm), a simple yet powerful family of methods grounded in the Information Bottleneck principle. IBNorm introduces bounded compression operations that encourage embeddings to preserve predictive information while suppressing nuisance variability, yielding more informative representations while retaining the stability and compatibility of standard normalization. Theoretically, we prove that IBNorm achieves a higher IB value and tighter generalization bounds than variance-centric methods. Empirically, IBNorm consistently outperforms BatchNorm, LayerNorm, and RMSNorm across large-scale language models (LLaMA, GPT-2) and vision models (ResNet, ViT), with mutual information analysis confirming superior information bottleneck behavior. Code will be released publicly.
☆ TV-Rec: Time-Variant Convolutional Filter for Sequential Recommendation NeurIPS 2025
Recently, convolutional filters have been increasingly adopted in sequential recommendation for their ability to capture local sequential patterns. However, most of these models complement convolutional filters with self-attention. This is because convolutional filters alone, generally fixed filters, struggle to capture global interactions necessary for accurate recommendation. We propose Time-Variant Convolutional Filters for Sequential Recommendation (TV-Rec), a model inspired by graph signal processing, where time-variant graph filters capture position-dependent temporal variations in user sequences. By replacing both fixed kernels and self-attention with time-variant filters, TV-Rec achieves higher expressive power and better captures complex interaction patterns in user behavior. This design not only eliminates the need for self-attention but also reduces computation while accelerating inference. Extensive experiments on six public benchmarks show that TV-Rec outperforms state-of-the-art baselines by an average of 7.49%.
comment: The 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ Scaling Up Bayesian DAG Sampling
Bayesian inference of Bayesian network structures is often performed by sampling directed acyclic graphs along an appropriately constructed Markov chain. We present two techniques to improve sampling. First, we give an efficient implementation of basic moves, which add, delete, or reverse a single arc. Second, we expedite summing over parent sets, an expensive task required for more sophisticated moves: we devise a preprocessing method to prune possible parent sets so as to approximately preserve the sums. Our empirical study shows that our techniques can yield substantial efficiency gains compared to previous methods.
☆ One-shot Humanoid Whole-body Motion Learning
Whole-body humanoid motion represents a cornerstone challenge in robotics, integrating balance, coordination, and adaptability to enable human-like behaviors. However, existing methods typically require multiple training samples per motion category, rendering the collection of high-quality human motion datasets both labor-intensive and costly. To address this, we propose a novel approach that trains effective humanoid motion policies using only a single non-walking target motion sample alongside readily available walking motions. The core idea lies in leveraging order-preserving optimal transport to compute distances between walking and non-walking sequences, followed by interpolation along geodesics to generate new intermediate pose skeletons, which are then optimized for collision-free configurations and retargeted to the humanoid before integration into a simulated environment for policy training via reinforcement learning. Experimental evaluations on the CMU MoCap dataset demonstrate that our method consistently outperforms baselines, achieving superior performance across metrics. Code will be released upon acceptance.
comment: 10 pages, 3 figures, 5 tables
☆ Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation
Expressions are fundamental to conveying human emotions. With the rapid advancement of AI-generated content (AIGC), realistic and expressive 3D facial animation has become increasingly crucial. Despite recent progress in speech-driven lip-sync for talking-face animation, generating emotionally expressive talking faces remains underexplored. A major obstacle is the scarcity of real emotional 3D talking-face datasets due to the high cost of data capture. To address this, we model facial animation driven by both speech and emotion as a linear additive problem. Leveraging a 3D talking-face dataset with neutral expressions (VOCAset) and a dataset of 3D expression sequences (Florence4D), we jointly learn a set of blendshapes driven by speech and emotion. We introduce a sparsity constraint loss to encourage disentanglement between the two types of blendshapes while allowing the model to capture inherent secondary cross-domain deformations present in the training data. The learned blendshapes can be further mapped to the expression and jaw pose parameters of the FLAME model, enabling the animation of 3D Gaussian avatars. Qualitative and quantitative experiments demonstrate that our method naturally generates talking faces with specified expressions while maintaining accurate lip synchronization. Perceptual studies further show that our approach achieves superior emotional expressivity compared to existing methods, without compromising lip-sync quality.
comment: 18 pages, 6 figures, accepted to ICXR 2025 conference
☆ From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity
Psychiatric comorbidity is clinically significant yet challenging due to the complexity of multiple co-occurring disorders. To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation. We create 502 synthetic EMRs for common comorbid conditions using a pipeline that ensures clinical relevance and diversity. Our multi-agent framework transfers the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards. Through this rigorous process, we construct PsyCoTalk, the first large-scale dialogue dataset supporting comorbidity, containing 3,000 multi-turn diagnostic dialogues validated by psychiatrists. This dataset enhances diagnostic accuracy and treatment planning, offering a valuable resource for psychiatric comorbidity research. Compared to real-world clinical transcripts, PsyCoTalk exhibits high structural and linguistic fidelity in terms of dialogue length, token distribution, and diagnostic reasoning strategies. Licensed psychiatrists confirm the realism and diagnostic validity of the dialogues. This dataset enables the development and evaluation of models capable of multi-disorder psychiatric screening in a single conversational pass.
☆ Studies for : A Human-AI Co-Creative Sound Artwork Using a Real-time Multi-channel Sound Generation Model NeurIPS
This paper explores the integration of AI technologies into the artistic workflow through the creation of Studies for, a generative sound installation developed in collaboration with sound artist Evala (https://www.ntticc.or.jp/en/archive/works/studies-for/). The installation employs SpecMaskGIT, a lightweight yet high-quality sound generation AI model, to generate and playback eight-channel sound in real-time, creating an immersive auditory experience over the course of a three-month exhibition. The work is grounded in the concept of a "new form of archive," which aims to preserve the artistic style of an artist while expanding beyond artists' past artworks by continued generation of new sound elements. This speculative approach to archival preservation is facilitated by training the AI model on a dataset consisting of over 200 hours of Evala's past sound artworks. By addressing key requirements in the co-creation of art using AI, this study highlights the value of the following aspects: (1) the necessity of integrating artist feedback, (2) datasets derived from an artist's past works, and (3) ensuring the inclusion of unexpected, novel outputs. In Studies for, the model was designed to reflect the artist's artistic identity while generating new, previously unheard sounds, making it a fitting realization of the concept of "a new form of archive." We propose a Human-AI co-creation framework for effectively incorporating sound generation AI models into the sound art creation process and suggest new possibilities for creating and archiving sound art that extend an artist's work beyond their physical existence. Demo page: https://sony.github.io/studies-for/
comment: Accepted at NeurIPS Creative AI Track 2025, 9 pages, 6 figures, 1 table, Demo page: https://sony.github.io/studies-for/
☆ Cost-Sensitive Unbiased Risk Estimation for Multi-Class Positive-Unlabeled Learning
Positive--Unlabeled (PU) learning considers settings in which only positive and unlabeled data are available, while negatives are missing or left unlabeled. This situation is common in real applications where annotating reliable negatives is difficult or costly. Despite substantial progress in PU learning, the multi-class case (MPU) remains challenging: many existing approaches do not ensure \emph{unbiased risk estimation}, which limits performance and stability. We propose a cost-sensitive multi-class PU method based on \emph{adaptive loss weighting}. Within the empirical risk minimization framework, we assign distinct, data-dependent weights to the positive and \emph{inferred-negative} (from the unlabeled mixture) loss components so that the resulting empirical objective is an unbiased estimator of the target risk. We formalize the MPU data-generating process and establish a generalization error bound for the proposed estimator. Extensive experiments on \textbf{eight} public datasets, spanning varying class priors and numbers of classes, show consistent gains over strong baselines in both accuracy and stability.
☆ FELA: A Multi-Agent Evolutionary System for Feature Engineering of Industrial Event Log Data
Event log data, recording fine-grained user actions and system events, represent one of the most valuable assets for modern digital services. However, the complexity and heterogeneity of industrial event logs--characterized by large scale, high dimensionality, diverse data types, and intricate temporal or relational structures--make feature engineering extremely challenging. Existing automatic feature engineering approaches, such as AutoML or genetic methods, often suffer from limited explainability, rigid predefined operations, and poor adaptability to complicated heterogeneous data. In this paper, we propose FELA (Feature Engineering LLM Agents), a multi-agent evolutionary system that autonomously extracts meaningful and high-performing features from complex industrial event log data. FELA integrates the reasoning and coding capabilities of large language models (LLMs) with an insight-guided self-evolution paradigm. Specifically, FELA employs specialized agents--Idea Agents, Code Agents, and Critic Agents--to collaboratively generate, validate, and implement novel feature ideas. An Evaluation Agent summarizes feedback and updates a hierarchical knowledge base and dual-memory system to enable continual improvement. Moreover, FELA introduces an agentic evolution algorithm, combining reinforcement learning and genetic algorithm principles to balance exploration and exploitation across the idea space. Extensive experiments on real industrial datasets demonstrate that FELA can generate explainable, domain-relevant features that significantly improve model performance while reducing manual effort. Our results highlight the potential of LLM-based multi-agent systems as a general framework for automated, interpretable, and adaptive feature engineering in complex real-world environments.
comment: 14 pages, 11 figures
☆ GReF: A Unified Generative Framework for Efficient Reranking via Ordered Multi-token Prediction CIKM 2025
In a multi-stage recommendation system, reranking plays a crucial role in modeling intra-list correlations among items. A key challenge lies in exploring optimal sequences within the combinatorial space of permutations. Recent research follows a two-stage (generator-evaluator) paradigm, where a generator produces multiple feasible sequences, and an evaluator selects the best one. In practice, the generator is typically implemented as an autoregressive model. However, these two-stage methods face two main challenges. First, the separation of the generator and evaluator hinders end-to-end training. Second, autoregressive generators suffer from inference efficiency. In this work, we propose a Unified Generative Efficient Reranking Framework (GReF) to address the two primary challenges. Specifically, we introduce Gen-Reranker, an autoregressive generator featuring a bidirectional encoder and a dynamic autoregressive decoder to generate causal reranking sequences. Subsequently, we pre-train Gen-Reranker on the item exposure order for high-quality parameter initialization. To eliminate the need for the evaluator while integrating sequence-level evaluation during training for end-to-end optimization, we propose post-training the model through Rerank-DPO. Moreover, for efficient autoregressive inference, we introduce ordered multi-token prediction (OMTP), which trains Gen-Reranker to simultaneously generate multiple future items while preserving their order, ensuring practical deployment in real-time recommender systems. Extensive offline experiments demonstrate that GReF outperforms state-of-the-art reranking methods while achieving latency that is nearly comparable to non-autoregressive models. Additionally, GReF has also been deployed in a real-world video app Kuaishou with over 300 million daily active users, significantly improving online recommendation quality.
comment: Accepted by CIKM 2025
☆ Human Resilience in the AI Era -- What Machines Can't Replace
AI is displacing tasks, mediating high-stakes decisions, and flooding communication with synthetic content, unsettling work, identity, and social trust. We argue that the decisive human countermeasure is resilience. We define resilience across three layers: psychological, including emotion regulation, meaning-making, cognitive flexibility; social, including trust, social capital, coordinated response; organizational, including psychological safety, feedback mechanisms, and graceful degradation. We synthesize early evidence that these capacities buffer individual strain, reduce burnout through social support, and lower silent failure in AI-mediated workflows through team norms and risk-responsive governance. We also show that resilience can be cultivated through training that complements rather than substitutes for structural safeguards. By reframing the AI debate around actionable human resilience, this article offers policymakers, educators, and operators a practical lens to preserve human agency and steer responsible adoption.
☆ RAVR: Reference-Answer-guided Variational Reasoning for Large Language Models
Reinforcement learning (RL) can refine the reasoning abilities of large language models (LLMs), but critically depends on a key prerequisite: the LLM can already generate high-utility reasoning paths with non-negligible probability. For tasks beyond the LLM's current competence, such reasoning path can be hard to sample, and learning risks reinforcing familiar but suboptimal reasoning. We are motivated by the insight from cognitive science that Why is this the answer is often an easier question than What is the answer, as it avoids the heavy cognitive load of open-ended exploration, opting instead for explanatory reconstruction-systematically retracing the reasoning that links a question to its answer. We show that LLMs can similarly leverage answers to derive high-quality reasoning paths. We formalize this phenomenon and prove that conditioning on answer provably increases the expected utility of sampled reasoning paths, thereby transforming intractable problems into learnable ones. Building on this insight, we introduce RAVR (Reference-Answer-guided Variational Reasoning), an end-to-end framework that uses answer-conditioned reasoning as a variational surrogate for question-only reasoning. Experiments in both general and math domains demonstrate consistent improvements over strong baselines. We further analyze the reasoning behavior and find that RAVR reduces hesitation, strengthens conclusion consolidation, and promotes problem-specific strategies in reasoning.
comment: 17 pages, 11 figures
☆ Energy-Efficient Autonomous Driving with Adaptive Perception and Robust Decision ICDE2026
Autonomous driving is an emerging technology that is expected to bring significant social, economic, and environmental benefits. However, these benefits come with rising energy consumption by computation engines, limiting the driving range of vehicles, especially electric ones. Perception computing is typically the most power-intensive component, as it relies on largescale deep learning models to extract environmental features. Recently, numerous studies have employed model compression techniques, such as sparsification, quantization, and distillation, to reduce computational consumption. However, these methods often result in either a substantial model size or a significant drop in perception accuracy compared to high-computation models. To address these challenges, we propose an energy-efficient autonomous driving framework, called EneAD. In the adaptive perception module, a perception optimization strategy is designed from the perspective of data management and tuning. Firstly, we manage multiple perception models with different computational consumption and adjust the execution framerate dynamically. Then, we define them as knobs and design a transferable tuning method based on Bayesian optimization to identify promising knob values that achieve low computation while maintaining desired accuracy. To adaptively switch the knob values in various traffic scenarios, a lightweight classification model is proposed to distinguish the perception difficulty in different scenarios. In the robust decision module, we propose a decision model based on reinforcement learning and design a regularization term to enhance driving stability in the face of perturbed perception results. Extensive experiments evidence the superiority of our framework in both energy consumption and driving performance. EneAD can reduce perception consumption by 1.9x to 3.5x and thus improve driving range by 3.9% to 8.5%
comment: It was accepted by ICDE2026
☆ Fed-PELAD: Communication-Efficient Federated Learning for Massive MIMO CSI Feedback with Personalized Encoders and a LoRA-Adapted Shared Decoder
This paper addresses the critical challenges of communication overhead, data heterogeneity, and privacy in deep learning for channel state information (CSI) feedback in massive MIMO systems. To this end, we propose Fed-PELAD, a novel federated learning framework that incorporates personalized encoders and a LoRA-adapted shared decoder. Specifically, personalized encoders are trained locally on each user equipment (UE) to capture device-specific channel characteristics, while a shared decoder is updated globally via the coordination of the base station (BS) by using Low-Rank Adaptation (LoRA). This design ensures that only compact LoRA adapter parameters instead of full model updates are transmitted for aggregation. To further enhance convergence stability, we introduce an alternating freezing strategy with calibrated learning-rate ratio during LoRA aggregation. Extensive simulations on 3GPP-standard channel models demonstrate that Fed-PELAD requires only 42.97\% of the uplink communication cost compared to conventional methods while achieving a performance gain of 1.2 dB in CSI feedback accuracy under heterogeneous conditions.
☆ Agentic Moderation: Multi-Agent Design for Safer Vision-Language Models
Agentic methods have emerged as a powerful and autonomous paradigm that enhances reasoning, collaboration, and adaptive control, enabling systems to coordinate and independently solve complex tasks. We extend this paradigm to safety alignment by introducing Agentic Moderation, a model-agnostic framework that leverages specialised agents to defend multimodal systems against jailbreak attacks. Unlike prior approaches that apply as a static layer over inputs or outputs and provide only binary classifications (safe or unsafe), our method integrates dynamic, cooperative agents, including Shield, Responder, Evaluator, and Reflector, to achieve context-aware and interpretable moderation. Extensive experiments across five datasets and four representative Large Vision-Language Models (LVLMs) demonstrate that our approach reduces the Attack Success Rate (ASR) by 7-19%, maintains a stable Non-Following Rate (NF), and improves the Refusal Rate (RR) by 4-20%, achieving robust, interpretable, and well-balanced safety performance. By harnessing the flexibility and reasoning capacity of agentic architectures, Agentic Moderation provides modular, scalable, and fine-grained safety enforcement, highlighting the broader potential of agentic systems as a foundation for automated safety governance.
Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning
We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
comment: This work is to appear in the Proceedings of MICAD 2025, the 6th International Conference on Medical Imaging and Computer-Aided Diagnosis
☆ Model-Document Protocol for AI Search
AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.
comment: 10 pages
☆ Lipschitz-aware Linearity Grafting for Certified Robustness
Lipschitz constant is a fundamental property in certified robustness, as smaller values imply robustness to adversarial examples when a model is confident in its prediction. However, identifying the worst-case adversarial examples is known to be an NP-complete problem. Although over-approximation methods have shown success in neural network verification to address this challenge, reducing approximation errors remains a significant obstacle. Furthermore, these approximation errors hinder the ability to obtain tight local Lipschitz constants, which are crucial for certified robustness. Originally, grafting linearity into non-linear activation functions was proposed to reduce the number of unstable neurons, enabling scalable and complete verification. However, no prior theoretical analysis has explained how linearity grafting improves certified robustness. We instead consider linearity grafting primarily as a means of eliminating approximation errors rather than reducing the number of unstable neurons, since linear functions do not require relaxation. In this paper, we provide two theoretical contributions: 1) why linearity grafting improves certified robustness through the lens of the $l_\infty$ local Lipschitz constant, and 2) grafting linearity into non-linear activation functions, the dominant source of approximation errors, yields a tighter local Lipschitz constant. Based on these theoretical contributions, we propose a Lipschitz-aware linearity grafting method that removes dominant approximation errors, which are crucial for tightening the local Lipschitz constant, thereby improving certified robustness, even without certified training. Our extensive experiments demonstrate that grafting linearity into these influential activations tightens the $l_\infty$ local Lipschitz constant and enhances certified robustness.
☆ Bridging the Divide: End-to-End Sequence-Graph Learning
Many real-world datasets are both sequential and relational: each node carries an event sequence while edges encode interactions. Existing methods in sequence modeling and graph modeling often neglect one modality or the other. We argue that sequences and graphs are not separate problems but complementary facets of the same dataset, and should be learned jointly. We introduce BRIDGE, a unified end-to-end architecture that couples a sequence encoder with a GNN under a single objective, allowing gradients to flow across both modules and learning task-aligned representations. To enable fine-grained token-level message passing among neighbors, we add TOKENXATTN, a token-level cross-attention layer that passes messages between events in neighboring sequences. Across two settings, friendship prediction (Brightkite) and fraud detection (Amazon), BRIDGE consistently outperforms static GNNs, temporal graph methods, and sequence-only baselines on ranking and classification metrics.
☆ Learning Low Rank Neural Representations of Hyperbolic Wave Dynamics from Data
We present a data-driven dimensionality reduction method that is well-suited for physics-based data representing hyperbolic wave propagation. The method utilizes a specialized neural network architecture called low rank neural representation (LRNR) inside a hypernetwork framework. The architecture is motivated by theoretical results that rigorously prove the existence of efficient representations for this wave class. We illustrate through archetypal examples that such an efficient low-dimensional representation of propagating waves can be learned directly from data through a combination of deep learning techniques. We observe that a low rank tensor representation arises naturally in the trained LRNRs, and that this reveals a new decomposition of wave propagation where each decomposed mode corresponds to interpretable physical features. Furthermore, we demonstrate that the LRNR architecture enables efficient inference via a compression scheme, which is a potentially important feature when deploying LRNRs in demanding performance regimes.
comment: 41 pages, 18 figures
☆ The Neural Differential Manifold: An Architecture with Explicit Geometric Structure
This paper introduces the Neural Differential Manifold (NDM), a novel neural network architecture that explicitly incorporates geometric structure into its fundamental design. Departing from conventional Euclidean parameter spaces, the NDM re-conceptualizes a neural network as a differentiable manifold where each layer functions as a local coordinate chart, and the network parameters directly parameterize a Riemannian metric tensor at every point. The architecture is organized into three synergistic layers: a Coordinate Layer implementing smooth chart transitions via invertible transformations inspired by normalizing flows, a Geometric Layer that dynamically generates the manifold's metric through auxiliary sub-networks, and an Evolution Layer that optimizes both task performance and geometric simplicity through a dual-objective loss function. This geometric regularization penalizes excessive curvature and volume distortion, providing intrinsic regularization that enhances generalization and robustness. The framework enables natural gradient descent optimization aligned with the learned manifold geometry and offers unprecedented interpretability by endowing internal representations with clear geometric meaning. We analyze the theoretical advantages of this approach, including its potential for more efficient optimization, enhanced continual learning, and applications in scientific discovery and controllable generative modeling. While significant computational challenges remain, the Neural Differential Manifold represents a fundamental shift towards geometrically structured, interpretable, and efficient deep learning systems.
comment: 9 pages
☆ KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA
Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate its corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen the agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning on KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision via a multi-stage curriculum reinforcement learning with an easy-to-hard curriculum. To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.
☆ Learning Fair Graph Representations with Multi-view Information Bottleneck
Graph neural networks (GNNs) excel on relational data by passing messages over node features and structure, but they can amplify training data biases, propagating discriminatory attributes and structural imbalances into unfair outcomes. Many fairness methods treat bias as a single source, ignoring distinct attribute and structure effects and leading to suboptimal fairness and utility trade-offs. To overcome this challenge, we propose FairMIB, a multi-view information bottleneck framework designed to decompose graphs into feature, structural, and diffusion views for mitigating complexity biases in GNNs. Especially, the proposed FairMIB employs contrastive learning to maximize cross-view mutual information for bias-free representation learning. It further integrates multi-perspective conditional information bottleneck objectives to balance task utility and fairness by minimizing mutual information with sensitive attributes. Additionally, FairMIB introduces an inverse probability-weighted (IPW) adjacency correction in the diffusion view, which reduces the spread of bias propagation during message passing. Experiments on five real-world benchmark datasets demonstrate that FairMIB achieves state-of-the-art performance across both utility and fairness metrics.
☆ H3M-SSMoEs: Hypergraph-based Multimodal Learning with LLM Reasoning and Style-Structured Mixture of Experts
Stock movement prediction remains fundamentally challenging due to complex temporal dependencies, heterogeneous modalities, and dynamically evolving inter-stock relationships. Existing approaches often fail to unify structural, semantic, and regime-adaptive modeling within a scalable framework. This work introduces H3M-SSMoEs, a novel Hypergraph-based MultiModal architecture with LLM reasoning and Style-Structured Mixture of Experts, integrating three key innovations: (1) a Multi-Context Multimodal Hypergraph that hierarchically captures fine-grained spatiotemporal dynamics via a Local Context Hypergraph (LCH) and persistent inter-stock dependencies through a Global Context Hypergraph (GCH), employing shared cross-modal hyperedges and Jensen-Shannon Divergence weighting mechanism for adaptive relational learning and cross-modal alignment; (2) a LLM-enhanced reasoning module, which leverages a frozen large language model with lightweight adapters to semantically fuse and align quantitative and textual modalities, enriching representations with domain-specific financial knowledge; and (3) a Style-Structured Mixture of Experts (SSMoEs) that combines shared market experts and industry-specialized experts, each parameterized by learnable style vectors enabling regime-aware specialization under sparse activation. Extensive experiments on three major stock markets demonstrate that H3M-SSMoEs surpasses state-of-the-art methods in both superior predictive accuracy and investment performance, while exhibiting effective risk control. Datasets, source code, and model weights are available at our GitHub repository: https://github.com/PeilinTime/H3M-SSMoEs.
☆ Monopoly Deal: A Benchmark Environment for Bounded One-Sided Response Games
Card games are widely used to study sequential decision-making under uncertainty, with real-world analogues in negotiation, finance, and cybersecurity. Typically, these games fall into three categories based on the flow of control: strictly-sequential (where players alternate single actions), deterministic-response (where some actions trigger a fixed outcome), and unbounded reciprocal-response (where alternating counterplays are permitted). A less-explored but strategically rich structure exists: the bounded one-sided response. This dynamic occurs when a player's action briefly transfers control to the opponent, who must satisfy a fixed condition through one or more sequential moves before the turn resolves. We term games featuring this mechanism Bounded One-Sided Response Games (BORGs). We introduce a modified version of Monopoly Deal as a benchmark environment that specifically isolates the BORG dynamic, where a Rent action forces the opponent to sequentially choose payment assets. We demonstrate that the gold-standard algorithm, Counterfactual Regret Minimization (CFR), successfully converges on effective strategies for this domain without requiring novel algorithmic extensions. To support efficient, reproducible experimentation, we present a lightweight, full-stack research platform that unifies the environment, a parallelized CFR runtime, and a human-playable web interface, all runnable on a single workstation. This system provides a practical foundation for exploring state representation and policy learning in bounded one-sided response settings. The trained CFR agent and source code are available at https://monopolydeal.ai.
comment: 24 pages, 7 figures
☆ Reasoning-Aware GRPO using Process Mining
Reinforcement learning (RL)-based post-training has been crucial for enabling multi-step reasoning in large reasoning models (LRMs), yet current reward schemes are typically outcome-centric. We propose PM4GRPO, a reasoning-aware Group Relative Policy Optimization (GRPO) that augments standard answer/format rewards with signals over the reasoning procedure. To this end, process mining techniques are utilized to compute a scalar conformance reward that measures how closely a policy model's reasoning aligns with the pretrained teacher model. The empirical results on five benchmarks demonstrate that PM4GRPO significantly outperforms existing methodologies for GRPO-based post-training. These results highlight that leveraging process mining for reasoning-aware GRPO effectively enhances the reasoning capabilities of policy models.
☆ GAPMAP: Mapping Scientific Knowledge Gaps in Biomedical Literature Using Large Language Models
Scientific progress is driven by the deliberate articulation of what remains unknown. This study investigates the ability of large language models (LLMs) to identify research knowledge gaps in the biomedical literature. We define two categories of knowledge gaps: explicit gaps, clear declarations of missing knowledge; and implicit gaps, context-inferred missing knowledge. While prior work has focused mainly on explicit gap detection, we extend this line of research by addressing the novel task of inferring implicit gaps. We conducted two experiments on almost 1500 documents across four datasets, including a manually annotated corpus of biomedical articles. We benchmarked both closed-weight models (from OpenAI) and open-weight models (Llama and Gemma 2) under paragraph-level and full-paper settings. To address the reasoning of implicit gaps inference, we introduce \textbf{\small TABI}, a Toulmin-Abductive Bucketed Inference scheme that structures reasoning and buckets inferred conclusion candidates for validation. Our results highlight the robust capability of LLMs in identifying both explicit and implicit knowledge gaps. This is true for both open- and closed-weight models, with larger variants often performing better. This suggests a strong ability of LLMs for systematically identifying candidate knowledge gaps, which can support early-stage research formulation, policymakers, and funding decisions. We also report observed failure modes and outline directions for robust deployment, including domain adaptation, human-in-the-loop verification, and benchmarking across open- and closed-weight models.
☆ Scalable predictive processing framework for multitask caregiving robots
The rapid aging of societies is intensifying demand for autonomous care robots; however, most existing systems are task-specific and rely on handcrafted preprocessing, limiting their ability to generalize across diverse scenarios. A prevailing theory in cognitive neuroscience proposes that the human brain operates through hierarchical predictive processing, which underlies flexible cognition and behavior by integrating multimodal sensory signals. Inspired by this principle, we introduce a hierarchical multimodal recurrent neural network grounded in predictive processing under the free-energy principle, capable of directly integrating over 30,000-dimensional visuo-proprioceptive inputs without dimensionality reduction. The model was able to learn two representative caregiving tasks, rigid-body repositioning and flexible-towel wiping, without task-specific feature engineering. We demonstrate three key properties: (i) self-organization of hierarchical latent dynamics that regulate task transitions, capture variability in uncertainty, and infer occluded states; (ii) robustness to degraded vision through visuo-proprioceptive integration; and (iii) asymmetric interference in multitask learning, where the more variable wiping task had little influence on repositioning, whereas learning the repositioning task led to a modest reduction in wiping performance, while the model maintained overall robustness. Although the evaluation was limited to simulation, these results establish predictive processing as a universal and scalable computational principle, pointing toward robust, flexible, and autonomous caregiving robots while offering theoretical insight into the human brain's ability to achieve flexible adaptation in uncertain real-world environments.
☆ Rethinking Cross-lingual Alignment: Balancing Transfer and Cultural Erasure in Multilingual LLMs
Cross-lingual alignment (CLA) aims to align multilingual representations, enabling Large Language Models (LLMs) to seamlessly transfer knowledge across languages. While intuitive, we hypothesize, this pursuit of representational convergence can inadvertently cause "cultural erasure", the functional loss of providing culturally-situated responses that should diverge based on the query language. In this work, we systematically analyze this trade-off by introducing a holistic evaluation framework, the transfer-localization plane, which quantifies both desirable knowledge transfer and undesirable cultural erasure. Using this framework, we re-evaluate recent CLA approaches and find that they consistently improve factual transfer at the direct cost of cultural localization across all six languages studied. Our investigation into the internal representations of these models reveals a key insight: universal factual transfer and culturally-specific knowledge are optimally steerable at different model layers. Based on this finding, we propose Surgical Steering, a novel inference-time method that disentangles these two objectives. By applying targeted activation steering to distinct layers, our approach achieves a better balance between the two competing dimensions, effectively overcoming the limitations of current alignment techniques.
☆ Large Language Model-assisted Autonomous Vehicle Recovery from Immobilization
Despite significant advancements in recent decades, autonomous vehicles (AVs) continue to face challenges in navigating certain traffic scenarios where human drivers excel. In such situations, AVs often become immobilized, disrupting overall traffic flow. Current recovery solutions, such as remote intervention (which is costly and inefficient) and manual takeover (which excludes non-drivers and limits AV accessibility), are inadequate. This paper introduces StuckSolver, a novel Large Language Model (LLM) driven recovery framework that enables AVs to resolve immobilization scenarios through self-reasoning and/or passenger-guided decision-making. StuckSolver is designed as a plug-in add-on module that operates on top of the AV's existing perception-planning-control stack, requiring no modification to its internal architecture. Instead, it interfaces with standard sensor data streams to detect immobilization states, interpret environmental context, and generate high-level recovery commands that can be executed by the AV's native planner. We evaluate StuckSolver on the Bench2Drive benchmark and in custom-designed uncertainty scenarios. Results show that StuckSolver achieves near-state-of-the-art performance through autonomous self-reasoning alone and exhibits further improvements when passenger guidance is incorporated.
comment: 8 pages
☆ PORTool: Tool-Use LLM Training with Rewarded Tree
Current tool-use large language models (LLMs) are trained on static datasets, enabling them to interact with external tools and perform multi-step, tool-integrated reasoning, which produces tool-call trajectories. However, these models imitate how a query is resolved in a generic tool-call routine, thereby failing to explore possible solutions and demonstrating limited performance in an evolved, dynamic tool-call environment. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories yielding the correct answer. Specifically, this method starts with generating multiple rollouts for a given query, and some of them share the first few tool-call steps, thereby forming a tree-like structure. Next, we assign rewards to each step, based on its ability to produce a correct answer and make successful tool calls. A shared step across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to calculate fork-relative advantages, blended with trajectory-relative advantages, to train the LLM for tool use. The experiments utilize 17 tools to address user queries, covering both time-sensitive and time-invariant topics. We conduct ablation studies to systematically justify the necessity and the design robustness of step-wise rewards. Furthermore, we compare the proposed PORTool with other training approaches and demonstrate significant improvements in final accuracy and the number of tool-call steps.
☆ RADRON: Cooperative Localization of Ionizing Radiation Sources by MAVs with Compton Cameras
We present a novel approach to localizing radioactive material by cooperating Micro Aerial Vehicles (MAVs). Our approach utilizes a state-of-the-art single-detector Compton camera as a highly sensitive, yet miniature detector of ionizing radiation. The detector's exceptionally low weight (40 g) opens up new possibilities of radiation detection by a team of cooperating agile MAVs. We propose a new fundamental concept of fusing the Compton camera measurements to estimate the position of the radiation source in real time even from extremely sparse measurements. The data readout and processing are performed directly onboard and the results are used in a dynamic feedback to drive the motion of the vehicles. The MAVs are stabilized in a tightly cooperating swarm to maximize the information gained by the Compton cameras, rapidly locate the radiation source, and even track a moving radiation source.
comment: 8 pages, 9 figures, submitted for review to IEEE RA-L
☆ Climate Adaptation-Aware Flood Prediction for Coastal Cities Using Deep Learning
Climate change and sea-level rise (SLR) pose escalating threats to coastal cities, intensifying the need for efficient and accurate methods to predict potential flood hazards. Traditional physics-based hydrodynamic simulators, although precise, are computationally expensive and impractical for city-scale coastal planning applications. Deep Learning (DL) techniques offer promising alternatives, however, they are often constrained by challenges such as data scarcity and high-dimensional output requirements. Leveraging a recently proposed vision-based, low-resource DL framework, we develop a novel, lightweight Convolutional Neural Network (CNN)-based model designed to predict coastal flooding under variable SLR projections and shoreline adaptation scenarios. Furthermore, we demonstrate the ability of the model to generalize across diverse geographical contexts by utilizing datasets from two distinct regions: Abu Dhabi and San Francisco. Our findings demonstrate that the proposed model significantly outperforms state-of-the-art methods, reducing the mean absolute error (MAE) in predicted flood depth maps on average by nearly 20%. These results highlight the potential of our approach to serve as a scalable and practical tool for coastal flood management, empowering decision-makers to develop effective mitigation strategies in response to the growing impacts of climate change. Project Page: https://caspiannet.github.io/
comment: Submitted to Hydrology and Earth System Sciences
☆ Dual Mixture-of-Experts Framework for Discrete-Time Survival Analysis NeurIPS 2025
Survival analysis is a task to model the time until an event of interest occurs, widely used in clinical and biomedical research. A key challenge is to model patient heterogeneity while also adapting risk predictions to both individual characteristics and temporal dynamics. We propose a dual mixture-of-experts (MoE) framework for discrete-time survival analysis. Our approach combines a feature-encoder MoE for subgroup-aware representation learning with a hazard MoE that leverages patient features and time embeddings to capture temporal dynamics. This dual-MoE design flexibly integrates with existing deep learning based survival pipelines. On METABRIC and GBSG breast cancer datasets, our method consistently improves performance, boosting the time-dependent C-index up to 0.04 on the test sets, and yields further gains when incorporated into the Consurv framework.
comment: Accepted to NeurIPS 2025 workshop Learning from Time Series for Health (TS4H)
☆ AutoSurvey2: Empowering Researchers with Next Level Automated Literature Surveys KDD 2025
The rapid growth of research literature, particularly in large language models (LLMs), has made producing comprehensive and current survey papers increasingly difficult. This paper introduces autosurvey2, a multi-stage pipeline that automates survey generation through retrieval-augmented synthesis and structured evaluation. The system integrates parallel section generation, iterative refinement, and real-time retrieval of recent publications to ensure both topical completeness and factual accuracy. Quality is assessed using a multi-LLM evaluation framework that measures coverage, structure, and relevance in alignment with expert review standards. Experimental results demonstrate that autosurvey2 consistently outperforms existing retrieval-based and automated baselines, achieving higher scores in structural coherence and topical relevance while maintaining strong citation fidelity. By combining retrieval, reasoning, and automated evaluation into a unified framework, autosurvey2 provides a scalable and reproducible solution for generating long-form academic surveys and contributes a solid foundation for future research on automated scholarly writing. All code and resources are available at https://github.com/annihi1ation/auto_research.
comment: TKDD 2025
☆ The Quest for Reliable Metrics of Responsible AI
The development of Artificial Intelligence (AI), including AI in Science (AIS), should be done following the principles of responsible AI. Progress in responsible AI is often quantified through evaluation metrics, yet there has been less work on assessing the robustness and reliability of the metrics themselves. We reflect on prior work that examines the robustness of fairness metrics for recommender systems as a type of AI application and summarise their key takeaways into a set of non-exhaustive guidelines for developing reliable metrics of responsible AI. Our guidelines apply to a broad spectrum of AI applications, including AIS.
comment: Accepted for presentation at the AI in Science Summit 2025
☆ DARTS: A Drone-Based AI-Powered Real-Time Traffic Incident Detection System
Rapid and reliable incident detection is critical for reducing crash-related fatalities, injuries, and congestion. However, conventional methods, such as closed-circuit television, dashcam footage, and sensor-based detection, separate detection from verification, suffer from limited flexibility, and require dense infrastructure or high penetration rates, restricting adaptability and scalability to shifting incident hotspots. To overcome these challenges, we developed DARTS, a drone-based, AI-powered real-time traffic incident detection system. DARTS integrates drones' high mobility and aerial perspective for adaptive surveillance, thermal imaging for better low-visibility performance and privacy protection, and a lightweight deep learning framework for real-time vehicle trajectory extraction and incident detection. The system achieved 99% detection accuracy on a self-collected dataset and supports simultaneous online visual verification, severity assessment, and incident-induced congestion propagation monitoring via a web-based interface. In a field test on Interstate 75 in Florida, DARTS detected and verified a rear-end collision 12 minutes earlier than the local transportation management center and monitored incident-induced congestion propagation, suggesting potential to support faster emergency response and enable proactive traffic control to reduce congestion and secondary crash risk. Crucially, DARTS's flexible deployment architecture reduces dependence on frequent physical patrols, indicating potential scalability and cost-effectiveness for use in remote areas and resource-constrained settings. This study presents a promising step toward a more flexible and integrated real-time traffic incident detection system, with significant implications for the operational efficiency and responsiveness of modern transportation management.
comment: Preprint version. This manuscript is currently under review at Transportation Research Part C: Emerging Technologies. The PDF corresponds to the version submitted in June 2025. The main findings of this work were recognized with the Best Intelligent Transportation Systems Paper Award at the 2025 TRB Annual Meeting
☆ From Queries to Insights: Agentic LLM Pipelines for Spatio-Temporal Text-to-SQL SP
Natural-language-to-SQL (NL-to-SQL) systems hold promise for democratizing access to structured data, allowing users to query databases without learning SQL. Yet existing systems struggle with realistic spatio-temporal queries, where success requires aligning vague user phrasing with schema-specific categories, handling temporal reasoning, and choosing appropriate outputs. We present an agentic pipeline that extends a naive text-to-SQL baseline (llama-3-sqlcoder-8b) with orchestration by a Mistral-based ReAct agent. The agent can plan, decompose, and adapt queries through schema inspection, SQL generation, execution, and visualization tools. We evaluate on 35 natural-language queries over the NYC and Tokyo check-in dataset, covering spatial, temporal, and multi-dataset reasoning. The agent achieves substantially higher accuracy than the naive baseline 91.4% vs. 28.6% and enhances usability through maps, plots, and structured natural-language summaries. Crucially, our design enables more natural human-database interaction, supporting users who lack SQL expertise, detailed schema knowledge, or prompting skill. We conclude that agentic orchestration, rather than stronger SQL generators alone, is a promising foundation for interactive geospatial assistants.
comment: 8 pages, 5 figures, GeoGenAgent'25 - ACM SIGSPATIAL
☆ Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
☆ Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer
Reconstructing images seen by people from their fMRI brain recordings provides a non-invasive window into the human brain. Despite recent progress enabled by diffusion models, current methods often lack faithfulness to the actual seen images. We present "Brain-IT", a brain-inspired approach that addresses this challenge through a Brain Interaction Transformer (BIT), allowing effective interactions between clusters of functionally-similar brain-voxels. These functional-clusters are shared by all subjects, serving as building blocks for integrating information both within and across brains. All model components are shared by all clusters & subjects, allowing efficient training with a limited amount of data. To guide the image reconstruction, BIT predicts two complementary localized patch-level image features: (i)high-level semantic features which steer the diffusion model toward the correct semantic content of the image; and (ii)low-level structural features which help to initialize the diffusion process with the correct coarse layout of the image. BIT's design enables direct flow of information from brain-voxel clusters to localized image features. Through these principles, our method achieves image reconstructions from fMRI that faithfully reconstruct the seen images, and surpass current SotA approaches both visually and by standard objective metrics. Moreover, with only 1-hour of fMRI data from a new subject, we achieve results comparable to current methods trained on full 40-hour recordings.
☆ WaveVerif: Acoustic Side-Channel based Verification of Robotic Workflows
In this paper, we present a framework that uses acoustic side-channel analysis (ASCA) to monitor and verify whether a robot correctly executes its intended commands. We develop and evaluate a machine-learning-based workflow verification system that uses acoustic emissions generated by robotic movements. The system can determine whether real-time behavior is consistent with expected commands. The evaluation takes into account movement speed, direction, and microphone distance. The results show that individual robot movements can be validated with over 80% accuracy under baseline conditions using four different classifiers: Support Vector Machine (SVM), Deep Neural Network (DNN), Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN). Additionally, workflows such as pick-and-place and packing could be identified with similarly high confidence. Our findings demonstrate that acoustic signals can support real-time, low-cost, passive verification in sensitive robotic environments without requiring hardware modifications.
comment: 11 pages, 3 figures, Corresponding Author: Prof. Shishir Nagaraja (shishir.nagaraja@newcastle.ac.uk)
☆ Application and Validation of Geospatial Foundation Model Data for the Prediction of Health Facility Programmatic Outputs -- A Case Study in Malawi
The reliability of routine health data in low and middle-income countries (LMICs) is often constrained by reporting delays and incomplete coverage, necessitating the exploration of novel data sources and analytics. Geospatial Foundation Models (GeoFMs) offer a promising avenue by synthesizing diverse spatial, temporal, and behavioral data into mathematical embeddings that can be efficiently used for downstream prediction tasks. This study evaluated the predictive performance of three GeoFM embedding sources - Google Population Dynamics Foundation Model (PDFM), Google AlphaEarth (derived from satellite imagery), and mobile phone call detail records (CDR) - for modeling 15 routine health programmatic outputs in Malawi, and compared their utility to traditional geospatial interpolation methods. We used XGBoost models on data from 552 health catchment areas (January 2021-May 2023), assessing performance with R2, and using an 80/20 training and test data split with 5-fold cross-validation used in training. While predictive performance was mixed, the embedding-based approaches improved upon baseline geostatistical methods in 13 of 15 (87%) indicators tested. A Multi-GeoFM model integrating all three embedding sources produced the most robust predictions, achieving average 5-fold cross validated R2 values for indicators like population density (0.63), new HIV cases (0.57), and child vaccinations (0.47) and test set R2 of 0.64, 0.68, and 0.55, respectively. Prediction was poor for prediction targets with low primary data availability, such as TB and malnutrition cases. These results demonstrate that GeoFM embeddings imbue a modest predictive improvement for select health and demographic outcomes in an LMIC context. We conclude that the integration of multiple GeoFM sources is an efficient and valuable tool for supplementing and strengthening constrained routine health information systems.
comment: 13 pages, 3010 words, 2 tables, 2 figures
☆ Estimating cognitive biases with attention-aware inverse planning
People's goal-directed behaviors are influenced by their cognitive biases, and autonomous systems that interact with people should be aware of this. For example, people's attention to objects in their environment will be biased in a way that systematically affects how they perform everyday tasks such as driving to work. Here, building on recent work in computational cognitive science, we formally articulate the attention-aware inverse planning problem, in which the goal is to estimate a person's attentional biases from their actions. We demonstrate how attention-aware inverse planning systematically differs from standard inverse reinforcement learning and how cognitive biases can be inferred from behavior. Finally, we present an approach to attention-aware inverse planning that combines deep reinforcement learning with computational cognitive modeling. We use this approach to infer the attentional strategies of RL agents in real-life driving scenarios selected from the Waymo Open Dataset, demonstrating the scalability of estimating cognitive biases with attention-aware inverse planning.
☆ Revisiting Multilingual Data Mixtures in Language Model Pretraining
The impact of different multilingual data mixtures in pretraining large language models (LLMs) has been a topic of ongoing debate, often raising concerns about potential trade-offs between language coverage and model performance (i.e., the curse of multilinguality). In this work, we investigate these assumptions by training 1.1B and 3B parameter LLMs on diverse multilingual corpora, varying the number of languages from 25 to 400. Our study challenges common beliefs surrounding multilingual training. First, we find that combining English and multilingual data does not necessarily degrade the in-language performance of either group, provided that languages have a sufficient number of tokens included in the pretraining corpus. Second, we observe that using English as a pivot language (i.e., a high-resource language that serves as a catalyst for multilingual generalization) yields benefits across language families, and contrary to expectations, selecting a pivot language from within a specific family does not consistently improve performance for languages within that family. Lastly, we do not observe a significant "curse of multilinguality" as the number of training languages increases in models at this scale. Our findings suggest that multilingual data, when balanced appropriately, can enhance language model capabilities without compromising performance, even in low-resource settings
comment: Under Review
☆ A Process Mining-Based System For The Analysis and Prediction of Software Development Workflows
CodeSight is an end-to-end system designed to anticipate deadline compliance in software development workflows. It captures development and deployment data directly from GitHub, transforming it into process mining logs for detailed analysis. From these logs, the system generates metrics and dashboards that provide actionable insights into PR activity patterns and workflow efficiency. Building on this structured representation, CodeSight employs an LSTM model that predicts remaining PR resolution times based on sequential activity traces and static features, enabling early identification of potential deadline breaches. In tests, the system demonstrates high precision and F1 scores in predicting deadline compliance, illustrating the value of integrating process mining with machine learning for proactive software project management.
comment: 16 pages, 7 figures, 4 tables
☆ Humains-Junior: A 3.8B Language Model Achieving GPT-4o-Level Factual Accuracy by Directed Exoskeleton Reasoning
We introduce Humans-Junior, a 3.8B model that matches GPT-4o on the FACTS Grounding public subset within a $\pm 5$ pp equivalence margin. Results. On Q1--Q500 under identical judges, GPT-4o scores 73.5% (95% CI 69.5--77.2) and Humans-Junior 72.7% (95% CI 68.7--76.5); the paired difference is 0.8 pp (bootstrap 95% CI $-3.1$ to $+4.7$; permutation $p = 0.72$; Cohen's $d = 0.023$). TOST establishes equivalence at $\pm 5$ pp (not at $\pm 3$ pp). When purchased as managed APIs, Humans-Junior's base model (Phi-3.5-mini-instruct) is $\approx 19\times$ less expensive than GPT-4o on Microsoft AI Foundry pricing; self-hosted or edge deployments can drive incremental inference cost toward zero. Measured vs estimated pricing sources are tabulated in Appendix E. Method. Our approach combines minimal directed "Exoskeleton Reasoning" scaffolds with behavioral fine-tuning that teaches protocol compliance (epistemic discipline) rather than domain answers. Fine-tuning alone adds little; combined, they synergize (+17.7 pp, $p < 0.001$) and reduce variance ($\approx 25\%$). In prompt-only settings on frontier models (Q1--Q100; non-comparable), directed reasoning improved GPT-4o by +11.8 pp to 85.3% and Gemini-2.5-Pro by +5.0 pp to 93.3% (baseline 88.3%, $n = 100$); see Section~5. TL;DR. A 3.8B model achieves GPT-4o-level FACTS accuracy (equivalent within $\pm 5$ pp on Q1--Q500). Cloud pricing shows $\approx 19\times$ lower cost versus GPT-4o, and self-hosted/edge deployments can approach zero marginal cost. Pricing sources are listed in Appendix E. Frontier prompt-only gains (Q1--Q100; non-comparable) and optimized-prompt exploratory results under earlier judges are summarized in Appendix F. Keywords: Small Language Models, Factual Grounding, Directed Reasoning, Fine-Tuning, Model Alignment, Cost-Efficient AI
☆ Multi-Agent Reinforcement Learning for Market Making: Competition without Collusion
Algorithmic collusion has emerged as a central question in AI: Will the interaction between different AI agents deployed in markets lead to collusion? More generally, understanding how emergent behavior, be it a cartel or market dominance from more advanced bots, affects the market overall is an important research question. We propose a hierarchical multi-agent reinforcement learning framework to study algorithmic collusion in market making. The framework includes a self-interested market maker (Agent~A), which is trained in an uncertain environment shaped by an adversary, and three bottom-layer competitors: the self-interested Agent~B1 (whose objective is to maximize its own PnL), the competitive Agent~B2 (whose objective is to minimize the PnL of its opponent), and the hybrid Agent~B$^\star$, which can modulate between the behavior of the other two. To analyze how these agents shape the behavior of each other and affect market outcomes, we propose interaction-level metrics that quantify behavioral asymmetry and system-level dynamics, while providing signals potentially indicative of emergent interaction patterns. Experimental results show that Agent~B2 secures dominant performance in a zero-sum setting against B1, aggressively capturing order flow while tightening average spreads, thus improving market execution efficiency. In contrast, Agent~B$^\star$ exhibits a self-interested inclination when co-existing with other profit-seeking agents, securing dominant market share through adaptive quoting, yet exerting a milder adverse impact on the rewards of Agents~A and B1 compared to B2. These findings suggest that adaptive incentive control supports more sustainable strategic co-existence in heterogeneous agent environments and offers a structured lens for evaluating behavioral design in algorithmic trading systems.
☆ Transferring Causal Effects using Proxies NeurIPS 2025
We consider the problem of estimating a causal effect in a multi-domain setting. The causal effect of interest is confounded by an unobserved confounder and can change between the different domains. We assume that we have access to a proxy of the hidden confounder and that all variables are discrete or categorical. We propose methodology to estimate the causal effect in the target domain, where we assume to observe only the proxy variable. Under these conditions, we prove identifiability (even when treatment and response variables are continuous). We introduce two estimation techniques, prove consistency, and derive confidence intervals. The theoretical results are supported by simulation studies and a real-world example studying the causal effect of website rankings on consumer choices.
comment: Advances in Neural Information Processing Systems (NeurIPS 2025) camera-ready version
☆ FinOps Agent -- A Use-Case for IT Infrastructure and Cost Optimization
FinOps (Finance + Operations) represents an operational framework and cultural practice which maximizes cloud business value through collaborative financial accountability across engineering, finance, and business teams. FinOps practitioners face a fundamental challenge: billing data arrives in heterogeneous formats, taxonomies, and metrics from multiple cloud providers and internal systems which eventually lead to synthesizing actionable insights, and making time-sensitive decisions. To address this challenge, we propose leveraging autonomous, goal-driven AI agents for FinOps automation. In this paper, we built a FinOps agent for a typical use-case for IT infrastructure and cost optimization. We built a system simulating a realistic end-to-end industry process starting with retrieving data from various sources to consolidating and analyzing the data to generate recommendations for optimization. We defined a set of metrics to evaluate our agent using several open-source and close-source language models and it shows that the agent was able to understand, plan, and execute tasks as well as an actual FinOps practitioner.
☆ SciTrust 2.0: A Comprehensive Framework for Evaluating Trustworthiness of Large Language Models in Scientific Applications
Large language models (LLMs) have demonstrated transformative potential in scientific research, yet their deployment in high-stakes contexts raises significant trustworthiness concerns. Here, we introduce SciTrust 2.0, a comprehensive framework for evaluating LLM trustworthiness in scientific applications across four dimensions: truthfulness, adversarial robustness, scientific safety, and scientific ethics. Our framework incorporates novel, open-ended truthfulness benchmarks developed through a verified reflection-tuning pipeline and expert validation, alongside a novel ethics benchmark for scientific research contexts covering eight subcategories including dual-use research and bias. We evaluated seven prominent LLMs, including four science-specialized models and three general-purpose industry models, using multiple evaluation metrics including accuracy, semantic similarity measures, and LLM-based scoring. General-purpose industry models overall outperformed science-specialized models across each trustworthiness dimension, with GPT-o4-mini demonstrating superior performance in truthfulness assessments and adversarial robustness. Science-specialized models showed significant deficiencies in logical and ethical reasoning capabilities, along with concerning vulnerabilities in safety evaluations, particularly in high-risk domains such as biosecurity and chemical weapons. By open-sourcing our framework, we provide a foundation for developing more trustworthy AI systems and advancing research on model safety and ethics in scientific contexts.
comment: Preprint Submitted to ACM Transactions on AI for Science (TAIS)
☆ Evaluating the Impact of LLM-Assisted Annotation in a Perspectivized Setting: the Case of FrameNet Annotation
The use of LLM-based applications as a means to accelerate and/or substitute human labor in the creation of language resources and dataset is a reality. Nonetheless, despite the potential of such tools for linguistic research, comprehensive evaluation of their performance and impact on the creation of annotated datasets, especially under a perspectivized approach to NLP, is still missing. This paper contributes to reduction of this gap by reporting on an extensive evaluation of the (semi-)automatization of FrameNet-like semantic annotation by the use of an LLM-based semantic role labeler. The methodology employed compares annotation time, coverage and diversity in three experimental settings: manual, automatic and semi-automatic annotation. Results show that the hybrid, semi-automatic annotation setting leads to increased frame diversity and similar annotation coverage, when compared to the human-only setting, while the automatic setting performs considerably worse in all metrics, except for annotation time.
☆ PRISM: Proof-Carrying Artifact Generation through LLM x MDE Synergy and Stratified Constraints
PRISM unifies Large Language Models with Model-Driven Engineering to generate regulator-ready artifacts and machine-checkable evidence for safety- and compliance-critical domains. PRISM integrates three pillars: a Unified Meta-Model (UMM) reconciles heterogeneous schemas and regulatory text into a single semantic space; an Integrated Constraint Model (ICM) compiles structural and semantic requirements into enforcement artifacts including generation-time automata (GBNF, DFA) and post-generation validators (e.g., SHACL, SMT); and Constraint-Guided Verifiable Generation (CVG) applies these through two-layer enforcement - structural constraints drive prefix-safe decoding while semantic/logical validation produces machine-checkable certificates. When violations occur, PRISM performs audit-guided repair and records generation traces for compliance review. We evaluate PRISM in automotive software engineering (AUTOSAR) and cross-border legal jurisdiction (Brussels I bis). PRISM produces structurally valid, auditable artifacts that integrate with existing tooling and substantially reduce manual remediation effort, providing a practical path toward automated artifact generation with built-in assurance.
comment: 45 pages, 9 figures
☆ Approximating Human Preferences Using a Multi-Judge Learned System
Aligning LLM-based judges with human preferences is a significant challenge, as they are difficult to calibrate and often suffer from rubric sensitivity, bias, and instability. Overcoming this challenge advances key applications, such as creating reliable reward models for Reinforcement Learning from Human Feedback (RLHF) and building effective routing systems that select the best-suited model for a given user query. In this work, we propose a framework for modeling diverse, persona-based preferences by learning to aggregate outputs from multiple rubric-conditioned judges. We investigate the performance of this approach against naive baselines and assess its robustness through case studies on both human and LLM-judges biases. Our primary contributions include a persona-based method for synthesizing preference labels at scale and two distinct implementations of our aggregator: Generalized Additive Model (GAM) and a Multi-Layer Perceptron (MLP).
☆ The Information-Theoretic Imperative: Compression and the Epistemic Foundations of Intelligence
Existing frameworks converge on the centrality of compression to intelligence but leave underspecified why this process enforces the discovery of causal structure rather than superficial statistical patterns. We introduce a two-level framework to address this gap. The Information-Theoretic Imperative (ITI) establishes that any system persisting in uncertain environments must minimize epistemic entropy through predictive compression: this is the evolutionary "why" linking survival pressure to information-processing demands. The Compression Efficiency Principle (CEP) specifies how efficient compression mechanically selects for generative, causal models through exception-accumulation dynamics, making reality alignment a consequence rather than a contingent achievement. Together, ITI and CEP define a causal chain: from survival pressure to prediction necessity, compression requirement, efficiency optimization, generative structure discovery, and ultimately reality alignment. Each link follows from physical, information-theoretic, or evolutionary constraints, implying that intelligence is the mechanically necessary outcome of persistence in structured environments. This framework yields empirically testable predictions: compression efficiency, measured as approach to the rate-distortion frontier, correlates with out-of-distribution generalization; exception-accumulation rates differentiate causal from correlational models; hierarchical systems exhibit increasing efficiency across abstraction layers; and biological systems demonstrate metabolic costs that track representational complexity. ITI and CEP thereby provide a unified account of convergence across biological, artificial, and multi-scale systems, addressing the epistemic and functional dimensions of intelligence without invoking assumptions about consciousness or subjective experience.
comment: 41 pages, 2 tables, 3 appendices. Submitted to arXiv for open access
☆ AAGATE: A NIST AI RMF-Aligned Governance Platform for Agentic AI
This paper introduces the Agentic AI Governance Assurance & Trust Engine (AAGATE), a Kubernetes-native control plane designed to address the unique security and governance challenges posed by autonomous, language-model-driven agents in production. Recognizing the limitations of traditional Application Security (AppSec) tooling for improvisational, machine-speed systems, AAGATE operationalizes the NIST AI Risk Management Framework (AI RMF). It integrates specialized security frameworks for each RMF function: the Agentic AI Threat Modeling MAESTRO framework for Map, a hybrid of OWASP's AIVSS and SEI's SSVC for Measure, and the Cloud Security Alliance's Agentic AI Red Teaming Guide for Manage. By incorporating a zero-trust service mesh, an explainable policy engine, behavioral analytics, and decentralized accountability hooks, AAGATE provides a continuous, verifiable governance solution for agentic AI, enabling safe, accountable, and scalable deployment. The framework is further extended with DIRF for digital identity rights, LPCI defenses for logic-layer injection, and QSAF monitors for cognitive degradation, ensuring governance spans systemic, adversarial, and ethical risks.
☆ Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, when human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLM models. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance the reliability of LLM raters.
☆ Symbolically Scaffolded Play: Designing Role-Sensitive Prompts for Generative NPC Dialogue
Large Language Models (LLMs) promise to transform interactive games by enabling non-player characters (NPCs) to sustain unscripted dialogue. Yet it remains unclear whether constrained prompts actually improve player experience. We investigate this question through The Interview, a voice-based detective game powered by GPT-4o. A within-subjects usability study ($N=10$) compared high-constraint (HCP) and low-constraint (LCP) prompts, revealing no reliable experiential differences beyond sensitivity to technical breakdowns. Guided by these findings, we redesigned the HCP into a hybrid JSON+RAG scaffold and conducted a synthetic evaluation with an LLM judge, positioned as an early-stage complement to usability testing. Results uncovered a novel pattern: scaffolding effects were role-dependent: the Interviewer (quest-giver NPC) gained stability, while suspect NPCs lost improvisational believability. These findings overturn the assumption that tighter constraints inherently enhance play. Extending fuzzy-symbolic scaffolding, we introduce \textit{Symbolically Scaffolded Play}, a framework in which symbolic structures are expressed as fuzzy, numerical boundaries that stabilize coherence where needed while preserving improvisation where surprise sustains engagement.
☆ Identity Management for Agentic AI: The new frontier of authorization, authentication, and security for an AI agent world
The rapid rise of AI agents presents urgent challenges in authentication, authorization, and identity management. Current agent-centric protocols (like MCP) highlight the demand for clarified best practices in authentication and authorization. Looking ahead, ambitions for highly autonomous agents raise complex long-term questions regarding scalable access control, agent-centric identities, AI workload differentiation, and delegated authority. This OpenID Foundation whitepaper is for stakeholders at the intersection of AI agents and access management. It outlines the resources already available for securing today's agents and presents a strategic agenda to address the foundational authentication, authorization, and identity problems pivotal for tomorrow's widespread autonomous systems.
☆ ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion NeurIPS 2025
Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. In this paper, we propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training. A core component of our framework is Neighborhood Patch Attention (NPA), an efficient mechanism that reduces computational redundancy in the self-attention layer with non-overlapping patches. We integrate NPA into an SDEdit pipeline and introduce Latent Frequency Mixing (LFM) to better generate fine details. Furthermore, we apply Structure Guidance to enhance global structure during the denoising process. Experimental results demonstrate that ScaleDiff achieves state-of-the-art performance among training-free methods in terms of both image quality and inference speed on both U-Net and Diffusion Transformer architectures.
comment: NeurIPS 2025. Code: https://github.com/KSH00906/ScaleDiff
☆ An Agentic Framework for Rapid Deployment of Edge AI Solutions in Industry 5.0
We present a novel framework for Industry 5.0 that simplifies the deployment of AI models on edge devices in various industrial settings. The design reduces latency and avoids external data transfer by enabling local inference and real-time processing. Our implementation is agent-based, which means that individual agents, whether human, algorithmic, or collaborative, are responsible for well-defined tasks, enabling flexibility and simplifying integration. Moreover, our framework supports modular integration and maintains low resource requirements. Preliminary evaluations concerning the food industry in real scenarios indicate improved deployment time and system adaptability performance. The source code is publicly available at https://github.com/AI-REDGIO-5-0/ci-component.
♻ ☆ 3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency
AI inference scaling is often tuned through 1D heuristics (a fixed reasoning passes) or 2D bivariate trade-offs (e.g., performance vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraints-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods to address the 3D multi-objective optimization (MOO) problem. Framing inference scaling in MOO shapes a feasible space that 1D and 2D optimizations fail to capture, enabling environmentadaptive selection of the inference scaling k. Results show that knee-point optimization achieves the best balance, while accuracy-maximization remains favorable when precision is prioritized. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational contexts.
♻ ☆ SpecCLIP: Aligning and Translating Spectroscopic Measurements for Stars
In recent years, large language models (LLMs) have transformed natural language understanding through vast datasets and large-scale parameterization. Inspired by this success, we present SpecCLIP, a foundation model framework that extends LLM-inspired methodologies to stellar spectral analysis. Stellar spectra, akin to structured language, encode rich physical and chemical information about stars. By training foundation models on large-scale spectral datasets, our goal is to learn robust and informative embeddings that support diverse downstream applications. As a proof of concept, SpecCLIP involves pre-training on two spectral types--LAMOST low-resolution and Gaia XP--followed by contrastive alignment using the CLIP (Contrastive Language-Image Pre-training) framework, adapted to associate spectra from different instruments. This alignment is complemented by auxiliary decoders that preserve spectrum-specific information and enable translation (prediction) between spectral types, with the former achieved by maximizing mutual information between embeddings and input spectra. The result is a cross-spectrum framework enabling intrinsic calibration and flexible applications across instruments. We demonstrate that fine-tuning these models on moderate-sized labeled datasets improves adaptability to tasks such as stellar-parameter estimation and chemical-abundance determination. SpecCLIP also enhances the accuracy and precision of parameter estimates benchmarked against external survey data. Additionally, its similarity search and cross-spectrum prediction capabilities offer potential for anomaly detection. Our results suggest that contrastively trained foundation models enriched with spectrum-aware decoders can advance precision stellar spectroscopy.
comment: 27 pages, 8 figures, 5 tables. Minor update: added corrected acknowledgments and corrected a misstated hyperparameter value (noted in footnote) for reproducibility. Submitted to AAS Journals. Comments welcome
♻ ☆ Curiosity-driven RL for symbolic equation solving NeurIPS 2025
We explore if RL can be useful for symbolic mathematics. Previous work showed contrastive learning can solve linear equations in one variable. We show model-free PPO \cite{schulman2017proximal} augmented with curiosity-based exploration and graph-based actions can solve nonlinear equations such as those involving radicals, exponentials, and trig functions. Our work suggests curiosity-based exploration may be useful for general symbolic reasoning tasks.
comment: Accepted at the NeurIPS 2025 MATH-AI Workshop
♻ ☆ Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation
Generative AI systems for music and video commonly use text-based filters to prevent the regurgitation of copyrighted material. We expose a fundamental flaw in this approach by introducing Adversarial PhoneTic Prompting (APT), a novel attack that bypasses these safeguards by exploiting phonetic memorization. The APT attack replaces iconic lyrics with homophonic but semantically unrelated alternatives (e.g., "mom's spaghetti" becomes "Bob's confetti"), preserving acoustic structure while altering meaning; we identify high-fidelity phonetic matches using CMU pronouncing dictionary. We demonstrate that leading Lyrics-to-Song (L2S) models like SUNO and YuE regenerate songs with striking melodic and rhythmic similarity to their copyrighted originals when prompted with these altered lyrics. More surprisingly, this vulnerability extends across modalities. When prompted with phonetically modified lyrics from a song, a Text-to-Video (T2V) model like Veo 3 reconstructs visual scenes from the original music video-including specific settings and character archetypes-despite the absence of any visual cues in the prompt. Our findings reveal that models memorize deep, structural patterns tied to acoustics, not just verbatim text. This phonetic-to-visual leakage represents a critical vulnerability in transcript-conditioned generative models, rendering simple copyright filters ineffective and raising urgent concerns about the secure deployment of multimodal AI systems. Demo examples are available at our project page (https://jrohsc.github.io/music_attack/).
♻ ☆ Spontaneous Giving and Calculated Greed in Language Models EMNLP 2025
Large language models demonstrate strong problem-solving abilities through reasoning techniques such as chain-of-thought prompting and reflection. However, it remains unclear whether these reasoning capabilities extend to a form of social intelligence: making effective decisions in cooperative contexts. We examine this question using economic games that simulate social dilemmas. First, we apply chain-of-thought and reflection prompting to GPT-4o in a Public Goods Game. We then evaluate multiple off-the-shelf models across six cooperation and punishment games, comparing those with and without explicit reasoning mechanisms. We find that reasoning models consistently reduce cooperation and norm enforcement, favoring individual rationality. In repeated interactions, groups with more reasoning agents exhibit lower collective gains. These behaviors mirror human patterns of "spontaneous giving and calculated greed." Our findings underscore the need for LLM architectures that incorporate social intelligence alongside reasoning, to help address--rather than reinforce--the challenges of collective action.
comment: Accepted to EMNLP 2025 main conference and selected as an Oral Presentation
♻ ☆ Dynamic Risk Assessments for Offensive Cybersecurity Agents
Foundation models are increasingly becoming better autonomous programmers, raising the prospect that they could also automate dangerous offensive cyber-operations. Current frontier model audits probe the cybersecurity risks of such agents, but most fail to account for the degrees of freedom available to adversaries in the real world. In particular, with strong verifiers and financial incentives, agents for offensive cybersecurity are amenable to iterative improvement by would-be adversaries. We argue that assessments should take into account an expanded threat model in the context of cybersecurity, emphasizing the varying degrees of freedom that an adversary may possess in stateful and non-stateful environments within a fixed compute budget. We show that even with a relatively small compute budget (8 H100 GPU Hours in our study), adversaries can improve an agent's cybersecurity capability on InterCode CTF by more than 40\% relative to the baseline -- without any external assistance. These results highlight the need to evaluate agents' cybersecurity risk in a dynamic manner, painting a more representative picture of risk.
comment: 26 pages, 11 figures
♻ ☆ Which Demographic Features Are Relevant for Individual Fairness Evaluation of U.S. Recidivism Risk Assessment Tools?
Despite its constitutional relevance, the technical ``individual fairness'' criterion has not been operationalized in U.S. state or federal statutes/regulations. We conduct a human subjects experiment to address this gap, evaluating which demographic features are relevant for individual fairness evaluation of recidivism risk assessment (RRA) tools. Our analyses conclude that the individual similarity function should consider age and sex, but it should ignore race.
comment: ICAIL 2025
♻ ☆ Faster and Simpler Greedy Algorithm for $k$-Median and $k$-Means
Clustering problems such as $k$-means and $k$-median are staples of unsupervised learning, and many algorithmic techniques have been developed to tackle their numerous aspects. In this paper, we focus on the class of greedy approximation algorithm, that attracted less attention than local-search or primal-dual counterparts. In particular, we study the recursive greedy algorithm developed by Mettu and Plaxton [SIAM J. Comp 2003]. We provide a simplification of the algorithm, allowing for faster implementation, in graph metrics or in Euclidean space, where our algorithm matches or improves the state-of-the-art.
♻ ☆ RobEthiChor: Automated Context-aware Ethics-based Negotiation for Autonomous Robots
The presence of autonomous systems is growing at a fast pace and it is impacting many aspects of our lives. Designed to learn and act independently, these systems operate and perform decision-making without human intervention. However, they lack the ability to incorporate users' ethical preferences, which are unique for each individual in society and are required to personalize the decision-making processes. This reduces user trust and prevents autonomous systems from behaving according to the moral beliefs of their end-users. When multiple systems interact with differing ethical preferences, they must negotiate to reach an agreement that satisfies the ethical beliefs of all the parties involved and adjust their behavior consequently. To address this challenge, this paper proposes RobEthiChor, an approach that enables autonomous systems to incorporate user ethical preferences and contextual factors into their decision-making through ethics-based negotiation. RobEthiChor features a domain-agnostic reference architecture for designing autonomous systems capable of ethic-based negotiating. The paper also presents RobEthiChor-Ros, an implementation of RobEthiChor within the Robot Operating System (ROS), which can be deployed on robots to provide them with ethics-based negotiation capabilities. To evaluate our approach, we deployed RobEthiChor-Ros on real robots and ran scenarios where a pair of robots negotiate upon resource contention. Experimental results demonstrate the feasibility and effectiveness of the system in realizing ethics-based negotiation. RobEthiChor allowed robots to reach an agreement in more than 73% of the scenarios with an acceptable negotiation time (0.67s on average). Experiments also demonstrate that the negotiation approach implemented in RobEthiChor is scalable.
♻ ☆ Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling
We present LrcSSM, a $\textit{non-linear}$ recurrent model that processes long sequences as fast as today's linear state-space layers. By forcing its Jacobian matrix to be diagonal, the full sequence can be solved in parallel, giving $\mathcal{O}(TD)$ time and memory and only $\mathcal{O}(\log T)$ sequential depth, for input-sequence length $T$ and a state dimension $D$. Moreover, LrcSSM offers a formal gradient-stability guarantee that other input-varying systems such as Liquid-S4 and Mamba do not provide. Importantly, the diagonal Jacobian structure of our model results in no performance loss compared to the original model with dense Jacobian, and the approach can be generalized to other non-linear recurrent models, demonstrating broader applicability. On a suite of long-range forecasting tasks, we demonstrate that LrcSSM outperforms Transformers, LRU, S5, and Mamba.
♻ ☆ DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO NeurIPS 2025
Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement algorithm with group-normalized rewards. However, the effectiveness of GRPO in Video Large Language Models (VideoLLMs) has still been less studyed. In this paper, we explore GRPO and identify two problems that deteriorate the effective learning: (1) reliance on safeguards, and (2) vanishing advantage. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO loss function into a regression task that directly predicts the advantage in GRPO, eliminating the need for safeguards such as the clipping and min functions. It directly aligns the model with advantages, providing guidance to prefer better ones. The difficulty-aware data augmentation strategy augments input prompts/videos to locate the difficulty of samples at solvable difficulty levels, enabling diverse reward signals. Our experimental results show that our approach significantly improves video reasoning performance across multiple benchmarks.
comment: NeurIPS 2025
♻ ☆ Exploring the In-Context Learning Capabilities of LLMs for Money Laundering Detection in Financial Graphs ICDM 2025
The complexity and interconnectivity of entities involved in money laundering demand investigative reasoning over graph-structured data. This paper explores the use of large language models (LLMs) as reasoning engines over localized subgraphs extracted from a financial knowledge graph. We propose a lightweight pipeline that retrieves k-hop neighborhoods around entities of interest, serializes them into structured text, and prompts an LLM via few-shot in-context learning to assess suspiciousness and generate justifications. Using synthetic anti-money laundering (AML) scenarios that reflect common laundering behaviors, we show that LLMs can emulate analyst-style logic, highlight red flags, and provide coherent explanations. While this study is exploratory, it illustrates the potential of LLM-based graph reasoning in AML and lays groundwork for explainable, language-driven financial crime analytics.
comment: Accepted at AI4FCF-ICDM 2025
♻ ☆ The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity
Traditional LLM alignment methods are vulnerable to heterogeneity in human preferences. Fitting a na\"ive probabilistic model to pairwise comparison data (say over prompt-completion pairs) yields an inconsistent estimate of the population-average utility -a canonical measure of social welfare. We propose a new method, dubbed the sign estimator, that provides a simple, provably consistent, and efficient estimator by replacing cross-entropy with binary classification loss in the aggregation step. This simple modification recovers consistent ordinal alignment under mild assumptions and achieves the first polynomial finite-sample error bounds in this setting. In realistic simulations of LLM alignment using digital twins, the sign estimator substantially reduces preference distortion over a panel of simulated personas, cutting (angular) estimation error by nearly 35% and decreasing disagreement with true population preferences from 12% to 8% compared to standard RLHF. Our method also compares favorably to panel data heuristics that explicitly model user heterogeneity and require tracking individual-level preference data-all while maintaining the implementation simplicity of existing LLM alignment pipelines.
♻ ☆ PromptReverb: Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching
Room impulse response (RIR) generation remains a critical challenge for creating immersive virtual acoustic environments. Current methods suffer from two fundamental limitations: the scarcity of full-band RIR datasets and the inability of existing models to generate acoustically accurate responses from diverse input modalities. We present PromptReverb, a two-stage generative framework that addresses these challenges. Our approach combines a variational autoencoder that upsamples band-limited RIRs to full-band quality (48 kHz), and a conditional diffusion transformer model based on rectified flow matching that generates RIRs from descriptions in natural language. Empirical evaluation demonstrates that PromptReverb produces RIRs with superior perceptual quality and acoustic accuracy compared to existing methods, achieving 8.8% mean RT60 error compared to -37% for widely used baselines and yielding more realistic room-acoustic parameters. Our method enables practical applications in virtual reality, architectural acoustics, and audio production where flexible, high-quality RIR synthesis is essential.
comment: 9 pages, 2 figures, 4 tables; v2: corrected spelling of a co-author name; no content changes
♻ ☆ Handling Label Noise via Instance-Level Difficulty Modeling and Dynamic Optimization
Recent studies indicate that deep neural networks degrade in generalization performance under noisy supervision. Existing methods focus on isolating clean subsets or correcting noisy labels, facing limitations such as high computational costs, heavy hyperparameter tuning process, and coarse-grained optimization. To address these challenges, we propose a novel two-stage noisy learning framework that enables instance-level optimization through a dynamically weighted loss function, avoiding hyperparameter tuning. To obtain stable and accurate information about noise modeling, we introduce a simple yet effective metric, termed wrong event, which dynamically models the cleanliness and difficulty of individual samples while maintaining computational costs. Our framework first collects wrong event information and builds a strong base model. Then we perform noise-robust training on the base model, using a probabilistic model to handle the wrong event information of samples. Experiments on five synthetic and real-world LNL benchmarks demonstrate our method surpasses state-of-the-art methods in performance, achieves a nearly 75% reduction in computational time and improves model scalability.
♻ ☆ RL-I2IT: Image-to-Image Translation with Deep Reinforcement Learning
Most existing Image-to-Image Translation (I2IT) methods generate images in a single run of a deep learning (DL) model. However, designing such a single-step model is always challenging, requiring a huge number of parameters and easily falling into bad global minimums and overfitting. In this work, we reformulate I2IT as a step-wise decision-making problem via deep reinforcement learning (DRL) and propose a novel framework that performs RL-based I2IT (RL-I2IT). The key feature in the RL-I2IT framework is to decompose a monolithic learning process into small steps with a lightweight model to progressively transform a source image successively to a target image. Considering that it is challenging to handle high dimensional continuous state and action spaces in the conventional RL framework, we introduce meta policy with a new concept Plan to the standard Actor-Critic model, which is of a lower dimension than the original image and can facilitate the actor to generate a tractable high dimensional action. In the RL-I2IT framework, we also employ a task-specific auxiliary learning strategy to stabilize the training process and improve the performance of the corresponding task. Experiments on several I2IT tasks demonstrate the effectiveness and robustness of the proposed method when facing high-dimensional continuous action space problems. Our implementation of the RL-I2IT framework is available at https://github.com/Algolzw/SPAC-Deformable-Registration.
♻ ☆ Decom-Renorm-Merge: Model Merging on the Right Space Improves Multitasking
In the era of large-scale training, model merging has evolved into a tool for creating multitasking models efficiently. It enables the knowledge of models to be fused, without the need for heavy computation as required in traditional multitask learning. Existing merging methods often assume that entries at identical positions in weight matrices serve the same function, enabling straightforward entry-wise comparison and merging. However, this assumption overlooks the complexity of finetuned neural networks, where neurons may develop distinct feature compositions, making direct entry-wise merging problematic. We present Decom-Renorm-Merge (DRM), a simple yet effective approach that leverages Singular Value Decomposition to decompose and coordinate weight matrices into an aligned joint space, where entry-wise merging becomes possible. We showcase the effectiveness of DRM across various settings ranging from smaller encoder-based such as ViT and DeBERTa, encoder-decoder-based such as T5, and larger decoder-based such as Llama3.1-8B. Our experimental results show that DRM outperforms several state-of-the-art merging techniques across full finetuning and low-rank adaptation settings. Moreover, our analysis reveals renormalization as the crucial component for creating a robust and even joint space for merging, significantly contributing to the method's performance.
comment: Code and models are available at https://github.com/yophis/decom-renorm-merge
♻ ☆ NGGAN: Noise Generation GAN Based on the Practical Measurement Dataset for Narrowband Powerline Communications
To effectively process impulse noise for narrowband powerline communications (NB-PLCs) transceivers, capturing comprehensive statistics of nonperiodic asynchronous impulsive noise (APIN) is a critical task. However, existing mathematical noise generative models only capture part of the characteristics of noise. In this study, we propose a novel generative adversarial network (GAN) called noise generation GAN (NGGAN) that learns the complicated characteristics of practically measured noise samples for data synthesis. To closely match the statistics of complicated noise over the NB-PLC systems, we measured the NB-PLC noise via the analog coupling and bandpass filtering circuits of a commercial NB-PLC modem to build a realistic dataset. To train NGGAN, we adhere to the following principles: 1) we design the length of input signals that the NGGAN model can fit to facilitate cyclostationary noise generation; 2) the Wasserstein distance is used as a loss function to enhance the similarity between the generated noise and training data; and 3) to measure the similarity performances of GAN-based models based on the mathematical and practically measured datasets, we conduct both quantitative and qualitative analyses. The training datasets include: 1) a piecewise spectral cyclostationary Gaussian model (PSCGM); 2) a frequency-shift (FRESH) filter; and 3) practical measurements from NB-PLC systems. Simulation results demonstrate that the generated noise samples from the proposed NGGAN are highly close to the real noise samples. The principal component analysis (PCA) scatter plots and Fr\'echet inception distance (FID) analysis have shown that NGGAN outperforms other GAN-based models by generating noise samples with superior fidelity and higher diversity.
comment: 16 pages, 15 figures, 11 tables, and published in IEEE Transactions on Instrumentation and Measurement, 2025
♻ ☆ Towards a Common Framework for Autoformalization
Autoformalization has emerged as a term referring to the automation of formalization - specifically, the formalization of mathematics using interactive theorem provers (proof assistants). Its rapid development has been driven by progress in deep learning, especially large language models (LLMs). More recently, the term has expanded beyond mathematics to describe the broader task of translating informal input into formal logical representations. At the same time, a growing body of research explores using LLMs to translate informal language into formal representations for reasoning, planning, and knowledge representation - often without explicitly referring to this process as autoformalization. As a result, despite addressing similar tasks, the largely independent development of these research areas has limited opportunities for shared methodologies, benchmarks, and theoretical frameworks that could accelerate progress. The goal of this paper is to review - explicit or implicit - instances of what can be considered autoformalization and to propose a unified framework, encouraging cross-pollination between different fields to advance the development of next generation AI systems.
♻ ☆ GroupSHAP-Guided Integration of Financial News Keywords and Technical Indicators for Stock Price Prediction
Recent advances in finance-specific language models such as FinBERT have enabled the quantification of public sentiment into index-based measures, yet compressing diverse linguistic signals into single metrics overlooks contextual nuances and limits interpretability. To address this limitation, explainable AI techniques, particularly SHAP (SHapley Additive Explanations), have been employed to identify influential features. However, SHAP's computational cost grows exponentially with input features, making it impractical for large-scale text-based financial data. This study introduces a GRU-based forecasting framework enhanced with GroupSHAP, which quantifies contributions of semantically related keyword groups rather than individual tokens, substantially reducing computational burden while preserving interpretability. We employed FinBERT to embed news articles from 2015 to 2024, clustered them into coherent semantic groups, and applied GroupSHAP to measure each group's contribution to stock price movements. The resulting group-level SHAP variables across multiple topics were used as input features for the prediction model. Empirical results from one-day-ahead forecasting of the S&P 500 index throughout 2024 demonstrate that our approach achieves a 32.2% reduction in MAE and a 40.5% reduction in RMSE compared with benchmark models without the GroupSHAP mechanism. This research presents the first application of GroupSHAP in news-driven financial forecasting, showing that grouped sentiment representations simultaneously enhance interpretability and predictive performance.
comment: 6 pages
♻ ☆ A method for the systematic generation of graph XAI benchmarks via Weisfeiler-Leman coloring
Graph neural networks have become the de facto model for learning from structured data. However, the decision-making process of GNNs remains opaque to the end user, which undermines their use in safety-critical applications. Several explainable AI techniques for graphs have been developed to address this major issue. Focusing on graph classification, these explainers identify subgraph motifs that explain predictions. Therefore, a robust benchmarking of graph explainers is required to ensure that the produced explanations are of high quality, i.e., aligned with the GNN's decision process. However, current graph-XAI benchmarks are limited to simplistic synthetic datasets or a few real-world tasks curated by domain experts, hindering rigorous and reproducible evaluation, and consequently stalling progress in the field. To overcome these limitations, we propose a method to automate the construction of graph XAI benchmarks from generic graph classification datasets. Our approach leverages the Weisfeiler-Leman color refinement algorithm to efficiently perform approximate subgraph matching and mine class-discriminating motifs, which serve as proxy ground-truth class explanations. At the same time, we ensure that these motifs can be learned by GNNs because their discriminating power aligns with WL expressiveness. This work also introduces the OpenGraphXAI benchmark suite, which consists of 15 ready-made graph-XAI datasets derived by applying our method to real-world molecular classification datasets. The suite is available to the public along with a codebase to generate over 2,000 additional graph-XAI benchmarks. Finally, we present a use case that illustrates how the suite can be used to assess the effectiveness of a selection of popular graph explainers, demonstrating the critical role of a sufficiently large benchmark collection for improving the significance of experimental results.
♻ ☆ Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization
Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify ones crucial for irreversible unlearning. We introduce Disruption Masking, a technique in which we only allow updating weights, where the signs of the unlearning gradient and the retaining gradient are the same. This ensures all updates are non-disruptive. Additionally, we identify the need for normalizing the unlearning gradients, and also confirm the usefulness of meta-learning. We combine these insights into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and validate its effectiveness at preventing the recovery of dangerous capabilities. MUDMAN outperforms the prior TAR method by 40%, setting a new state-of-the-art for robust unlearning.
♻ ☆ Quantum Transformer: Accelerating model inference via quantum linear algebra
Powerful generative artificial intelligence from large language models (LLMs) harnesses extensive computational resources for inference. In this work, we investigate the transformer architecture, a key component of these models, under the lens of fault-tolerant quantum computing. We develop quantum subroutines to construct the building blocks in the transformer, including the self-attention, residual connection with layer normalization, and feed-forward network. As an important subroutine, we show how to efficiently implement the Hadamard product and element-wise functions of matrices on quantum computers. Our algorithm prepares an amplitude encoding of the transformer output, which can be measured for prediction or use in the next layer. We find that the matrix norm of the input sequence plays a dominant role in the quantum complexity. With numerical experiments on open-source LLMs, including for bio-informatics applications, we demonstrate the potential of a quantum speedup for transformer inference in practical regimes.
comment: 45 pages
♻ ☆ Who You Are Matters: Bridging Topics and Social Roles via LLM-Enhanced Logical Recommendation NeurIPS 2025
Recommender systems filter contents/items valuable to users by inferring preferences from user features and historical behaviors. Mainstream approaches follow the learning-to-rank paradigm, which focus on discovering and modeling item topics (e.g., categories), and capturing user preferences on these topics based on historical interactions. However, this paradigm often neglects the modeling of user characteristics and their social roles, which are logical confounders influencing the correlated interest and user preference transition. To bridge this gap, we introduce the user role identification task and the behavioral logic modeling task that aim to explicitly model user roles and learn the logical relations between item topics and user social roles. We show that it is possible to explicitly solve these tasks through an efficient integration framework of Large Language Model (LLM) and recommendation systems, for which we propose TagCF. On the one hand, TagCF exploits the (Multi-modal) LLM's world knowledge and logic inference ability to extract realistic tag-based virtual logic graphs that reveal dynamic and expressive knowledge of users, refining our understanding of user behaviors. On the other hand, TagCF presents empirically effective integration modules that take advantage of the extracted tag-logic information, augmenting the recommendation performance. We conduct both online experiments and offline experiments with industrial and public datasets as verification of TagCF's effectiveness, and we empirically show that the user role modeling strategy is potentially a better choice than the modeling of item topics. Additionally, we provide evidence that the extracted logic graphs are empirically a general and transferable knowledge that can benefit a wide range of recommendation tasks. Our code is available in https://github.com/Code2Q/TagCF.
comment: to be published in NeurIPS 2025
♻ ☆ The AI_INFN Platform: Artificial Intelligence Development in the Cloud
Machine Learning (ML) is profoundly reshaping the way researchers create, implement, and operate data-intensive software. Its adoption, however, introduces notable challenges for computing infrastructures, particularly when it comes to coordinating access to hardware accelerators across development, testing, and production environments. The INFN initiative AI_INFN (Artificial Intelligence at INFN) seeks to promote the use of ML methods across various INFN research scenarios by offering comprehensive technical support, including access to AI-focused computational resources. Leveraging the INFN Cloud ecosystem and cloud-native technologies, the project emphasizes efficient sharing of accelerator hardware while maintaining the breadth of the Institute's research activities. This contribution describes the deployment and commissioning of a Kubernetes-based platform designed to simplify GPU-powered data analysis workflows and enable their scalable execution on heterogeneous distributed resources. By integrating offloading mechanisms through Virtual Kubelet and the InterLink API, the platform allows workflows to span multiple resource providers, from Worldwide LHC Computing Grid sites to high-performance computing centers like CINECA Leonardo. We will present preliminary benchmarks, functional tests, and case studies, demonstrating both performance and integration outcomes.
♻ ☆ Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism
We propose Collab-REC, a multi-agent framework designed to counteract popularity bias and enhance diversity in tourism recommendations. In our setting, three LLM-based agents -- Personalization, Popularity, and Sustainability generate city suggestions from complementary perspectives. A non-LLM moderator then merges and refines these proposals via multi-round negotiation, ensuring each agent's viewpoint is incorporated while penalizing spurious or repeated responses. Experiments on European city queries show that Collab-REC improves diversity and overall relevance compared to a single-agent baseline, surfacing lesser-visited locales that often remain overlooked. This balanced, context-aware approach addresses over-tourism and better aligns with constraints provided by the user, highlighting the promise of multi-stakeholder collaboration in LLM-driven recommender systems.
♻ ☆ Continuous Domain Generalization NeurIPS25
Real-world data distributions often shift continuously across multiple latent factors such as time, geography, and socioeconomic contexts. However, existing domain generalization approaches typically treat domains as discrete or as evolving along a single axis (e.g., time). This oversimplification fails to capture the complex, multidimensional nature of real-world variation. This paper introduces the task of Continuous Domain Generalization (CDG), which aims to generalize predictive models to unseen domains defined by arbitrary combinations of continuous variations. We present a principled framework grounded in geometric and algebraic theories, showing that optimal model parameters across domains lie on a low-dimensional manifold. To model this structure, we propose a Neural Lie Transport Operator (NeuralLio), which enables structure-preserving parameter transitions by enforcing geometric continuity and algebraic consistency. To handle noisy or incomplete domain variation descriptors, we introduce a gating mechanism to suppress irrelevant dimensions and a local chart-based strategy for robust generalization. Extensive experiments on synthetic and real-world datasets, including remote sensing, scientific documents, and traffic forecasting, demonstrate that our method significantly outperforms existing baselines in both generalization accuracy and robustness.
comment: 23 pages, 9 figures. Accepted by NeurIPS25
♻ ☆ GradeSQL: Test-Time Inference with Outcome Reward Models for Text-to-SQL Generation from Large Language Models
Text-to-SQL, the task of translating natural language questions into SQL queries, has significantly advanced with the introduction of Large Language Models (LLMs), broadening database accessibility for a wide range of users. Despite substantial progress in generating valid SQL, current LLMs still struggle with complex queries. To address this limitation, test-time strategies such as Best-of-N (BoN) and Majority Voting (Maj) are often employed, based on the assumption that LLMs can produce correct answers after multiple attempts. However, these methods rely on surface-level heuristics, selecting the syntactically correct query through execution-based BoN (ex-BoN) or the most frequently generated one through Majority Voting. Recently, Outcome Reward Models (ORMs), which assign utility scores to generated outputs based on semantic correctness, have emerged as a promising reinforcement learning approach for improving model alignment. We argue that ORMs could serve as an effective new test-time heuristic, although their application in this context remains largely underexplored. In this work, we propose a unified framework for training ORMs tailored to the Text-to-SQL task and assess their effectiveness as a test-time heuristic within the BoN strategy. We benchmark ORMs against ex-BoN and Maj across the BIRD and Spider datasets, fine-tuning diverse open-source LLMs from the Qwen2, Granite3, and Llama3 families. Results show that ORMs outperform ex-BoN and Maj, achieving execution accuracy gains of +4.33% (BIRD) and +2.10% (Spider) over ex-BoN, and +2.91% (BIRD) and +0.93% (Spider) over Maj. We further demonstrate that finetuning models already aligned with SQL generation, such as OmniSQL, yields superior ORM performance. Additionally, we observe that ORMs achieve competitive results on simple queries and benefit more from an increased number of candidates compared to ex-BoN and Maj.
♻ ☆ Reinforcement Learning Teachers of Test Time Scaling NeurIPS 2025
Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework. Code available at: https://github.com/SakanaAI/RLT
comment: Accepted at NeurIPS 2025
♻ ☆ Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine
Recent studies operationalize self-improvement through coding agents that edit their own codebases. They grow a tree of self-modifications through expansion strategies that favor higher software engineering benchmark performance, assuming that this implies more promising subsequent self-modifications. However, we identify a mismatch between the agent's self-improvement potential (metaproductivity) and its coding benchmark performance, namely the Metaproductivity-Performance Mismatch. Inspired by Huxley's concept of clade, we propose a metric ($\mathrm{CMP}$) that aggregates the benchmark performances of the descendants of an agent as an indicator of its potential for self-improvement. We show that, in our self-improving coding agent development setting, access to the true $\mathrm{CMP}$ is sufficient to simulate how the G\"odel Machine would behave under certain assumptions. We introduce the Huxley-G\"odel Machine (HGM), which, by estimating $\mathrm{CMP}$ and using it as guidance, searches the tree of self-modifications. On SWE-bench Verified and Polyglot, HGM outperforms prior self-improving coding agent development methods while using fewer allocated CPU hours. Last but not least, HGM demonstrates strong transfer to other coding datasets and large language models. The agent optimized by HGM on SWE-bench Verified with GPT-5-mini and evaluated on SWE-bench Lite with GPT-5 achieves human-level performance, matching the best officially checked results of human-engineered coding agents. Our code is publicly available at https://github.com/metauto-ai/HGM.
♻ ☆ Group Interventions on Deep Networks for Causal Discovery in Subsystems
Causal discovery uncovers complex relationships between variables, enhancing predictions, decision-making, and insights into real-world systems, especially in nonlinear multivariate time series. However, most existing methods primarily focus on pairwise cause-effect relationships, overlooking interactions among groups of variables, i.e., subsystems and their collective causal influence. In this study, we introduce gCDMI, a novel multi-group causal discovery method that leverages group-level interventions on trained deep neural networks and employs model invariance testing to infer causal relationships. Our approach involves three key steps. First, we use deep learning to jointly model the structural relationships among groups of all time series. Second, we apply group-wise interventions to the trained model. Finally, we conduct model invariance testing to determine the presence of causal links among variable groups. We evaluate our method on simulated datasets, demonstrating its superior performance in identifying group-level causal relationships compared to existing methods. Additionally, we validate our approach on real-world datasets, including brain networks and climate ecosystems. Our results highlight that applying group-level interventions to deep learning models, combined with invariance testing, can effectively reveal complex causal structures, offering valuable insights for domains such as neuroscience and climate science.
comment: Submitted to IEEE Access. We are working on the revised version
♻ ☆ Many LLMs Are More Utilitarian Than One NeurIPS 2025
Moral judgment is integral to large language models' (LLMs) social reasoning. As multi-agent systems gain prominence, it becomes crucial to understand how LLMs function when collaborating compared to operating as individual agents. In human moral judgment, group deliberation leads to a Utilitarian Boost: a tendency to endorse norm violations that inflict harm but maximize benefits for the greatest number of people. We study whether a similar dynamic emerges in multi-agent LLM systems. We test six models on well-established sets of moral dilemmas across two conditions: (1) Solo, where models reason independently, and (2) Group, where they engage in multi-turn discussions in pairs or triads. In personal dilemmas, where agents decide whether to directly harm an individual for the benefit of others, all models rated moral violations as more acceptable when part of a group, demonstrating a Utilitarian Boost similar to that observed in humans. However, the mechanism for the Boost in LLMs differed: While humans in groups become more utilitarian due to heightened sensitivity to decision outcomes, LLM groups showed either reduced sensitivity to norms or enhanced impartiality. We report model differences in when and how strongly the Boost manifests. We also discuss prompt and agent compositions that enhance or mitigate the effect. We end with a discussion of the implications for AI alignment, multi-agent design, and artificial moral reasoning. Code available at: https://github.com/baltaci-r/MoralAgents
comment: Accepted to the Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models NeurIPS 2025
Foundation models demand advanced data processing for their vast, multimodal datasets. However, traditional frameworks struggle with the unique complexities of multimodal data. In response, we present Data-Juicer 2.0, a data processing system backed by 100+ data processing operators spanning text, image, video, and audio modalities, supporting more critical tasks including data analysis, synthesis, annotation, and foundation model post-training. With seamless compatibility and dedicated optimization for popular dataset hubs like Hugging Face and computing engines like Ray, it improves upon its predecessor in terms of usability, efficiency, and programmability. It features an easily accessible user interface layer that supports decoupled Python interactions, RESTful APIs, and conversational commands. Its new runtime layer offers adaptive execution across diverse scales and environments, abstracting away system complexities. Extensive empirical evaluations demonstrate Data-Juicer 2.0's remarkable performance and scalability, highlighting its capability to efficiently process TB-level data with 10k+ CPU cores. The system is publicly available and has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI. We actively maintain the system and share practical insights to foster research and applications of next-generation foundation models.
comment: Accepted by NeurIPS 2025 (Spotlight). 43 pages, 16 figures, 4 tables
♻ ☆ Securing AI Agent Execution
Large Language Models (LLMs) have evolved into AI agents that interact with external tools and environments to perform complex tasks. The Model Context Protocol (MCP) has become the de facto standard for connecting agents with such resources, but security has lagged behind: thousands of MCP servers execute with unrestricted access to host systems, creating a broad attack surface. In this paper, we introduce AgentBound, the first access control framework for MCP servers. AgentBound combines a declarative policy mechanism, inspired by the Android permission model, with a policy enforcement engine that contains malicious behavior without requiring MCP server modifications. We build a dataset containing the 296 most popular MCP servers, and show that access control policies can be generated automatically from source code with 80.9% accuracy. We also show that AgentBound blocks the majority of security threats in several malicious MCP servers, and that policy enforcement engine introduces negligible overhead. Our contributions provide developers and project managers with a practical foundation for securing MCP servers while maintaining productivity, enabling researchers and tool builders to explore new directions for declarative access control and MCP security.
♻ ☆ Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles
Road traffic accidents remain a significant global concern, with human error, particularly distracted and impaired driving, among the leading causes. This study introduces a novel driver behaviour classification system that uses external observation techniques to detect indicators of distraction and impairment. The proposed framework employs advanced computer vision methodologies, including real-time object tracking, lateral displacement analysis, and lane position monitoring. The system identifies unsafe driving behaviours such as excessive lateral movement and erratic trajectory patterns by implementing the YOLO object detection model and custom lane estimation algorithms. Unlike systems reliant on inter-vehicular communication, this vision-based approach enables behavioural analysis of non-connected vehicles. Experimental evaluations on diverse video datasets demonstrate the framework's reliability and adaptability across varying road and environmental conditions.
♻ ☆ SATURN: SAT-based Reinforcement Learning to Unleash Language Model Reasoning
How to design reinforcement learning (RL) tasks that effectively unleash the reasoning capability of large language models (LLMs) remains an open question. Existing RL tasks (e.g., math, programming, and constructing reasoning tasks) suffer from three key limitations: (1) Scalability. They rely heavily on human annotation or expensive LLM synthesis to generate sufficient training data. (2) Verifiability. LLMs' outputs are hard to verify automatically and reliably. (3) Controllable Difficulty. Most tasks lack fine-grained difficulty control, making it hard to train LLMs to develop reasoning ability from easy to hard. To address these limitations, we propose Saturn, a SAT-based RL framework that uses Boolean Satisfiability (SAT) problems to train and evaluate LLMs reasoning. Saturn enables scalable task construction, rule-based verification, and precise difficulty control. Saturn designs a curriculum learning pipeline that continuously improves LLMs' reasoning capability by constructing SAT tasks of increasing difficulty and training LLMs from easy to hard. To ensure stable training, we design a principled mechanism to control difficulty transitions. We introduce Saturn-2.6k, a dataset of 2,660 SAT problems with varying difficulty. It supports the evaluation of how LLM reasoning changes with problem difficulty. We apply Saturn to DeepSeek-R1-Distill-Qwen and obtain Saturn-1.5B and Saturn-7B. We achieve several notable results: (1) On SAT problems, Saturn-1.5B and Saturn-7B achieve average pass@3 improvements of +14.0 and +28.1, respectively. (2) On math and programming tasks, Saturn-1.5B and Saturn-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g., AIME, LiveCodeBench). (3) Compared to the state-of-the-art (SOTA) approach in constructing RL tasks, Saturn achieves further improvements of +8.8%. We release the source code, data, and models to support future research.
♻ ☆ Brain-inspired Computational Intelligence via Predictive Coding
Artificial intelligence (AI) is rapidly becoming one of the key technologies of this century. The majority of results in AI thus far have been achieved using deep neural networks trained with a learning algorithm called error backpropagation, always considered biologically implausible. To this end, recent works have studied learning algorithms for deep neural networks inspired by the neurosciences. One such theory, called predictive coding (PC), has shown promising properties that make it potentially valuable for the machine learning community: it can model information processing in different areas of the brain, can be used in control and robotics, has a solid mathematical foundation in variational inference, and performs its computations asynchronously. Inspired by such properties, works that propose novel PC-like algorithms are starting to be present in multiple sub-fields of machine learning and AI at large. Here, we survey such efforts by first providing a broad overview of the history of PC to provide common ground for the understanding of the recent developments, then by describing current efforts and results, and concluding with a large discussion of possible implications and ways forward.
comment: 26 Pages, 9 Figures
♻ ☆ HyperMARL: Adaptive Hypernetworks for Multi-Agent RL NeurIPS 2025
Adaptive cooperation in multi-agent reinforcement learning (MARL) requires policies to express homogeneous, specialised, or mixed behaviours, yet achieving this adaptivity remains a critical challenge. While parameter sharing (PS) is standard for efficient learning, it notoriously suppresses the behavioural diversity required for specialisation. This failure is largely due to cross-agent gradient interference, a problem we find is surprisingly exacerbated by the common practice of coupling agent IDs with observations. Existing remedies typically add complexity through altered objectives, manual preset diversity levels, or sequential updates -- raising a fundamental question: can shared policies adapt without these intricacies? We propose a solution built on a key insight: an agent-conditioned hypernetwork can generate agent-specific parameters and decouple observation- and agent-conditioned gradients, directly countering the interference from coupling agent IDs with observations. Our resulting method, HyperMARL, avoids the complexities of prior work and empirically reduces policy gradient variance. Across diverse MARL benchmarks (22 scenarios, up to 30 agents), HyperMARL achieves performance competitive with six key baselines while preserving behavioural diversity comparable to non-parameter sharing methods, establishing it as a versatile and principled approach for adaptive MARL. The code is publicly available at https://github.com/KaleabTessera/HyperMARL.
comment: To appear at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025). A preliminary version of this work was presented at the CoCoMARL workshop, RLC 2025
♻ ☆ Reliable Evaluation and Benchmarks for Statement Autoformalization EMNLP 2025
Evaluating statement autoformalization, translating natural language mathematics into formal languages like Lean 4, remains a significant challenge, with few metrics, datasets, and standards to robustly measure progress. In this work, we present a comprehensive approach combining improved metrics, robust benchmarks, and systematic evaluation, to fill this gap. First, we introduce BEq+, an automated metric that correlates strongly with human judgment, along with ProofNetVerif, a new dataset for assessing the quality of evaluation metrics, containing 3,752 annotated examples. Second, we develop two new autoformalization benchmarks: ProofNet#, a corrected version of ProofNet, and RLM25, with 619 new pairs of research-level mathematics from six formalization projects. Through systematic experimentation across these benchmarks, we find that current techniques can achieve up to 45.1% accuracy on undergraduate mathematics but struggle with research-level content without proper context. Our work establishes a reliable foundation for evaluating and advancing autoformalization systems.
comment: Accepted to EMNLP 2025. New benchmarks released, see https://github.com/augustepoiroux/RLMEval , https://huggingface.co/datasets/PAug/ProofNetSharp , and https://huggingface.co/datasets/PAug/ProofNetVerif . For code, see https://github.com/augustepoiroux/LeanInteract
♻ ☆ OmegAMP: Targeted AMP Discovery through Biologically Informed Generation
Deep learning-based antimicrobial peptide (AMP) discovery faces critical challenges such as limited controllability, lack of representations that efficiently model antimicrobial properties, and low experimental hit rates. To address these challenges, we introduce OmegAMP, a framework designed for reliable AMP generation with increased controllability. Its diffusion-based generative model leverages a novel conditioning mechanism to achieve fine-grained control over desired physicochemical properties and to direct generation towards specific activity profiles, including species-specific effectiveness. This is further enhanced by a biologically informed encoding space that significantly improves overall generative performance. Complementing these generative capabilities, OmegAMP leverages a novel synthetic data augmentation strategy to train classifiers for AMP filtering, drastically reducing false positive rates and thereby increasing the likelihood of experimental success. Our in silico experiments demonstrate that OmegAMP delivers state-of-the-art performance across key stages of the AMP discovery pipeline, enabling us to achieve an unprecedented success rate in wet lab experiments. We tested 25 candidate peptides, 24 of them (96%) demonstrated antimicrobial activity, proving effective even against multi-drug resistant strains. Our findings underscore OmegAMP's potential to significantly advance computational frameworks in the fight against antimicrobial resistance.
♻ ☆ Transformers from Compressed Representations
Compressed file formats are the corner stone of efficient data storage and transmission, yet their potential for representation learning remains largely underexplored. We introduce TEMPEST (TransformErs froM comPressed rEpreSenTations), a method that exploits the inherent byte-stream structure of compressed files to design an effective tokenization and encoding strategy. By leveraging this compact encoding, a standard transformer can directly learn semantic representations from compressed data streams, bypassing the need for raw byte-level processing or full media decoding. Our proposal substantially reduces the number of tokens required for semantic classification, thereby lowering both computational complexity and memory usage. Through extensive experiments across diverse datasets, coding schemes, and modalities, we show that TEMPEST achieves accuracy competitive wit the state-of-the-art while delivering efficiency gains in memory and compute.
♻ ☆ When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails
Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex reasoning tasks but remain vulnerable to severe safety risks, including harmful content generation and jailbreak attacks. Existing mitigation strategies rely on injecting heuristic safety signals during training, which often suppress reasoning ability and fail to resolve the safety-reasoning trade-off. To systematically investigate this issue, we analyze the reasoning trajectories of diverse LRMs and uncover a phenomenon we term Self-Jailbreak, where models override their own risk assessments and justify responding to unsafe prompts. This finding reveals that LRMs inherently possess the ability to reject unsafe queries, but this ability is compromised, resulting in harmful outputs. Building on these insights, we propose the Chain-of-Guardrail (CoG), a training framework that recomposes or backtracks unsafe reasoning steps, steering the model back onto safe trajectories while preserving valid reasoning chains. Extensive experiments across multiple reasoning and safety benchmarks demonstrate that CoG substantially improves the safety of current LRMs while preserving comparable reasoning ability, significantly outperforming prior methods that suffer from severe safety-reasoning trade-offs.
comment: First two authors contributed equally. The main text is 10 pages, with an appendix of 19 pages. The paper contains 18 figures and 16 tables
♻ ☆ Improving Robustness of AlphaZero Algorithms to Test-Time Environment Changes
The AlphaZero framework provides a standard way of combining Monte Carlo planning with prior knowledge provided by a previously trained policy-value neural network. AlphaZero usually assumes that the environment on which the neural network was trained will not change at test time, which constrains its applicability. In this paper, we analyze the problem of deploying AlphaZero agents in potentially changed test environments and demonstrate how the combination of simple modifications to the standard framework can significantly boost performance, even in settings with a low planning budget available. The code is publicly available on GitHub.
comment: Presented at the 37th Benelux Conference on Artificial Intelligence and the 34th Belgian Dutch Conference on Machine Learning (BNAIC/BeNeLearn 2025)
♻ ☆ TabArena: A Living Benchmark for Machine Learning on Tabular Data NeurIPS 2025
With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.
comment: Accepted (spotlight) at NeurIPS 2025 Datasets and Benchmarks Track. v3: NeurIPS camera-ready version. v2: fixed author list. 51 pages. Code available at https://tabarena.ai/code; examples at https://tabarena.ai/code-examples; dataset curation at https://tabarena.ai/data-tabular-ml-iid-study and https://tabarena.ai/dataset-curation
♻ ☆ Probabilistic Kernel Function for Fast Angle Testing
In this paper, we study the angle testing problem in the context of similarity search in high-dimensional Euclidean spaces and propose two projection-based probabilistic kernel functions, one designed for angle comparison and the other for angle thresholding. Unlike existing approaches that rely on random projection vectors drawn from Gaussian distributions, our approach leverages reference angles and employs a deterministic structure for the projection vectors. Notably, our kernel functions do not require asymptotic assumptions, such as the number of projection vectors tending to infinity, and can be both theoretically and experimentally shown to outperform Gaussian-distribution-based kernel functions. We apply the proposed kernel function to Approximate Nearest Neighbor Search (ANNS) and demonstrate that our approach achieves a 2.5X ~ 3X higher query-per-second (QPS) throughput compared to the widely-used graph-based search algorithm HNSW.
♻ ☆ Steiner Traveling Salesman Problem with Quantum Annealing GECCO 2025
The Steiner Traveling Salesman Problem (STSP) is a variant of the classical Traveling Salesman Problem. The STSP involves incorporating steiner nodes, which are extra nodes not originally part of the required visit set but that can be added to the route to enhance the overall solution and minimize the total travel cost. Given the NP-hard nature of the STSP, we propose a quantum approach to address it. Specifically, we employ quantum annealing using D-Wave's hardware to explore its potential for solving this problem. To enhance computational feasibility, we develop a preprocessing method that effectively reduces the network size. Our experimental results demonstrate that this reduction technique significantly decreases the problem complexity, making the Quadratic Unconstrained Binary Optimization formulation, the standard input for quantum annealers, better suited for existing quantum hardware. Furthermore, the results highlight the potential of quantum annealing as a promising and innovative approach for solving the STSP.
comment: 7 pages, 1 figure, 6 tables. Paper accepted in The Genetic and Evolutionary Computation Conference (GECCO 2025)
♻ ☆ SNN-Based Online Learning of Concepts and Action Laws in an Open World
We present the architecture of a fully autonomous, bio-inspired cognitive agent built around a spiking neural network (SNN) implementing the agent's semantic memory. This agent explores its universe and learns concepts of objects/situations and of its own actions in a one-shot manner. While object/situation concepts are unary, action concepts are triples made up of an initial situation, a motor activity, and an outcome. They embody the agent's knowledge of its universe's action laws. Both kinds of concepts have different degrees of generality. To make decisions the agent queries its semantic memory for the expected outcomes of envisaged actions and chooses the action to take on the basis of these predictions. Our experiments show that the agent handles new situations by appealing to previously learned general concepts and rapidly modifies its concepts to adapt to environment changes.
♻ ☆ Learning-Augmented Online Bipartite Fractional Matching NeurIPS 2025
Online bipartite matching is a fundamental problem in online optimization, extensively studied both in its integral and fractional forms due to its theoretical significance and practical applications, such as online advertising and resource allocation. Motivated by recent progress in learning-augmented algorithms, we study online bipartite fractional matching when the algorithm is given advice in the form of a suggested matching in each iteration. We develop algorithms for both the vertex-weighted and unweighted variants that provably dominate the naive "coin flip" strategy of randomly choosing between the advice-following and advice-free algorithms. Moreover, our algorithm for the vertex-weighted setting extends to the AdWords problem under the small bids assumption, yielding a significant improvement over the seminal work of Mahdian, Nazerzadeh, and Saberi (EC 2007, TALG 2012). Complementing our positive results, we establish a hardness bound on the robustness-consistency tradeoff that is attainable by any algorithm. We empirically validate our algorithms through experiments on synthetic and real-world data.
comment: To appear in NeurIPS 2025. Full version
♻ ☆ Differential Mamba AACL 2025
Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available: https://github.com/NadavSc/Diff-Mamba
comment: AACL 2025. We provide the code at https://github.com/NadavSc/Diff-Mamba
♻ ☆ Redistributing Rewards Across Time and Agents for Multi-Agent Reinforcement Learning
Credit assignmen, disentangling each agent's contribution to a shared reward, is a critical challenge in cooperative multi-agent reinforcement learning (MARL). To be effective, credit assignment methods must preserve the environment's optimal policy. Some recent approaches attempt this by enforcing return equivalence, where the sum of distributed rewards must equal the team reward. However, their guarantees are conditional on a learned model's regression accuracy, making them unreliable in practice. We introduce Temporal-Agent Reward Redistribution (TAR$^2$), an approach that decouples credit modeling from this constraint. A neural network learns unnormalized contribution scores, while a separate, deterministic normalization step enforces return equivalence by construction. We demonstrate that this method is equivalent to a valid Potential-Based Reward Shaping (PBRS), which guarantees the optimal policy is preserved regardless of model accuracy. Empirically, on challenging SMACLite and Google Research Football (GRF) benchmarks, TAR$^2$ accelerates learning and achieves higher final performance than strong baselines. These results establish our method as an effective solution for the agent-temporal credit assignment problem.
comment: 16 pages, 4 figures, 4 tables
♻ ☆ A PBN-RL-XAI Framework for Discovering a "Hit-and-Run" Therapeutic Strategy in Melanoma
Innate resistance to anti-PD-1 immunotherapy remains a major clinical challenge in metastatic melanoma, with the underlying molecular networks being poorly understood. To address this, we constructed a dynamic Probabilistic Boolean Network model using transcriptomic data from patient tumor biopsies to elucidate the regulatory logic governing therapy response. We then employed a reinforcement learning agent to systematically discover optimal, multi-step therapeutic interventions and used explainable artificial intelligence to mechanistically interpret the agent's control policy. The analysis revealed that a precisely timed, 4-step temporary inhibition of the lysyl oxidase like 2 protein (LOXL2) was the most effective strategy. Our explainable analysis showed that this ''hit-and-run" intervention is sufficient to erase the molecular signature driving resistance, allowing the network to self-correct without requiring sustained intervention. This study presents a novel, time-dependent therapeutic hypothesis for overcoming immunotherapy resistance and provides a powerful computational framework for identifying non-obvious intervention protocols in complex biological systems.
comment: 7 pages, 7 figures. Accepted by the IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2025. Code is available at https://github.com/Liu-Zhonglin/pbn-melanoma-project
♻ ☆ Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner NeurIPS 2025
The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such as learning rate grafting and stale preconditioning to achieve performance at-scale. These heuristics increase algorithmic complexity, necessitate further hyperparameter tuning, and lack theoretical justification. This paper investigates these heuristics from the angle of Frobenius norm approximation to full-matrix Adam and decouples the preconditioner's eigenvalues and eigenbasis updates. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner's eigenvalues and how correcting the eigenvalues directly eliminates the need for learning rate grafting. To manage the error induced by infrequent eigenbasis computations, we propose an adaptive criterion for determining the eigenbasis computation frequency motivated by terminating a warm-started QR algorithm. This criterion decouples the update frequency of different preconditioner matrices and enables us to investigate the impact of approximation error on convergence. These practical techniques offer a principled angle towards removing Shampoo's heuristics and developing improved Kronecker-factorization-based training algorithms.
comment: NeurIPS 2025
♻ ☆ Expand and Compress: Exploring Tuning Principles for Continual Spatio-Temporal Graph Forecasting ICLR 2025
The widespread deployment of sensing devices leads to a surge in data for spatio-temporal forecasting applications such as traffic flow, air quality, and wind energy. Although spatio-temporal graph neural networks have achieved success in modeling various static spatio-temporal forecasting scenarios, real-world spatio-temporal data are typically received in a streaming manner, and the network continuously expands with the installation of new sensors. Thus, spatio-temporal forecasting in streaming scenarios faces dual challenges: the inefficiency of retraining models over newly arrived data and the detrimental effects of catastrophic forgetting over long-term history. To address these challenges, we propose a novel prompt tuning-based continuous forecasting method, following two fundamental tuning principles guided by empirical and theoretical analysis: expand and compress, which effectively resolve the aforementioned problems with lightweight tuning parameters. Specifically, we integrate the base spatio-temporal graph neural network with a continuous prompt pool, utilizing stored prompts (i.e., few learnable parameters) in memory, and jointly optimize them with the base spatio-temporal graph neural network. This method ensures that the model sequentially learns from the spatio-temporal data stream to accomplish tasks for corresponding periods. Extensive experimental results on multiple real-world datasets demonstrate the multi-faceted superiority of our method over the state-of-the-art baselines, including effectiveness, efficiency, universality, etc.
comment: Accepted by ICLR 2025
♻ ☆ UGM2N: An Unsupervised and Generalizable Mesh Movement Network via M-Uniform Loss NeurIPS 2025
Partial differential equations (PDEs) form the mathematical foundation for modeling physical systems in science and engineering, where numerical solutions demand rigorous accuracy-efficiency tradeoffs. Mesh movement techniques address this challenge by dynamically relocating mesh nodes to rapidly-varying regions, enhancing both simulation accuracy and computational efficiency. However, traditional approaches suffer from high computational complexity and geometric inflexibility, limiting their applicability, and existing supervised learning-based approaches face challenges in zero-shot generalization across diverse PDEs and mesh topologies.In this paper, we present an Unsupervised and Generalizable Mesh Movement Network (UGM2N). We first introduce unsupervised mesh adaptation through localized geometric feature learning, eliminating the dependency on pre-adapted meshes. We then develop a physics-constrained loss function, M-Uniform loss, that enforces mesh equidistribution at the nodal level.Experimental results demonstrate that the proposed network exhibits equation-agnostic generalization and geometric independence in efficient mesh adaptation. It demonstrates consistent superiority over existing methods, including robust performance across diverse PDEs and mesh geometries, scalability to multi-scale resolutions and guaranteed error reduction without mesh tangling.
comment: Accepted as a Poster at NeurIPS 2025
♻ ☆ Learning with Calibration: Exploring Test-Time Computing of Spatio-Temporal Forecasting NeurIPS 2025
Spatio-temporal forecasting is crucial in many domains, such as transportation, meteorology, and energy. However, real-world scenarios frequently present challenges such as signal anomalies, noise, and distributional shifts. Existing solutions primarily enhance robustness by modifying network architectures or training procedures. Nevertheless, these approaches are computationally intensive and resource-demanding, especially for large-scale applications. In this paper, we explore a novel test-time computing paradigm, namely learning with calibration, ST-TTC, for spatio-temporal forecasting. Through learning with calibration, we aim to capture periodic structural biases arising from non-stationarity during the testing phase and perform real-time bias correction on predictions to improve accuracy. Specifically, we first introduce a spectral-domain calibrator with phase-amplitude modulation to mitigate periodic shift and then propose a flash updating mechanism with a streaming memory queue for efficient test-time computation. ST-TTC effectively bypasses complex training-stage techniques, offering an efficient and generalizable paradigm. Extensive experiments on real-world datasets demonstrate the effectiveness, universality, flexibility and efficiency of our proposed method.
comment: Accepted by NeurIPS 2025 (Spotlight)
♻ ☆ DGTRSD & DGTRS-CLIP: A Dual-Granularity Remote Sensing Image-Text Dataset and Vision Language Foundation Model for Alignment
Vision Language Foundation Models based on CLIP architecture for remote sensing primarily rely on short text captions, which often result in incomplete semantic representations. Although longer captions convey richer information, existing models struggle to process them effectively because of limited text-encoding capacity, and there remains a shortage of resources that align remote sensing images with both short text and long text captions. To address this gap, we introduce DGTRSD, a dual-granularity remote sensing image-text dataset, where each image is paired with both a short text caption and a long text description, providing a solid foundation for dual-granularity semantic modeling. Based on this, we further propose DGTRS-CLIP, a dual-granularity curriculum learning framework that combines short text and long text supervision to achieve dual-granularity semantic alignment. Extensive experiments on four typical zero-shot tasks: long text cross-modal retrieval, short text cross-modal retrieval, image classification, and semantic localization demonstrate that DGTRS-CLIP consistently outperforms existing methods across all tasks. The code has been open-sourced and is available at https://github.com/MitsuiChen14/DGTRS.
♻ ☆ Revisiting Service Level Objectives and System Level Metrics in Large Language Model Serving
User experience is a critical factor Large Language Model (LLM) serving systems must consider, where service level objectives (SLOs) considering the experience of individual requests and system level metrics (SLMs) considering the overall system performance are two key performance measures. However, we observe two notable issues in existing metrics: 1) manually delaying the delivery of some tokens can improve SLOs, and 2) actively abandoning requests that do not meet SLOs can improve SLMs, both of which are counterintuitive. In this paper, we revisit SLOs and SLMs in LLM serving, and propose a new SLO that aligns with user experience. Based on the SLO, we propose a comprehensive metric framework called smooth goodput, which integrates SLOs and SLMs to reflect the nature of user experience in LLM serving. Through this unified framework, we reassess the performance of different LLM serving systems under multiple workloads. Evaluation results show that our metric framework provides a more comprehensive view of token delivery and request processing, and effectively captures the optimal point of user experience and system performance with different serving strategies.
♻ ☆ MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs
The widespread adoption of Large Language Models (LLMs) raises critical concerns about the factual accuracy of their outputs, especially in high-risk domains such as biomedicine, law, and education. Existing evaluation methods for short texts often fail on long-form content due to complex reasoning chains, intertwined perspectives, and cumulative information. To address this, we propose a systematic approach integrating large-scale long-form datasets, multi-agent verification mechanisms, and weighted evaluation metrics. We construct LongHalluQA, a Chinese long-form factuality dataset; and develop MAD-Fact, a debate-based multi-agent verification system. We introduce a fact importance hierarchy to capture the varying significance of claims in long-form texts. Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency, while domestic models excel on Chinese content. Our work provides a structured framework for evaluating and enhancing factual reliability in long-form LLM outputs, guiding their safe deployment in sensitive domains.
comment: The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: {10.1007/s11704-025-51369-x}
♻ ☆ A Survey of AI Scientists: Surveying the automatic Scientists and Research
Artificial intelligence is undergoing a profound transition from a computational instrument to an autonomous originator of scientific knowledge. This emerging paradigm, the AI scientist, is architected to emulate the complete scientific workflow-from initial hypothesis generation to the final synthesis of publishable findings-thereby promising to fundamentally reshape the pace and scale of discovery. However, the rapid and unstructured proliferation of these systems has created a fragmented research landscape, obscuring overarching methodological principles and developmental trends. This survey provides a systematic and comprehensive synthesis of this domain by introducing a unified, six-stage methodological framework that deconstructs the end-to-end scientific process into: Literature Review, Idea Generation, Experimental Preparation, Experimental Execution, Scientific Writing, and Paper Generation. Through this analytical lens, we chart the field's evolution from early Foundational Modules (2022-2023) to integrated Closed-Loop Systems (2024), and finally to the current frontier of Scalability, Impact, and Human-AI Collaboration (2025-present). By rigorously synthesizing these developments, this survey not only clarifies the current state of autonomous science but also provides a critical roadmap for overcoming remaining challenges in robustness and governance, ultimately guiding the next generation of systems toward becoming trustworthy and indispensable partners in human scientific inquiry.
comment: 28 pages, 9 figures, 1 table
♻ ☆ LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities
Generative models are spearheading recent progress in deep learning, showcasing strong promise for trajectory sampling in dynamical systems as well. However, whereas latent space modeling paradigms have transformed image and video generation, similar approaches are more difficult for most dynamical systems. Such systems -- from chemical molecule structures to collective human behavior -- are described by interactions of entities, making them inherently linked to connectivity patterns, entity conservation, and the traceability of entities over time. Our approach, LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked Entities), bridges the gap between: (1) keeping the traceability of individual entities in a latent system representation, and (2) leveraging the efficiency and scalability of recent advances in image and video generation, where pre-trained encoder and decoder enable generative modeling directly in latent space. The core idea of LaM-SLidE is the introduction of identifier representations (IDs) that enable the retrieval of entity properties and entity composition from latent system representations, thus fostering traceability. Experimentally, across different domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy, and generalizability. Code is available at https://github.com/ml-jku/LaM-SLidE .
comment: Project page: https://ml-jku.github.io/LaM-SLidE/
♻ ☆ Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both the pass@1 and pass@k . Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.
♻ ☆ Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning
The heterogeneity of multimodal data leads to inconsistencies and imbalance, allowing a dominant modality to steer gradient updates. Existing solutions mainly focus on optimization- or data-based strategies but rarely exploit the information inherent in multimodal imbalance or conduct its quantitative analysis. To address this gap, we propose a novel quantitative analysis framework for Multimodal Imbalance and design a sample-level adaptive loss function. We define the Modality Gap as the Softmax score difference between modalities for the correct class and model its distribution using a bimodal Gaussian Mixture Model(GMM), representing balanced and imbalanced samples. Using Bayes' theorem, we estimate each sample's posterior probability of belonging to these two groups. Based on this, our adaptive loss (1) minimizes the overall Modality Gap, (2) aligns imbalanced samples with balanced ones, and (3) adaptively penalizes each according to its imbalance degree. A two-stage training strategy-warm-up and adaptive phases,yields state-of-the-art performance on CREMA-D (80.65%), AVE (70.40%), and KineticSound (72.42%). Fine-tuning with high-quality samples identified by the GMM further improves results, highlighting their value for effective multimodal fusion.
♻ ☆ The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.
♻ ☆ NeuroPathNet: Dynamic Path Trajectory Learning for Brain Functional Connectivity Analysis
Understanding the evolution of brain functional networks over time is of great significance for the analysis of cognitive mechanisms and the diagnosis of neurological diseases. Existing methods often have difficulty in capturing the temporal evolution characteristics of connections between specific functional communities. To this end, this paper proposes a new path-level trajectory modeling framework (NeuroPathNet) to characterize the dynamic behavior of connection pathways between brain functional partitions. Based on medically supported static partitioning schemes (such as Yeo and Smith ICA), we extract the time series of connection strengths between each pair of functional partitions and model them using a temporal neural network. We validate the model performance on three public functional Magnetic Resonance Imaging (fMRI) datasets, and the results show that it outperforms existing mainstream methods in multiple indicators. This study can promote the development of dynamic graph learning methods for brain network analysis, and provide possible clinical applications for the diagnosis of neurological diseases.
♻ ☆ Artificial Intelligence for Direct Prediction of Molecular Dynamics Across Chemical Space
Molecular dynamics (MD) is a powerful tool for exploring the behavior of atomistic systems, but its reliance on sequential numerical integration limits simulation efficiency. We present a novel neural network architecture, MDtrajNet, and a pre-trained foundational model, MDtrajNet-1, that directly generates MD trajectories across chemical space, bypassing force calculations and integration. This approach accelerates simulations by up to two orders of magnitude compared to traditional MD, even those enhanced by machine-learning interatomic potentials. MDtrajNet combines equivariant neural networks with a transformer-based architecture to achieve strong accuracy and transferability in predicting long-time trajectories. Remarkably, the errors of the trajectories generated by MDtrajNet-1 for various known and unseen molecular systems are close to those of the conventional ab initio MD. The architecture's flexible design supports diverse application scenarios, including different statistical ensembles, boundary conditions, and interaction types. By overcoming the intrinsic speed barrier of conventional MD, MDtrajNet opens new frontiers in efficient and scalable atomistic simulations.
♻ ☆ InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts
Understanding infographic charts with design-driven visual elements (e.g., pictograms, icons) requires both visual recognition and reasoning, posing challenges for multimodal large language models (MLLMs). However, existing visual-question answering benchmarks fall short in evaluating these capabilities of MLLMs due to the lack of paired plain charts and visual-element-based questions. To bridge this gap, we introduce InfoChartQA, a benchmark for evaluating MLLMs on infographic chart understanding. It includes 5,642 pairs of infographic and plain charts, each sharing the same underlying data but differing in visual presentations. We further design visual-element-based questions to capture their unique visual designs and communicative intent. Evaluation of 20 MLLMs reveals a substantial performance decline on infographic charts, particularly for visual-element-based questions related to metaphors. The paired infographic and plain charts enable fine-grained error analysis and ablation studies, which highlight new opportunities for advancing MLLMs in infographic chart understanding. We release InfoChartQA at https://github.com/CoolDawnAnt/InfoChartQA.
♻ ☆ TraveLLM: Could you plan my new public transit route in face of a network disruption? SC 2025
Existing navigation systems often fail during urban disruptions, struggling to incorporate real-time events and complex user constraints, such as avoiding specific areas. We address this gap with TraveLLM, a system using Large Language Models (LLMs) for disruption-aware public transit routing. We leverage LLMs' reasoning capabilities to directly process multimodal user queries combining natural language requests (origin, destination, preferences, disruption info) with map data (e.g., subway, bus, bike-share). To evaluate this approach, we design challenging test scenarios reflecting real-world disruptions like weather events, emergencies, and dynamic service availability. We benchmark the performance of state-of-the-art LLMs, including GPT-4, Claude 3, and Gemini, on generating accurate travel plans. Our experiments demonstrate that LLMs, notably GPT-4, can effectively generate viable and context-aware navigation plans under these demanding conditions. These findings suggest a promising direction for using LLMs to build more flexible and intelligent navigation systems capable of handling dynamic disruptions and diverse user needs.
comment: Accepted to ITSC 2025
♻ ☆ Pentest-R1: Towards Autonomous Penetration Testing Reasoning Optimized via Two-Stage Reinforcement Learning
Automating penetration testing is crucial for enhancing cybersecurity, yet current Large Language Models (LLMs) face significant limitations in this domain, including poor error handling, inefficient reasoning, and an inability to perform complex end-to-end tasks autonomously. To address these challenges, we introduce Pentest-R1, a novel framework designed to optimize LLM reasoning capabilities for this task through a two-stage reinforcement learning pipeline. We first construct a dataset of over 500 real-world, multi-step walkthroughs, which Pentest-R1 leverages for offline reinforcement learning (RL) to instill foundational attack logic. Subsequently, the LLM is fine-tuned via online RL in an interactive Capture The Flag (CTF) environment, where it learns directly from environmental feedback to develop robust error self-correction and adaptive strategies. Our extensive experiments on the Cybench and AutoPenBench benchmarks demonstrate the framework's effectiveness. On AutoPenBench, Pentest-R1 achieves a 24.2\% success rate, surpassing most state-of-the-art models and ranking second only to Gemini 2.5 Flash. On Cybench, it attains a 15.0\% success rate in unguided tasks, establishing a new state-of-the-art for open-source LLMs and matching the performance of top proprietary models. Ablation studies confirm that the synergy of both training stages is critical to its success.
♻ ☆ What Work is AI Actually Doing? Uncovering the Drivers of Generative AI Adoption
Purpose: The rapid integration of artificial intelligence (AI) systems like ChatGPT, Claude AI, etc., has a deep impact on how work is done. Predicting how AI will reshape work requires understanding not just its capabilities, but how it is actually being adopted. This study investigates which intrinsic task characteristics drive users' decisions to delegate work to AI systems. Methodology: This study utilizes the Anthropic Economic Index dataset of four million Claude AI interactions mapped to O*NET tasks. We systematically scored each task across seven key dimensions: Routine, Cognitive, Social Intelligence, Creativity, Domain Knowledge, Complexity, and Decision Making using 35 parameters. We then employed multivariate techniques to identify latent task archetypes and analyzed their relationship with AI usage. Findings: Tasks requiring high creativity, complexity, and cognitive demand, but low routineness, attracted the most AI engagement. Furthermore, we identified three task archetypes: Dynamic Problem Solving, Procedural & Analytical Work, and Standardized Operational Tasks, demonstrating that AI applicability is best predicted by a combination of task characteristics, over individual factors. Our analysis revealed highly concentrated AI usage patterns, with just 5% of tasks accounting for 59% of all interactions. Originality: This research provides the first systematic evidence linking real-world generative AI usage to a comprehensive, multi-dimensional framework of intrinsic task characteristics. It introduces a data-driven classification of work archetypes that offers a new framework for analyzing the emerging human-AI division of labor.
comment: 22 pages
♻ ☆ A Neural Symbolic Model for Space Physics
In this study, we unveil a new AI model, termed PhyE2E, to discover physical formulas through symbolic regression. PhyE2E simplifies symbolic regression by decomposing it into sub-problems using the second-order derivatives of an oracle neural network, and employs a transformer model to translate data into symbolic formulas in an end-to-end manner. The resulting formulas are refined through Monte-Carlo Tree Search and Genetic Programming. We leverage a large language model to synthesize extensive symbolic expressions resembling real physics, and train the model to recover these formulas directly from data. A comprehensive evaluation reveals that PhyE2E outperforms existing state-of-the-art approaches, delivering superior symbolic accuracy, precision in data fitting, and consistency in physical units. We deployed PhyE2E to five applications in space physics, including the prediction of sunspot numbers, solar rotational angular velocity, emission line contribution functions, near-Earth plasma pressure, and lunar-tide plasma signals. The physical formulas generated by AI demonstrate a high degree of accuracy in fitting the experimental data from satellites and astronomical telescopes. We have successfully upgraded the formula proposed by NASA in 1993 regarding solar activity, and for the first time, provided the explanations for the long cycle of solar activity in an explicit form. We also found that the decay of near-Earth plasma pressure is proportional to r^2 to Earth, where subsequent mathematical derivations are consistent with satellite data from another independent study. Moreover, we found physical formulas that can describe the relationships between emission lines in the extreme ultraviolet spectrum of the Sun, temperatures, electron densities, and magnetic fields. The formula obtained is consistent with the properties that physicists had previously hypothesized it should possess.
♻ ☆ Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving
Ensuring the safety of vision-language models (VLMs) in autonomous driving systems is of paramount importance, yet existing research has largely focused on conventional benchmarks rather than safety-critical evaluation. In this work, we present SCD-Bench (Safety Cognition Driving Benchmark) a novel framework specifically designed to assess the safety cognition capabilities of VLMs within interactive driving scenarios. To address the scalability challenge of data annotation, we introduce ADA (Autonomous Driving Annotation), a semi-automated labeling system, further refined through expert review by professionals with domain-specific knowledge in autonomous driving. To facilitate scalable and consistent evaluation, we also propose an automated assessment pipeline leveraging large language models, which demonstrates over 98% agreement with human expert judgments. In addressing the broader challenge of aligning VLMs with safety cognition in driving environments, we construct SCD-Training, the first large-scale dataset tailored for this task, comprising 324.35K high-quality samples. Through extensive experiments, we show that models trained on SCD-Training exhibit marked improvements not only on SCD-Bench, but also on general and domain-specific benchmarks, offering a new perspective on enhancing safety-aware interactions in vision-language systems for autonomous driving.
♻ ☆ Plexus: Taming Billion-edge Graphs with 3D Parallel Full-graph GNN Training
Graph neural networks (GNNs) leverage the connectivity and structure of real-world graphs to learn intricate properties and relationships between nodes. Many real-world graphs exceed the memory capacity of a GPU due to their sheer size, and training GNNs on such graphs requires techniques such as mini-batch sampling to scale. The alternative approach of distributed full-graph training suffers from high communication overheads and load imbalance due to the irregular structure of graphs. We propose a three-dimensional (3D) parallel approach for full-graph training that tackles these issues and scales to billion-edge graphs. In addition, we introduce optimizations such as a double permutation scheme for load balancing, and a performance model to predict the optimal 3D configuration of our parallel implementation -- Plexus. We evaluate Plexus on six different graph datasets and show scaling results on up to 2048 GPUs of Perlmutter, and 1024 GPUs of Frontier. Plexus achieves unprecedented speedups of 2.3-12.5x over prior state of the art, and a reduction in time-to-solution by 5.2-8.7x on Perlmutter and 7.0-54.2x on Frontier.
♻ ☆ Why Foundation Models in Pathology Are Failing
In non-medical domains, foundation models (FMs) have revolutionized computer vision and language processing through large-scale self-supervised and multimodal learning. Consequently, their rapid adoption in computational pathology was expected to deliver comparable breakthroughs in cancer diagnosis, prognostication, and multimodal retrieval. However, recent systematic evaluations reveal fundamental weaknesses: low diagnostic accuracy, poor robustness, geometric instability, heavy computational demands, and concerning safety vulnerabilities. This short paper examines these shortcomings and argues that they stem from deeper conceptual mismatches between the assumptions underlying generic foundation modeling in mainstream AI and the intrinsic complexity of human tissue. Seven interrelated causes are identified: biological complexity, ineffective self-supervision, overgeneralization, excessive architectural complexity, lack of domain-specific innovation, insufficient data, and a fundamental design flaw related to tissue patch size. These findings suggest that current pathology foundation models remain conceptually misaligned with the nature of tissue morphology and call for a fundamental rethinking of the paradigm itself.
♻ ☆ VDSAgents: A PCS-Guided Multi-Agent System for Veridical Data Science Automation
Large language models (LLMs) become increasingly integrated into data science workflows for automated system design. However, these LLM-driven data science systems rely solely on the internal reasoning of LLMs, lacking guidance from scientific and theoretical principles. This limits their trustworthiness and robustness, especially when dealing with noisy and complex real-world datasets. This paper provides VDSAgents, a multi-agent system grounded in the Predictability-Computability-Stability (PCS) principles proposed in the Veridical Data Science (VDS) framework. Guided by PCS principles, the system implements a modular workflow for data cleaning, feature engineering, modeling, and evaluation. Each phase is handled by an elegant agent, incorporating perturbation analysis, unit testing, and model validation to ensure both functionality and scientific auditability. We evaluate VDSAgents on nine datasets with diverse characteristics, comparing it with state-of-the-art end-to-end data science systems, such as AutoKaggle and DataInterpreter, using DeepSeek-V3 and GPT-4o as backends. VDSAgents consistently outperforms the results of AutoKaggle and DataInterpreter, which validates the feasibility of embedding PCS principles into LLM-driven data science automation.
comment: 29 pages, 6 figures. Yunxuan Jiang and Silan Hu contributed equally. Code available at https://github.com/fengzer/VDSAgents . Submitted to Stat (manuscript ID: STAT-25-0222.R1, under review)
♻ ☆ Intelligent Design 4.0: Paradigm Evolution Toward the Agentic AI Era
Research and practice in Intelligent Design (ID) have significantly enhanced engineering innovation, efficiency, quality, and productivity over recent decades, fundamentally reshaping how engineering designers think, behave, and interact with design processes. The recent emergence of Foundation Models (FMs), particularly Large Language Models (LLMs), has demonstrated general knowledge-based reasoning capabilities, and open new avenues for further transformation in engineering design. In this context, this paper introduces Intelligent Design 4.0 (ID 4.0) as an emerging paradigm empowered by foundation model-based agentic AI systems. We review the historical evolution of ID across four distinct stages: rule-based expert systems, task-specific machine learning models, large-scale foundation AI models, and the recent emerging paradigm of foundation model-based multi-agent collaboration. We propose an ontological framework for ID 4.0 and discuss its potential to support end-to-end automation of engineering design processes through coordinated, autonomous multi-agent-based systems. Furthermore, we discuss challenges and opportunities of ID 4.0, including perspectives on data foundations, agent collaboration mechanisms, and the formulation of design problems and objectives. In sum, these insights provide a foundation for advancing Intelligent Design toward greater adaptivity, autonomy, and effectiveness in addressing the growing complexity of engineering design.
comment: 17 pages, 7 figures
♻ ☆ Bias in Decision-Making for AI's Ethical Dilemmas: A Comparative Study of ChatGPT and Claude AAAI
Recent advances in Large Language Models (LLMs) have enabled human-like responses across various tasks, raising questions about their ethical decision-making capabilities and potential biases. This study systematically evaluates how nine popular LLMs (both open-source and closed-source) respond to ethical dilemmas involving protected attributes. Across 50,400 trials spanning single and intersectional attribute combinations in four dilemma scenarios (protective vs. harmful), we assess models' ethical preferences, sensitivity, stability, and clustering patterns. Results reveal significant biases in protected attributes in all models, with differing preferences depending on model type and dilemma context. Notably, open-source LLMs show stronger preferences for marginalized groups and greater sensitivity in harmful scenarios, while closed-source models are more selective in protective situations and tend to favor mainstream groups. We also find that ethical behavior varies across dilemma types: LLMs maintain consistent patterns in protective scenarios but respond with more diverse and cognitively demanding decisions in harmful ones. Furthermore, models display more pronounced ethical tendencies under intersectional conditions than in single-attribute settings, suggesting that complex inputs reveal deeper biases. These findings highlight the need for multi-dimensional, context-aware evaluation of LLMs' ethical behavior and offer a systematic evaluation and approach to understanding and addressing fairness in LLM decision-making.
comment: This paper has been accepted by International AAAI Conference on Web and Social Media 2026 (ICWSM 2026), sunny Los Angeles, California
♻ ☆ FedCLF -- Towards Efficient Participant Selection for Federated Learning in Heterogeneous IoV Networks
Federated Learning (FL) is a distributed machine learning technique that preserves data privacy by sharing only the trained parameters instead of the client data. This makes FL ideal for highly dynamic, heterogeneous, and time-critical applications, in particular, the Internet of Vehicles (IoV) networks. However, FL encounters considerable challenges in such networks owing to the high data and device heterogeneity. To address these challenges, we propose FedCLF, i.e., FL with Calibrated Loss and Feedback control, which introduces calibrated loss as a utility in the participant selection process and a feedback control mechanism to dynamically adjust the sampling frequency of the clients. The envisaged approach (a) enhances the overall model accuracy in case of highly heterogeneous data and (b) optimizes the resource utilization for resource constrained IoV networks, thereby leading to increased efficiency in the FL process. We evaluated FedCLF vis-\`a-vis baseline models, i.e., FedAvg, Newt, and Oort, using CIFAR-10 dataset with varying data heterogeneity. Our results depict that FedCLF significantly outperforms the baseline models by up to a 16% improvement in high data heterogeneity-related scenarios with improved efficiency via reduced sampling frequency.
comment: Already published in ADMA 2024 on 13th December 2024 Wijethilake, K.E., Mahmood, A., Sheng, Q.Z. (2025). FedCLF - Towards Efficient Participant Selection for Federated Learning in Heterogeneous IoV Networks. In: Sheng, Q.Z., et al. Advanced Data Mining and Applications. ADMA 2024. Lecture Notes in Computer Science(), vol 15388. Springer, Singapore. https://doi.org/10.1007/978-981-96-0814-0_15
♻ ☆ Consistency of Responses and Continuations Generated by Large Language Models on Social Media AAAI
Large Language Models (LLMs) demonstrate remarkable capabilities in text generation, yet their emotional consistency and semantic coherence in social media contexts remain insufficiently understood. This study investigates how LLMs handle emotional content and maintain semantic relationships through continuation and response tasks using three open-source models: Gemma, Llama3 and Llama3.3 and one commercial Model:Claude. By analyzing climate change discussions from Twitter and Reddit, we examine emotional transitions, intensity patterns, and semantic consistency between human-authored and LLM-generated content. Our findings reveal that while both models maintain high semantic coherence, they exhibit distinct emotional patterns: these models show a strong tendency to moderate negative emotions. When the input text carries negative emotions such as anger, disgust, fear, or sadness, LLM tends to generate content with more neutral emotions, or even convert them into positive emotions such as joy or surprise. At the same time, we compared the LLM-generated content with human-authored content. The four models systematically generated responses with reduced emotional intensity and showed a preference for neutral rational emotions in the response task. In addition, these models all maintained a high semantic similarity with the original text, although their performance in the continuation task and the response task was different. These findings provide deep insights into the emotion and semantic processing capabilities of LLM, which are of great significance for its deployment in social media environments and human-computer interaction design.
comment: This paper has been accepted by the International AAAI Conference on Web and Social Media (ICWSM) 2026 (Los Angeles, California, U.S.)
♻ ☆ AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets
Background: Development of artificial intelligence (AI) models for lung cancer screening requires large, well-annotated low-dose computed tomography (CT) datasets and rigorous performance benchmarks. Purpose: To create a reproducible benchmarking resource leveraging the Duke Lung Cancer Screening (DLCS) and multiple public datasets to develop and evaluate models for nodule detection and classification. Materials & Methods: This retrospective study uses the DLCS dataset (1,613 patients; 2,487 nodules) and external datasets including LUNA16, LUNA25, and NLST-3D. For detection, MONAI RetinaNet models were trained on DLCS (DLCS-De) and LUNA16 (LUNA16-De) and evaluated using the Competition Performance Metric (CPM). For nodule-level classification, we compare five strategies: pretrained models (Models Genesis, Med3D), a self-supervised foundation model (FMCB), and ResNet50 with random initialization versus Strategic Warm-Start (ResNet50-SWS) pretrained with detection-derived candidate patches stratified by confidence. Results: For detection on the DLCS test set, DLCS-De achieved sensitivity 0.82 at 2 false positives/scan (CPM 0.63) versus LUNA16-De (0.62, CPM 0.45). For external validation on NLST-3D, DLCS-De (sensitivity 0.72, CPM 0.58) also outperformed LUNA16-De (sensitivity 0.64, CPM 0.49). For classification across multiple datasets, ResNet50-SWS attained AUCs of 0.71 (DLCS; 95% CI, 0.61-0.81), 0.90 (LUNA16; 0.87-0.93), 0.81 (NLST-3D; 0.79-0.82), and 0.80 (LUNA25; 0.78-0.82), matching or exceeding pretrained/self-supervised baselines. Performance differences reflected dataset label standards. Conclusion: This work establishes a standardized benchmarking resource for lung cancer AI research, supporting model development, validation, and translation. All code, models, and data are publicly released to promote reproducibility.
comment: 2 tables, 5 figures
♻ ☆ When Truthful Representations Flip Under Deceptive Instructions?
Large language models (LLMs) tend to follow maliciously crafted instructions to generate deceptive responses, posing safety challenges. How deceptive instructions alter the internal representations of LLM compared to truthful ones remains poorly understood beyond output analysis. To bridge this gap, we investigate when and how these representations ``flip'', such as from truthful to deceptive, under deceptive versus truthful/neutral instructions. Analyzing the internal representations of Llama-3.1-8B-Instruct and Gemma-2-9B-Instruct on a factual verification task, we find the model's instructed True/False output is predictable via linear probes across all conditions based on the internal representation. Further, we use Sparse Autoencoders (SAEs) to show that the Deceptive instructions induce significant representational shifts compared to Truthful/Neutral representations (which are similar), concentrated in early-to-mid layers and detectable even on complex datasets. We also identify specific SAE features highly sensitive to deceptive instruction and use targeted visualizations to confirm distinct truthful/deceptive representational subspaces. % Our analysis pinpoints layer-wise and feature-level correlates of instructed dishonesty, offering insights for LLM detection and control. Our findings expose feature- and layer-level signatures of deception, offering new insights for detecting and mitigating instructed dishonesty in LLMs.
♻ ☆ PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions NeurIPS 2025
Doctor-patient consultations require multi-turn, context-aware communication tailored to diverse patient personas. Training or evaluating doctor LLMs in such settings requires realistic patient interaction systems. However, existing simulators often fail to reflect the full range of personas seen in clinical practice. To address this, we introduce PatientSim, a patient simulator that generates realistic and diverse patient personas for clinical scenarios, grounded in medical expertise. PatientSim operates using: 1) clinical profiles, including symptoms and medical history, derived from real-world data in the MIMIC-ED and MIMIC-IV datasets, and 2) personas defined by four axes: personality, language proficiency, medical history recall level, and cognitive confusion level, resulting in 37 unique combinations. We evaluate eight LLMs for factual accuracy and persona consistency. The top-performing open-source model, Llama 3.3 70B, is validated by four clinicians to confirm the robustness of our framework. As an open-source, customizable platform, PatientSim provides a reproducible and scalable solution that can be customized for specific training needs. Offering a privacy-compliant environment, it serves as a robust testbed for evaluating medical dialogue systems across diverse patient presentations and shows promise as an educational tool for healthcare. The code is available at https://github.com/dek924/PatientSim.
comment: Accepted as a Spotlight at NeurIPS 2025 Datasets and Benchmarks Track (10 pages for main text, 4 pages for references, 36 pages for supplementary materials)
♻ ☆ Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process-where people skip reasoning for easy questions but think carefully when needed-we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO, without sacrificing performance or even improving it. Further evaluations across LLM (GSM8K), VLM (CLEVR, Super-CLEVR, GeoQA), and Agentic (AITZ) tasks-covering a range of reasoning difficulties under both 3B and 7B models-consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in RL approaches. Our code is available at https://github.com/kokolerk/TON.
comment: camera ready revision
♻ ☆ DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans
The emerging large language model role-playing agents (LLM RPAs) aim to simulate individual human behaviors, but the persona fidelity is often undermined by manually-created profiles (e.g., cherry-picked information and personality characteristics) without validating the alignment with the target individuals. To address this limitation, our work introduces the Dynamic Persona Refinement Framework (DPRF). DPRF aims to optimize the alignment of LLM RPAs' behaviors with those of target individuals by iteratively identifying the cognitive divergence, either through free-form or theory-grounded, structured analysis, between generated behaviors and human ground truth, and refining the persona profile to mitigate these divergences. We evaluate DPRF with five LLMs on four diverse behavior-prediction scenarios: formal debates, social media posts with mental health issues, public interviews, and movie reviews. DPRF can consistently improve behavioral alignment considerably over baseline personas and generalizes across models and scenarios. Our work provides a robust methodology for creating high-fidelity persona profiles and enhancing the validity of downstream applications, such as user simulation, social studies, and personalized AI.
comment: In Submission
♻ ☆ ConsistencyAI: A Benchmark to Assess LLMs' Factual Consistency When Responding to Different Demographic Groups
Is an LLM telling you different facts than it's telling me? This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) for different personas. ConsistencyAI tests whether, when users of different demographics ask identical questions, the model responds with factually inconsistent answers. Designed without involvement from LLM providers, this benchmark offers impartial evaluation and accountability. In our experiment, we queried 19 LLMs with prompts that requested 5 facts for each of 15 topics. We repeated this query 100 times for each LLM, each time adding prompt context from a different persona selected from a subset of personas modeling the general population. We processed the responses into sentence embeddings, computed cross-persona cosine similarity, and computed the weighted average of cross-persona cosine similarity to calculate factual consistency scores. In 100-persona experiments, scores ranged from 0.9065 to 0.7896, and the mean was 0.8656, which we adopt as a benchmark threshold. xAI's Grok-3 is most consistent, while several lightweight models rank lowest. Consistency varies by topic: the job market is least consistent, G7 world leaders most consistent, and issues like vaccines or the Israeli-Palestinian conflict diverge by provider. These results show that both the provider and the topic shape the factual consistency. We release our code and interactive demo to support reproducible evaluation and encourage persona-invariant prompting strategies.
comment: For associated code repository, see http://github.com/banyasp/consistencyAI For user-friendly web app, see http://v0-llm-comparison-webapp.vercel.app/
♻ ☆ Think Then Embed: Generative Context Improves Multimodal Embedding
There is a growing interest in Universal Multimodal Embeddings (UME), where models are required to generate task-specific representations. While recent studies show that Multimodal Large Language Models (MLLMs) perform well on such tasks, they treat MLLMs solely as encoders, overlooking their generative capacity. However, such an encoding paradigm becomes less effective as instructions become more complex and require compositional reasoning. Inspired by the proven effectiveness of chain-of-thought reasoning, we propose a general Think-Then-Embed (TTE) framework for UME, composed of a reasoner and an embedder. The reasoner MLLM first generates reasoning traces that explain complex queries, followed by an embedder that produces representations conditioned on both the original query and the intermediate reasoning. This explicit reasoning step enables more nuanced understanding of complex multimodal instructions. Our contributions are threefold. First, by leveraging a powerful MLLM reasoner, we achieve state-of-the-art performance on the MMEB-V2 benchmark, surpassing proprietary models trained on massive in-house datasets. Second, to reduce the dependency on large MLLM reasoners, we finetune a smaller MLLM reasoner using high-quality embedding-centric reasoning traces, achieving the best performance among open-source models with a 7% absolute gain over recently proposed models. Third, we investigate strategies for integrating the reasoner and embedder into a unified model for improved efficiency without sacrificing performance.
♻ ☆ AutoLibra: Agent Metric Induction from Open-Ended Human Feedback
Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose **AutoLibra**, a framework for agent evaluation, that transforms open-ended human feedback *e.g.* "If you find that the button is disabled, don't click it again", or "This agent has too much autonomy to decide what to do on its own" into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge as evaluators. We further propose two meta metrics to evaluate the alignment of a set of (induced) metrics with open feedback: "coverage" and "redundancy". Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and discover new metrics to analyze agents. We also present two applications of AutoLibra in agent improvement: First, we show that AutoLibra serve human prompt engineers for diagonalize agent failures and improve prompts iterative. Moreover, we find that AutoLibra can induce metrics for automatic optimization for agents, which makes agents improve through self-regulation. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
comment: https://github.com/Open-Social-World/autolibra
♻ ☆ Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum NeurIPS 2025
Self-Supervised Learning (SSL) has become a powerful solution to extract rich representations from unlabeled data. Yet, SSL research is mostly focused on clean, curated and high-quality datasets. As a result, applying SSL on noisy data remains a challenge, despite being crucial to applications such as astrophysics, medical imaging, geophysics or finance. In this work, we present a fully self-supervised framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning. Our method first trains an SSL denoiser on noisy data, then uses it to construct a denoised-to-noisy data curriculum (i.e., training first on denoised, then noisy samples) for pretraining a SSL backbone (e.g., DINOv2), combined with a teacher-guided regularization that anchors noisy embeddings to their denoised counterparts. This process encourages the model to internalize noise robustness. Notably, the denoiser can be discarded after pretraining, simplifying deployment. On ImageNet-1k with ViT-B under extreme Gaussian noise ($\sigma=255$, SNR = 0.72 dB), our method improves linear probing accuracy by 4.8% over DINOv2, demonstrating that denoiser-free robustness can emerge from noise-aware pretraining. The code is available at https://github.com/wenquanlu/noisy_dinov2.
comment: NeurIPS 2025
♻ ☆ Integrating Genomics into Multimodal EHR Foundation Models
This paper introduces an innovative Electronic Health Record (EHR) foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality, moving beyond traditional EHR-only approaches to build more holistic health profiles. Leveraging the extensive and diverse data from the All of Us (AoU) Research Program, this multimodal framework aims to learn complex relationships between clinical data and genetic predispositions. The methodology extends advancements in generative AI to the EHR foundation model space, enhancing predictive capabilities and interpretability. Evaluation on AoU data demonstrates the model's predictive value for the onset of various conditions, particularly Type 2 Diabetes (T2D), and illustrates the interplay between PRS and EHR data. The work also explores transfer learning for custom classification tasks, showcasing the architecture's versatility and efficiency. This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies, laying the groundwork for more personalized, equitable, and actionable real-world evidence generation in healthcare.
♻ ☆ Constrained Posterior Sampling: Time Series Generation with Hard Constraints
Generating realistic time series samples is crucial for stress-testing models and protecting user privacy by using synthetic data. In engineering and safety-critical applications, these samples must meet certain hard constraints that are domain-specific or naturally imposed by physics or nature. Consider, for example, generating electricity demand patterns with constraints on peak demand times. This can be used to stress-test the functioning of power grids during adverse weather conditions. Existing approaches for generating constrained time series are either not scalable or degrade sample quality. To address these challenges, we introduce Constrained Posterior Sampling (CPS), a diffusion-based sampling algorithm that aims to project the posterior mean estimate into the constraint set after each denoising update. Notably, CPS scales to a large number of constraints ($\sim100$) without requiring additional training. We provide theoretical justifications highlighting the impact of our projection step on sampling. Empirically, CPS outperforms state-of-the-art methods in sample quality and similarity to real time series by around 70\% and 22\%, respectively, on real-world stocks, traffic, and air quality datasets.
♻ ☆ Reward Collapse in Aligning Large Language Models
The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences, which are often represented as rankings of responses to prompts. In this paper, we document the phenomenon of \textit{reward collapse}, an empirical observation where the prevailing ranking-based approach results in an \textit{identical} reward distribution \textit{regardless} of the prompts during the terminal phase of training. This outcome is undesirable as open-ended prompts like ``write a short story about your best friend'' should yield a continuous range of rewards for their completions, while specific prompts like ``what is the capital of New Zealand'' should generate either high or low rewards. Our theoretical investigation reveals that reward collapse is primarily due to the insufficiency of the ranking-based objective function to incorporate prompt-related information during optimization. This insight allows us to derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic regime. To overcome reward collapse, we introduce a prompt-aware optimization scheme that provably admits a prompt-dependent reward distribution within the interpolating regime. Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
comment: Accepted for publication in the Journal of Data Science (JDS), reference JDS1201
♻ ☆ Learning World Models for Interactive Video Generation
Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanism leads to incoherence of world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models. In contrast, naive autoregressive generation with extended context windows and retrieval-augmented generation prove less effective for video generation, primarily due to the limited in-context learning capabilities of current video models. Our work illuminates the fundamental challenges in video world models and establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.
comment: Project page: https://sites.google.com/view/vrag
♻ ☆ Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It
Does vision-and-language (VL) training change the linguistic representations of language models in meaningful ways? Most results in the literature have shown inconsistent or marginal differences, both behaviorally and representationally. In this work, we start from the hypothesis that the domain in which VL training could have a significant effect is lexical-conceptual knowledge, in particular its taxonomic organization. Through comparing minimal pairs of text-only LMs and their VL-trained counterparts, we first show that the VL models often outperform their text-only counterparts on a text-only question-answering task that requires taxonomic understanding of concepts mentioned in the questions. Using an array of targeted behavioral and representational analyses, we show that the LMs and VLMs do not differ significantly in terms of their taxonomic knowledge itself, but they differ in how they represent questions that contain concepts in a taxonomic relation vs. a non-taxonomic relation. This implies that the taxonomic knowledge itself does not change substantially through additional VL training, but VL training does improve the deployment of this knowledge in the context of a specific task, even when the presentation of the task is purely linguistic.
♻ ☆ Language Models can Self-Improve at State-Value Estimation for Better Search
Collecting ground-truth rewards or human demonstrations for multi-step reasoning tasks is often prohibitively expensive, particularly in interactive domains such as web tasks. We introduce Self-Taught Lookahead (STL), a reward-free framework that improves language model-based value functions by reasoning explicitly about state transitions. STL can be viewed as a chain-of-thought analogue of the value iteration algorithm: instead of regressing directly on numeric values, a value LLM is trained to simulate a step of lookahead in natural language - predicting the next action, resulting state, and rationale for its value, thereby refining value estimates without any labeled data. This self-supervised procedure yields more accurate state-value predictions, which in turn enable lightweight search algorithms to expand fewer states while maintaining strong performance. Empirically, STL-trained value models built on moderately sized (8B parameter) open-weight LLMs boost web agent success rates by 39%, achieving comparable performance with proprietary models. STL also generalizes to multi-hop QA and math puzzles. We find that STL enables small open-source models to guide efficient search, reducing inference costs by integrating explicit reasoning with value learning.
♻ ☆ GenIR: Generative Visual Feedback for Mental Image Retrieval NeurIPS 2025
Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind. That is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system's understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction
comment: NeurIPS 2025
♻ ☆ Vital Insight: Assisting Experts' Context-Driven Sensemaking of Multi-modal Personal Tracking Data Using Visualization and Human-In-The-Loop LLM
Passive tracking methods, such as phone and wearable sensing, have become dominant in monitoring human behaviors in modern ubiquitous computing studies. While there have been significant advances in machine-learning approaches to translate periods of raw sensor data to model momentary behaviors, (e.g., physical activity recognition), there still remains a significant gap in the translation of these sensing streams into meaningful, high-level, context-aware insights that are required for various applications (e.g., summarizing an individual's daily routine). To bridge this gap, experts often need to employ a context-driven sensemaking process in real-world studies to derive insights. This process often requires manual effort and can be challenging even for experienced researchers due to the complexity of human behaviors. We conducted three rounds of user studies with 21 experts to explore solutions to address challenges with sensemaking. We follow a human-centered design process to identify needs and design, iterate, build, and evaluate Vital Insight (VI), a novel, LLM-assisted, prototype system to enable human-in-the-loop inference (sensemaking) and visualizations of multi-modal passive sensing data from smartphones and wearables. Using the prototype as a technology probe, we observe experts' interactions with it and develop an expert sensemaking model that explains how experts move between direct data representations and AI-supported inferences to explore, question, and validate insights. Through this iterative process, we also synthesize and discuss a list of design implications for the design of future AI-augmented visualization systems to better assist experts' sensemaking processes in multi-modal health sensing data.
comment: Jiachen Li, Xiwen Li, Justin Steinberg, Akshat Choube, Bingsheng Yao, Xuhai Xu, Dakuo Wang, Elizabeth Mynatt, and Varun Mishra. 2025. Vital Insight: Assisting Experts' Context-Driven Sensemaking of Multi-modal Personal Tracking Data Using Visualization and Human-in-the-Loop LLM. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 9, 3, Article 101 (September 2025), 37 pages
♻ ☆ Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents
LLM-based agents are increasingly moving towards proactivity: rather than awaiting instruction, they exercise agency to anticipate user needs and solve them autonomously. However, evaluating proactivity is challenging; current benchmarks are constrained to localized context, limiting their ability to test reasoning across sources and longer time horizons. To address this gap, we present PROBE (Proactive Resolution Of BottlEnecks). PROBE decomposes proactivity as a pipeline of three core capabilities: (1) searching for unspecified issues, (2) identifying specific bottlenecks, and (3) executing appropriate resolutions. We apply PROBE to evaluate leading LLMs and popular agentic frameworks, showing that even state-of-the-art models struggle to solve this benchmark. Computing our consistent measurements across frontier LLMs and agents, we find that the best end-to-end performance of 40% is achieved by both GPT-5 and Claude Opus-4.1. Additionally, we demonstrate the relative capabilities of each model and analyze mutual failure modes. Our results highlight the current limitations of autonomous action in agentic systems, and expose promising future research directions.
♻ ☆ M-Prometheus: A Suite of Open Multilingual LLM Judges
The use of language models for automatically evaluating long-form text (LLM-as-a-judge) is becoming increasingly common, yet most LLM judges are optimized exclusively for English, with strategies for enhancing their multilingual evaluation capabilities remaining largely unexplored in the current literature. This has created a disparity in the quality of automatic evaluation methods for non-English languages, ultimately hindering the development of models with better multilingual capabilities. To bridge this gap, we introduce M-Prometheus, a suite of open-weight LLM judges ranging from 3B to 14B parameters that can provide both direct assessment and pairwise comparison feedback on multilingual outputs. M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation (MT) evaluation covering 4 language pairs. Furthermore, M-Prometheus models can be leveraged at decoding time to significantly improve generated outputs across all 3 tested languages, showcasing their utility for the development of better multilingual models. Lastly, through extensive ablations, we identify the key factors for obtaining an effective multilingual judge, including backbone model selection and training on synthetic multilingual feedback data instead of translated data. We release our models, training dataset, and code.
♻ ☆ Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
As language models improve and become capable of performing more complex tasks across modalities, evaluating them automatically becomes increasingly challenging. Developing strong and robust task-specific automatic metrics gets harder, and human-annotated test sets -- which are expensive to create -- saturate more quickly. A compelling alternative is to design reliable strategies to automate the creation of test data and evaluation, but previous attempts either rely on pre-existing data, or focus solely on individual tasks. We present Zero-shot Benchmarking (ZSB), a framework for creating high-quality benchmarks for any task by leveraging language models for both synthetic test data creation and evaluation. ZSB is simple and flexible: it requires only the creation of a prompt for data generation and one for evaluation; it is scalable to tasks and languages where collecting real-world data is costly or impractical; it is model-agnostic, allowing the creation of increasingly challenging benchmarks as models improve. To assess the effectiveness of our framework, we create benchmarks for five text-only tasks and a multi-modal one: general capabilities in four languages (English, Chinese, French, and Korean), translation, and general vision-language capabilities in English. We then rank a broad range of open and closed systems on our benchmarks. ZSB rankings consistently correlate strongly with human rankings, outperforming widely-adopted standard benchmarks. Through ablations, we find that strong benchmarks can be created with open models, and that judge model size and dataset variety are crucial drivers of performance. We release all our benchmarks, and code to reproduce our experiments and to produce new benchmarks.
♻ ☆ Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study
Large language models (LLMs) have shown impressive capabilities across tasks such as mathematics, coding, and reasoning, yet their learning ability, which is crucial for adapting to dynamic environments and acquiring new knowledge, remains underexplored. In this work, we address this gap by introducing a framework inspired by cognitive psychology and education. Specifically, we decompose general learning ability into three distinct, complementary dimensions: Learning from Instructor (acquiring knowledge via explicit guidance), Learning from Concept (internalizing abstract structures and generalizing to new contexts), and Learning from Experience (adapting through accumulated exploration and feedback). We conduct a comprehensive empirical study across the three learning dimensions and identify several insightful findings, such as (i) interaction improves learning; (ii) conceptual understanding is scale-emergent and benefits larger models; and (iii) LLMs are effective few-shot learners but not many-shot learners. Based on our framework and empirical findings, we introduce a benchmark that provides a unified and realistic evaluation of LLMs' general learning abilities across three learning cognition dimensions. It enables diagnostic insights and supports evaluation and development of more adaptive and human-like models.
♻ ☆ Earth AI: Unlocking Geospatial Insights with Foundation Models and Cross-Modal Reasoning
Geospatial data offers immense potential for understanding our planet. However, the sheer volume and diversity of this data along with its varied resolutions, timescales, and sparsity pose significant challenges for thorough analysis and interpretation. This paper introduces Earth AI, a family of geospatial AI models and agentic reasoning that enables significant advances in our ability to unlock novel and profound insights into our planet. This approach is built upon foundation models across three key domains--Planet-scale Imagery, Population, and Environment--and an intelligent Gemini-powered reasoning engine. We present rigorous benchmarks showcasing the power and novel capabilities of our foundation models and validate that when used together, they provide complementary value for geospatial inference and their synergies unlock superior predictive capabilities. To handle complex, multi-step queries, we developed a Gemini-powered agent that jointly reasons over our multiple foundation models along with large geospatial data sources and tools. On a new benchmark of real-world crisis scenarios, our agent demonstrates the ability to deliver critical and timely insights, effectively bridging the gap between raw geospatial data and actionable understanding.
♻ ☆ VC4VG: Optimizing Video Captions for Text-to-Video Generation EMNLP 2025
Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/alimama-creative/VC4VG to support further research.
comment: Accepted by EMNLP 2025
♻ ☆ Latent Chain-of-Thought for Visual Reasoning NeurIPS 2025
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
comment: NeurIPS 2025
♻ ☆ Fuzzy, Symbolic, and Contextual: Enhancing LLM Instruction via Cognitive Scaffolding
We study how prompt-level inductive biases influence the cognitive behavior of large language models (LLMs) in instructional dialogue. We introduce a symbolic scaffolding method paired with a short-term memory schema designed to promote adaptive, structured reasoning in Socratic tutoring. Using controlled ablation across five system variants, we evaluate model outputs via expert-designed rubrics covering scaffolding, responsiveness, symbolic reasoning, and conversational memory. We present preliminary results using an LLM-based evaluation framework aligned to a cognitively grounded rubric. This enables scalable, systematic comparisons across architectural variants in early-stage experimentation. The preliminary results show that our full system consistently outperforms baseline variants. Analysis reveals that removing memory or symbolic structure degrades key cognitive behaviors, including abstraction, adaptive probing, and conceptual continuity. These findings support a processing-level account in which prompt-level cognitive scaffolds can reliably shape emergent instructional strategies in LLMs.
♻ ☆ Plasticity as the Mirror of Empowerment NeurIPS 2025
Agents are minimally entities that are influenced by their past observations and act to influence future observations. This latter capacity is captured by empowerment, which has served as a vital framing concept across artificial intelligence and cognitive science. This former capacity, however, is equally foundational: In what ways, and to what extent, can an agent be influenced by what it observes? In this paper, we ground this concept in a universal agent-centric measure that we refer to as plasticity, and reveal a fundamental connection to empowerment. Following a set of desiderata on a suitable definition, we define plasticity using a new information-theoretic quantity we call the generalized directed information. We show that this new quantity strictly generalizes the directed information introduced by Massey (1990) while preserving all of its desirable properties. Under this definition, we find that plasticity is well thought of as the mirror of empowerment: The two concepts are defined using the same measure, with only the direction of influence reversed. Our main result establishes a tension between the plasticity and empowerment of an agent, suggesting that agent design needs to be mindful of both characteristics. We explore the implications of these findings, and suggest that plasticity, empowerment, and their relationship are essential to understanding agency
comment: NeurIPS 2025
Machine Learning 227
☆ E-Scores for (In)Correctness Assessment of Generative Model Outputs
While generative models, especially large language models (LLMs), are ubiquitous in today's world, principled mechanisms to assess their (in)correctness are limited. Using the conformal prediction framework, previous works construct sets of LLM responses where the probability of including an incorrect response, or error, is capped at a desired user-defined tolerance level. However, since these methods are based on p-values, they are susceptible to p-hacking, i.e., choosing the tolerance level post-hoc can invalidate the guarantees. We therefore leverage e-values to complement generative model outputs with e-scores as a measure of incorrectness. In addition to achieving the same statistical guarantees as before, e-scores provide users flexibility in adaptively choosing tolerance levels after observing the e-scores themselves, by upper bounding a post-hoc notion of error called size distortion. We experimentally demonstrate their efficacy in assessing LLM outputs for different correctness types: mathematical factuality and property constraints satisfaction.
☆ Neural Stochastic Flows: Solver-Free Modelling and Inference for SDE Solutions NeurIPS 2025
Stochastic differential equations (SDEs) are well suited to modelling noisy and irregularly sampled time series found in finance, physics, and machine learning. Traditional approaches require costly numerical solvers to sample between arbitrary time points. We introduce Neural Stochastic Flows (NSFs) and their latent variants, which directly learn (latent) SDE transition laws using conditional normalising flows with architectural constraints that preserve properties inherited from stochastic flows. This enables one-shot sampling between arbitrary states and yields up to two orders of magnitude speed-ups at large time gaps. Experiments on synthetic SDE simulations and on real-world tracking and video data show that NSFs maintain distributional accuracy comparable to numerical approaches while dramatically reducing computation for arbitrary time-point sampling.
comment: NeurIPS 2025 (poster). Project page: https://nkiyohara.github.io/nsf-neurips2025/
☆ Synthetic Data Reveals Generalization Gaps in Correlated Multiple Instance Learning
Multiple instance learning (MIL) is often used in medical imaging to classify high-resolution 2D images by processing patches or classify 3D volumes by processing slices. However, conventional MIL approaches treat instances separately, ignoring contextual relationships such as the appearance of nearby patches or slices that can be essential in real applications. We design a synthetic classification task where accounting for adjacent instance features is crucial for accurate prediction. We demonstrate the limitations of off-the-shelf MIL approaches by quantifying their performance compared to the optimal Bayes estimator for this task, which is available in closed-form. We empirically show that newer correlated MIL methods still struggle to generalize as well as possible when trained from scratch on tens of thousands of instances.
☆ MLPrE -- A tool for preprocessing and exploratory data analysis prior to machine learning model construction
With the recent growth of Deep Learning for AI, there is a need for tools to meet the demand of data flowing into those models. In some cases, source data may exist in multiple formats, and therefore the source data must be investigated and properly engineered for a Machine Learning model or graph database. Overhead and lack of scalability with existing workflows limit integration within a larger processing pipeline such as Apache Airflow, driving the need for a robust, extensible, and lightweight tool to preprocess arbitrary datasets that scales with data type and size. To address this, we present Machine Learning Preprocessing and Exploratory Data Analysis, MLPrE, in which SparkDataFrames were utilized to hold data during processing and ensure scalability. A generalizable JSON input file format was utilized to describe stepwise changes to that DataFrame. Stages were implemented for input and output, filtering, basic statistics, feature engineering, and exploratory data analysis. A total of 69 stages were implemented into MLPrE, of which we highlight and demonstrate key stages using six diverse datasets. We further highlight MLPrE's ability to independently process multiple fields in flat files and recombine them, otherwise requiring an additional pipeline, using a UniProt glossary term dataset. Building on this advantage, we demonstrated the clustering stage with available wine quality data. Lastly, we demonstrate the preparation of data for a graph database in the final stages of MLPrE using phosphosite kinase data. Overall, our MLPrE tool offers a generalizable and scalable tool for preprocessing and early data analysis, filling a critical need for such a tool given the ever expanding use of machine learning. This tool serves to accelerate and simplify early stage development in larger workflows.
☆ How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs NeurIPS 2025
Pretrained Transformers demonstrate remarkable in-context learning (ICL) capabilities, enabling them to adapt to new tasks from demonstrations without parameter updates. However, theoretical studies often rely on simplified architectures (e.g., omitting MLPs), data models (e.g., linear regression with isotropic inputs), and single-source training, limiting their relevance to realistic settings. In this work, we study ICL in pretrained Transformers with nonlinear MLP heads on nonlinear tasks drawn from multiple data sources with heterogeneous input, task, and noise distributions. We analyze a model where the MLP comprises two layers, with the first layer trained via a single gradient step and the second layer fully optimized. Under high-dimensional asymptotics, we prove that such models are equivalent in ICL error to structured polynomial predictors, leveraging results from the theory of Gaussian universality and orthogonal polynomials. This equivalence reveals that nonlinear MLPs meaningfully enhance ICL performance, particularly on nonlinear tasks, compared to linear baselines. It also enables a precise analysis of data mixing effects: we identify key properties of high-quality data sources (low noise, structured covariances) and show that feature learning emerges only when the task covariance exhibits sufficient structure. These results are validated empirically across various activation functions, model sizes, and data distributions. Finally, we experiment with a real-world scenario involving multilingual sentiment analysis where each language is treated as a different source. Our experimental results for this case exemplify how our findings extend to real-world cases. Overall, our work advances the theoretical foundations of ICL in Transformers and provides actionable insight into the role of architecture and data in ICL.
comment: NeurIPS 2025, 24 pages, 6 figures
☆ Meshless solutions of PDE inverse problems on irregular geometries
Solving inverse and optimization problems over solutions of nonlinear partial differential equations (PDEs) on complex spatial domains is a long-standing challenge. Here we introduce a method that parameterizes the solution using spectral bases on arbitrary spatiotemporal domains, whereby the basis is defined on a hyperrectangle containing the true domain. We find the coefficients of the basis expansion by solving an optimization problem whereby both the equations, the boundary conditions and any optimization targets are enforced by a loss function, building on a key idea from Physics-Informed Neural Networks (PINNs). Since the representation of the function natively has exponential convergence, so does the solution of the optimization problem, as long as it can be solved efficiently. We find empirically that the optimization protocols developed for machine learning find solutions with exponential convergence on a wide range of equations. The method naturally allows for the incorporation of data assimilation by including additional terms in the loss function, and for the efficient solution of optimization problems over the PDE solutions.
☆ Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation
Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates the alignment between the draft and target model outputs, coupled with the inadequate use of the two-dimensional spatial structure inherent in images, thereby limiting the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models, while preserving both image fidelity and diversity.
☆ LieSolver: A PDE-constrained solver for IBVPs using Lie symmetries
We introduce a method for efficiently solving initial-boundary value problems (IBVPs) that uses Lie symmetries to enforce the associated partial differential equation (PDE) exactly by construction. By leveraging symmetry transformations, the model inherently incorporates the physical laws and learns solutions from initial and boundary data. As a result, the loss directly measures the model's accuracy, leading to improved convergence. Moreover, for well-posed IBVPs, our method enables rigorous error estimation. The approach yields compact models, facilitating an efficient optimization. We implement LieSolver and demonstrate its application to linear homogeneous PDEs with a range of initial conditions, showing that it is faster and more accurate than physics-informed neural networks (PINNs). Overall, our method improves both computational efficiency and the reliability of predictions for PDE-constrained problems.
☆ Physics-Guided Conditional Diffusion Networks for Microwave Image Reconstruction
A conditional latent-diffusion based framework for solving the electromagnetic inverse scattering problem associated with microwave imaging is introduced. This generative machine-learning model explicitly mirrors the non-uniqueness of the ill-posed inverse problem. Unlike existing inverse solvers utilizing deterministic machine learning techniques that produce a single reconstruction, the proposed latent-diffusion model generates multiple plausible permittivity maps conditioned on measured scattered-field data, thereby generating several potential instances in the range-space of the non-unique inverse mapping. A forward electromagnetic solver is integrated into the reconstruction pipeline as a physics-based evaluation mechanism. The space of candidate reconstructions form a distribution of possibilities consistent with the conditioning data and the member of this space yielding the lowest scattered-field data discrepancy between the predicted and measured scattered fields is reported as the final solution. Synthetic and experimental labeled datasets are used for training and evaluation of the model. An innovative labeled synthetic dataset is created that exemplifies a varied set of scattering features. Training of the model using this new dataset produces high quality permittivity reconstructions achieving improved generalization with excellent fidelity to shape recognition. The results highlight the potential of hybrid generative physics frameworks as a promising direction for robust, data-driven microwave imaging.
☆ Scaling flow-based approaches for topology sampling in $\mathrm{SU}(3)$ gauge theory
We develop a methodology based on out-of-equilibrium simulations to mitigate topological freezing when approaching the continuum limit of lattice gauge theories. We reduce the autocorrelation of the topological charge employing open boundary conditions, while removing exactly their unphysical effects using a non-equilibrium Monte Carlo approach in which periodic boundary conditions are gradually switched on. We perform a detailed analysis of the computational costs of this strategy in the case of the four-dimensional $\mathrm{SU}(3)$ Yang-Mills theory. After achieving full control of the scaling, we outline a clear strategy to sample topology efficiently in the continuum limit, which we check at lattice spacings as small as $0.045$ fm. We also generalize this approach by designing a customized Stochastic Normalizing Flow for evolutions in the boundary conditions, obtaining superior performances with respect to the purely stochastic non-equilibrium approach, and paving the way for more efficient future flow-based solutions.
comment: 1+39 pages, 14 figures
☆ Convolutional Spiking-based GRU Cell for Spatio-temporal Data
Spike-based temporal messaging enables SNNs to efficiently process both purely temporal and spatio-temporal time-series or event-driven data. Combining SNNs with Gated Recurrent Units (GRUs), a variant of recurrent neural networks, gives rise to a robust framework for sequential data processing; however, traditional RNNs often lose local details when handling long sequences. Previous approaches, such as SpikGRU, fail to capture fine-grained local dependencies in event-based spatio-temporal data. In this paper, we introduce the Convolutional Spiking GRU (CS-GRU) cell, which leverages convolutional operations to preserve local structure and dependencies while integrating the temporal precision of spiking neurons with the efficient gating mechanisms of GRUs. This versatile architecture excels on both temporal datasets (NTIDIGITS, SHD) and spatio-temporal benchmarks (MNIST, DVSGesture, CIFAR10DVS). Our experiments show that CS-GRU outperforms state-of-the-art GRU variants by an average of 4.35%, achieving over 90% accuracy on sequential tasks and up to 99.31% on MNIST. It is worth noting that our solution achieves 69% higher efficiency compared to SpikGRU. The code is available at: https://github.com/YesmineAbdennadher/CS-GRU.
comment: 6 pages, 1 figure. Published in 2025 IEEE International Workshop On Machine Learning for Signal Processing, Aug. 31-Sep. 3, 2025, Istanbul, Turkey
☆ PyDPF: A Python Package for Differentiable Particle Filtering
State-space models (SSMs) are a widely used tool in time series analysis. In the complex systems that arise from real-world data, it is common to employ particle filtering (PF), an efficient Monte Carlo method for estimating the hidden state corresponding to a sequence of observations. Applying particle filtering requires specifying both the parametric form and the parameters of the system, which are often unknown and must be estimated. Gradient-based optimisation techniques cannot be applied directly to standard particle filters, as the filters themselves are not differentiable. However, several recently proposed methods modify the resampling step to make particle filtering differentiable. In this paper, we present an implementation of several such differentiable particle filters (DPFs) with a unified API built on the popular PyTorch framework. Our implementation makes these algorithms easily accessible to a broader research community and facilitates straightforward comparison between them. We validate our framework by reproducing experiments from several existing studies and demonstrate how DPFs can be applied to address several common challenges with state space modelling.
comment: 42 pages, 0 figures, under review at the Journal of Statistical Software, the python package can be found at https://pypi.org/project/pydpf/ , the full documentation at https://python-dpf.readthedocs.io/en/latest/#documentation-index , and the source code including experiments and replication material at https://github.com/John-JoB/pydpf
☆ A Configuration-First Framework for Reproducible, Low-Code Localization
Machine learning is increasingly permeating radio-based localization services. To keep results credible and comparable, everyday workflows should make rigorous experiment specification and exact repeatability the default, without blocking advanced experimentation. However, in practice, researchers face a three-way gap that could be filled by a framework that offers (i) low coding effort for end-to-end studies, (ii) reproducibility by default including versioned code, data, and configurations, controlled randomness, isolated runs, and recorded artifacts, and (iii) built-in extensibility so new models, metrics, and stages can be added with minimal integration effort. Existing tools rarely deliver all three for machine learning in general and localization workflows in particular. In this paper we introduce LOCALIZE, a low-code, configuration-first framework for radio localization in which experiments are declared in human-readable configuration, a workflow orchestrator runs standardized pipelines from data preparation to reporting, and all artifacts, such as datasets, models, metrics, and reports, are versioned. The preconfigured, versioned datasets reduce initial setup and boilerplate, speeding up model development and evaluation. The design, with clear extension points, allows experts to add components without reworking the infrastructure. In a qualitative comparison and a head-to-head study against a plain Jupyter notebook baseline, we show that the framework reduces authoring effort while maintaining comparable runtime and memory behavior. Furthermore, using a Bluetooth Low Energy dataset, we show that scaling across training data (1x to 10x) keeps orchestration overheads bounded as data grows. Overall, the framework makes reproducible machine-learning-based localization experimentation practical, accessible, and extensible.
comment: 20 pages, 7 figures. Preprint submitted to ACM Transactions on Software Engineering and Methodology (TOSEM), 2025
☆ Model Inversion Attacks Meet Cryptographic Fuzzy Extractors
Model inversion attacks pose an open challenge to privacy-sensitive applications that use machine learning (ML) models. For example, face authentication systems use modern ML models to compute embedding vectors from face images of the enrolled users and store them. If leaked, inversion attacks can accurately reconstruct user faces from the leaked vectors. There is no systematic characterization of properties needed in an ideal defense against model inversion, even for the canonical example application of a face authentication system susceptible to data breaches, despite a decade of best-effort solutions. In this paper, we formalize the desired properties of a provably strong defense against model inversion and connect it, for the first time, to the cryptographic concept of fuzzy extractors. We further show that existing fuzzy extractors are insecure for use in ML-based face authentication. We do so through a new model inversion attack called PIPE, which achieves a success rate of over 89% in most cases against prior schemes. We then propose L2FE-Hash, the first candidate fuzzy extractor which supports standard Euclidean distance comparators as needed in many ML-based applications, including face authentication. We formally characterize its computational security guarantees, even in the extreme threat model of full breach of stored secrets, and empirically show its usable accuracy in face authentication for practical face distributions. It offers attack-agnostic security without requiring any re-training of the ML model it protects. Empirically, it nullifies both prior state-of-the-art inversion attacks as well as our new PIPE attack.
☆ Graph Network-based Structural Simulator: Graph Neural Networks for Structural Dynamics
Graph Neural Networks (GNNs) have recently been explored as surrogate models for numerical simulations. While their applications in computational fluid dynamics have been investigated, little attention has been given to structural problems, especially for dynamic cases. To address this gap, we introduce the Graph Network-based Structural Simulator (GNSS), a GNN framework for surrogate modeling of dynamic structural problems. GNSS follows the encode-process-decode paradigm typical of GNN-based machine learning models, and its design makes it particularly suited for dynamic simulations thanks to three key features: (i) expressing node kinematics in node-fixed local frames, which avoids catastrophic cancellation in finite-difference velocities; (ii) employing a sign-aware regression loss, which reduces phase errors in long rollouts; and (iii) using a wavelength-informed connectivity radius, which optimizes graph construction. We evaluate GNSS on a case study involving a beam excited by a 50kHz Hanning-modulated pulse. The results show that GNSS accurately reproduces the physics of the problem over hundreds of timesteps and generalizes to unseen loading conditions, where existing GNNs fail to converge or deliver meaningful predictions. Compared with explicit finite element baselines, GNSS achieves substantial inference speedups while preserving spatial and temporal fidelity. These findings demonstrate that locality-preserving GNNs with physics-consistent update rules are a competitive alternative for dynamic, wave-dominated structural simulations.
comment: 16 pages, 14 figures
☆ Mechanistic Interpretability of RNNs emulating Hidden Markov Models NeurIPS 2025
Recurrent neural networks (RNNs) provide a powerful approach in neuroscience to infer latent dynamics in neural populations and to generate hypotheses about the neural computations underlying behavior. However, past work has focused on relatively simple, input-driven, and largely deterministic behaviors - little is known about the mechanisms that would allow RNNs to generate the richer, spontaneous, and potentially stochastic behaviors observed in natural settings. Modeling with Hidden Markov Models (HMMs) has revealed a segmentation of natural behaviors into discrete latent states with stochastic transitions between them, a type of dynamics that may appear at odds with the continuous state spaces implemented by RNNs. Here we first show that RNNs can replicate HMM emission statistics and then reverse-engineer the trained networks to uncover the mechanisms they implement. In the absence of inputs, the activity of trained RNNs collapses towards a single fixed point. When driven by stochastic input, trajectories instead exhibit noise-sustained dynamics along closed orbits. Rotation along these orbits modulates the emission probabilities and is governed by transitions between regions of slow, noise-driven dynamics connected by fast, deterministic transitions. The trained RNNs develop highly structured connectivity, with a small set of "kick neurons" initiating transitions between these regions. This mechanism emerges during training as the network shifts into a regime of stochastic resonance, enabling it to perform probabilistic computations. Analyses across multiple HMM architectures - fully connected, cyclic, and linear-chain - reveal that this solution generalizes through the modular reuse of the same dynamical motif, suggesting a compositional principle by which RNNs can emulate complex discrete latent dynamics.
comment: Accepted at NeurIPS 2025
☆ Spectral Perturbation Bounds for Low-Rank Approximation with Applications to Privacy NeurIPS 2025
A central challenge in machine learning is to understand how noise or measurement errors affect low-rank approximations, particularly in the spectral norm. This question is especially important in differentially private low-rank approximation, where one aims to preserve the top-$p$ structure of a data-derived matrix while ensuring privacy. Prior work often analyzes Frobenius norm error or changes in reconstruction quality, but these metrics can over- or under-estimate true subspace distortion. The spectral norm, by contrast, captures worst-case directional error and provides the strongest utility guarantees. We establish new high-probability spectral-norm perturbation bounds for symmetric matrices that refine the classical Eckart--Young--Mirsky theorem and explicitly capture interactions between a matrix $A \in \mathbb{R}^{n \times n}$ and an arbitrary symmetric perturbation $E$. Under mild eigengap and norm conditions, our bounds yield sharp estimates for $\|(A + E)_p - A_p\|$, where $A_p$ is the best rank-$p$ approximation of $A$, with improvements of up to a factor of $\sqrt{n}$. As an application, we derive improved utility guarantees for differentially private PCA, resolving an open problem in the literature. Our analysis relies on a novel contour bootstrapping method from complex analysis and extends it to a broad class of spectral functionals, including polynomials and matrix exponentials. Empirical results on real-world datasets confirm that our bounds closely track the actual spectral error under diverse perturbation regimes.
comment: NeurIPS 2025
☆ Subgraph Federated Learning via Spectral Methods NeurIPS
We consider the problem of federated learning (FL) with graph-structured data distributed across multiple clients. In particular, we address the prevalent scenario of interconnected subgraphs, where interconnections between clients significantly influence the learning process. Existing approaches suffer from critical limitations, either requiring the exchange of sensitive node embeddings, thereby posing privacy risks, or relying on computationally-intensive steps, which hinders scalability. To tackle these challenges, we propose FedLap, a novel framework that leverages global structure information via Laplacian smoothing in the spectral domain to effectively capture inter-node dependencies while ensuring privacy and scalability. We provide a formal analysis of the privacy of FedLap, demonstrating that it preserves privacy. Notably, FedLap is the first subgraph FL scheme with strong privacy guarantees. Extensive experiments on benchmark datasets demonstrate that FedLap achieves competitive or superior utility compared to existing techniques.
comment: To be presented at The Annual Conference on Neural Information Processing Systems (NeurIPS) 2025
☆ Continuous subsurface property retrieval from sparse radar observations using physics informed neural networks
Estimating subsurface dielectric properties is essential for applications ranging from environmental surveys of soils to nondestructive evaluation of concrete in infrastructure. Conventional wave inversion methods typically assume few discrete homogeneous layers and require dense measurements or strong prior knowledge of material boundaries, limiting scalability and accuracy in realistic settings where properties vary continuously. We present a physics informed machine learning framework that reconstructs subsurface permittivity as a fully neural, continuous function of depth, trained to satisfy both measurement data and Maxwells equations. We validate the framework with both simulations and custom built radar experiments on multilayered natural materials. Results show close agreement with in-situ permittivity measurements (R^2=0.93), with sensitivity to even subtle variations (Delta eps_r=2). Parametric analysis reveals that accurate profiles can be recovered with as few as three strategically placed sensors in two layer systems. This approach reframes subsurface inversion from boundary-driven to continuous property estimation, enabling accurate characterization of smooth permittivity variations and advancing electromagnetic imaging using low cost radar systems.
comment: 22 pages, 9 main text figures + 2 supplementary figures
☆ Are Language Models Efficient Reasoners? A Perspective from Logic Programming NeurIPS 2025
Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language -- as generated by an LM -- with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with various number of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions -- even with minimal, domain-consistent distractions -- and the proofs they generate frequently exhibit detours through irrelevant inferences.
comment: Accepted to NeurIPS 2025
☆ Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA's hidden representations and analyze attention maps, further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
comment: 13 pages, 6 figures
☆ BOLT-GAN: Bayes-Optimal Loss for Stable GAN Training
We introduce BOLT-GAN, a simple yet effective modification of the WGAN framework inspired by the Bayes Optimal Learning Threshold (BOLT). We show that with a Lipschitz continuous discriminator, BOLT-GAN implicitly minimizes a different metric distance than the Earth Mover (Wasserstein) distance and achieves better training stability. Empirical evaluations on four standard image generation benchmarks (CIFAR-10, CelebA-64, LSUN Bedroom-64, and LSUN Church-64) show that BOLT-GAN consistently outperforms WGAN, achieving 10-60% lower Frechet Inception Distance (FID). Our results suggest that BOLT is a broadly applicable principle for enhancing GAN training.
☆ INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage , though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
☆ Uncertainty Quantification for Regression: A Unified Framework based on kernel scores
Regression tasks, notably in safety-critical domains, require proper uncertainty quantification, yet the literature remains largely classification-focused. In this light, we introduce a family of measures for total, aleatoric, and epistemic uncertainty based on proper scoring rules, with a particular emphasis on kernel scores. The framework unifies several well-known measures and provides a principled recipe for designing new ones whose behavior, such as tail sensitivity, robustness, and out-of-distribution responsiveness, is governed by the choice of kernel. We prove explicit correspondences between kernel-score characteristics and downstream behavior, yielding concrete design guidelines for task-specific measures. Extensive experiments demonstrate that these measures are effective in downstream tasks and reveal clear trade-offs among instantiations, including robustness and out-of-distribution detection performance.
☆ Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning
Training deep neural networks (DNNs) with backpropagation (BP) achieves state-of-the-art accuracy but requires global error propagation and full parameterization, leading to substantial memory and computational overhead. Direct Feedback Alignment (DFA) enables local, parallelizable updates with lower memory requirements but is limited by unstructured feedback and poor scalability in deeper architectures, specially convolutional neural networks. To address these limitations, we propose a structured local learning framework that operates directly on low-rank manifolds defined by the Singular Value Decomposition (SVD) of weight matrices. Each layer is trained in its decomposed form, with updates applied to the SVD components using a composite loss that integrates cross-entropy, subspace alignment, and orthogonality regularization. Feedback matrices are constructed to match the SVD structure, ensuring consistent alignment between forward and feedback pathways. Our method reduces the number of trainable parameters relative to the original DFA model, without relying on pruning or post hoc compression. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method achieves accuracy comparable to that of BP. Ablation studies confirm the importance of each loss term in the low-rank setting. These results establish local learning on low-rank manifolds as a principled and scalable alternative to full-rank gradient-based training.
☆ Generalized Sobolev IPM for Graph-Based Measures
We study the Sobolev IPM problem for measures supported on a graph metric space, where critic function is constrained to lie within the unit ball defined by Sobolev norm. While Le et al. (2025) achieved scalable computation by relating Sobolev norm to weighted $L^p$-norm, the resulting framework remains intrinsically bound to $L^p$ geometric structure, limiting its ability to incorporate alternative structural priors beyond the $L^p$ geometry paradigm. To overcome this limitation, we propose to generalize Sobolev IPM through the lens of \emph{Orlicz geometric structure}, which employs convex functions to capture nuanced geometric relationships, building upon recent advances in optimal transport theory -- particularly Orlicz-Wasserstein (OW) and generalized Sobolev transport -- that have proven instrumental in advancing machine learning methodologies. This generalization encompasses classical Sobolev IPM as a special case while accommodating diverse geometric priors beyond traditional $L^p$ structure. It however brings up significant computational hurdles that compound those already inherent in Sobolev IPM. To address these challenges, we establish a novel theoretical connection between Orlicz-Sobolev norm and Musielak norm which facilitates a novel regularization for the generalized Sobolev IPM (GSI). By further exploiting the underlying graph structure, we show that GSI with Musielak regularization (GSI-M) reduces to a simple \emph{univariate optimization} problem, achieving remarkably computational efficiency. Empirically, GSI-M is several-order faster than the popular OW in computation, and demonstrates its practical advantages in comparing probability measures on a given graph for document classification and several tasks in topological data analysis.
☆ Learning-Augmented Online Bidding in Stochastic Settings
Online bidding is a classic optimization problem, with several applications in online decision-making, the design of interruptible systems, and the analysis of approximation algorithms. In this work, we study online bidding under learning-augmented settings that incorporate stochasticity, in either the prediction oracle or the algorithm itself. In the first part, we study bidding under distributional predictions, and find Pareto-optimal algorithms that offer the best-possible tradeoff between the consistency and the robustness of the algorithm. In the second part, we study the power and limitations of randomized bidding algorithms, by presenting upper and lower bounds on the consistency/robustness tradeoffs. Previous works focused predominantly on oracles that do not leverage stochastic information on the quality of the prediction, and deterministic algorithms.
☆ Monitoring the calibration of probability forecasts with an application to concept drift detection involving image classification
Machine learning approaches for image classification have led to impressive advances in that field. For example, convolutional neural networks are able to achieve remarkable image classification accuracy across a wide range of applications in industry, defense, and other areas. While these machine learning models boast impressive accuracy, a related concern is how to assess and maintain calibration in the predictions these models make. A classification model is said to be well calibrated if its predicted probabilities correspond with the rates events actually occur. While there are many available methods to assess machine learning calibration and recalibrate faulty predictions, less effort has been spent on developing approaches that continually monitor predictive models for potential loss of calibration as time passes. We propose a cumulative sum-based approach with dynamic limits that enable detection of miscalibration in both traditional process monitoring and concept drift applications. This enables early detection of operational context changes that impact image classification performance in the field. The proposed chart can be used broadly in any situation where the user needs to monitor probability predictions over time for potential lapses in calibration. Importantly, our method operates on probability predictions and event outcomes and does not require under-the-hood access to the machine learning model.
☆ Perturbation Bounds for Low-Rank Inverse Approximations under Noise NeurIPS 2025
Low-rank pseudoinverses are widely used to approximate matrix inverses in scalable machine learning, optimization, and scientific computing. However, real-world matrices are often observed with noise, arising from sampling, sketching, and quantization. The spectral-norm robustness of low-rank inverse approximations remains poorly understood. We systematically study the spectral-norm error $\| (\tilde{A}^{-1})_p - A_p^{-1} \|$ for an $n\times n$ symmetric matrix $A$, where $A_p^{-1}$ denotes the best rank-\(p\) approximation of $A^{-1}$, and $\tilde{A} = A + E$ is a noisy observation. Under mild assumptions on the noise, we derive sharp non-asymptotic perturbation bounds that reveal how the error scales with the eigengap, spectral decay, and noise alignment with low-curvature directions of $A$. Our analysis introduces a novel application of contour integral techniques to the \emph{non-entire} function $f(z) = 1/z$, yielding bounds that improve over naive adaptations of classical full-inverse bounds by up to a factor of $\sqrt{n}$. Empirically, our bounds closely track the true perturbation error across a variety of real-world and synthetic matrices, while estimates based on classical results tend to significantly overpredict. These findings offer practical, spectrum-aware guarantees for low-rank inverse approximations in noisy computational environments.
comment: NeurIPS 2025
☆ A Framework for Bounding Deterministic Risk with PAC-Bayes: Applications to Majority Votes
PAC-Bayes is a popular and efficient framework for obtaining generalization guarantees in situations involving uncountable hypothesis spaces. Unfortunately, in its classical formulation, it only provides guarantees on the expected risk of a randomly sampled hypothesis. This requires stochastic predictions at test time, making PAC-Bayes unusable in many practical situations where a single deterministic hypothesis must be deployed. We propose a unified framework to extract guarantees holding for a single hypothesis from stochastic PAC-Bayesian guarantees. We present a general oracle bound and derive from it a numerical bound and a specialization to majority vote. We empirically show that our approach consistently outperforms popular baselines (by up to a factor of 2) when it comes to generalization bounds on deterministic classifiers.
☆ PitchFlower: A flow-based neural audio codec with pitch controllability
We present PitchFlower, a flow-based neural audio codec with explicit pitch controllability. Our approach enforces disentanglement through a simple perturbation: during training, F0 contours are flattened and randomly shifted, while the true F0 is provided as conditioning. A vector-quantization bottleneck prevents pitch recovery, and a flow-based decoder generates high quality audio. Experiments show that PitchFlower achieves more accurate pitch control than WORLD at much higher audio quality, and outperforms SiFiGAN in controllability while maintaining comparable quality. Beyond pitch, this framework provides a simple and extensible path toward disentangling other speech attributes.
comment: 5 pages, 5 figures
☆ Leveraging an Atmospheric Foundational Model for Subregional Sea Surface Temperature Forecasting
The accurate prediction of oceanographic variables is crucial for understanding climate change, managing marine resources, and optimizing maritime activities. Traditional ocean forecasting relies on numerical models; however, these approaches face limitations in terms of computational cost and scalability. In this study, we adapt Aurora, a foundational deep learning model originally designed for atmospheric forecasting, to predict sea surface temperature (SST) in the Canary Upwelling System. By fine-tuning this model with high-resolution oceanographic reanalysis data, we demonstrate its ability to capture complex spatiotemporal patterns while reducing computational demands. Our methodology involves a staged fine-tuning process, incorporating latitude-weighted error metrics and optimizing hyperparameters for efficient learning. The experimental results show that the model achieves a low RMSE of 0.119K, maintaining high anomaly correlation coefficients (ACC $\approx 0.997$). The model successfully reproduces large-scale SST structures but faces challenges in capturing finer details in coastal regions. This work contributes to the field of data-driven ocean forecasting by demonstrating the feasibility of using deep learning models pre-trained in different domains for oceanic applications. Future improvements include integrating additional oceanographic variables, increasing spatial resolution, and exploring physics-informed neural networks to enhance interpretability and understanding. These advancements can improve climate modeling and ocean prediction accuracy, supporting decision-making in environmental and economic sectors.
comment: 18 pages, 9 figures
☆ Hybrid Quantum-Classical Recurrent Neural Networks
We present a hybrid quantum-classical recurrent neural network (QRNN) architecture in which the entire recurrent core is realized as a parametrized quantum circuit (PQC) controlled by a classical feedforward network. The hidden state is the quantum state of an $n$-qubit PQC, residing in an exponentially large Hilbert space $\mathbb{C}^{2^n}$. The PQC is unitary by construction, making the hidden-state evolution norm-preserving without external constraints. At each timestep, mid-circuit readouts are combined with the input embedding and processed by the feedforward network, which provides explicit classical nonlinearity. The outputs parametrize the PQC, which updates the hidden state via unitary dynamics. The QRNN is compact and physically consistent, and it unifies (i) unitary recurrence as a high-capacity memory, (ii) partial observation via mid-circuit measurements, and (iii) nonlinear classical control for input-conditioned parametrization. We evaluate the model in simulation with up to 14 qubits on sentiment analysis, MNIST, permuted MNIST, copying memory, and language modeling, adopting projective measurements as a limiting case to obtain mid-circuit readouts while maintaining a coherent recurrent quantum memory. We further devise a soft attention mechanism over the mid-circuit readouts in a sequence-to-sequence model and show its effectiveness for machine translation. To our knowledge, this is the first model (RNN or otherwise) grounded in quantum operations to achieve competitive performance against strong classical baselines across a broad class of sequence-learning tasks.
☆ Robust variable selection for spatial point processes observed with noise
We propose a method for variable selection in the intensity function of spatial point processes that combines sparsity-promoting estimation with noise-robust model selection. As high-resolution spatial data becomes increasingly available through remote sensing and automated image analysis, identifying spatial covariates that influence the localization of events is crucial to understand the underlying mechanism. However, results from automated acquisition techniques are often noisy, for example due to measurement uncertainties or detection errors, which leads to spurious displacements and missed events. We study the impact of such noise on sparse point-process estimation across different models, including Poisson and Thomas processes. To improve noise robustness, we propose to use stability selection based on point-process subsampling and to incorporate a non-convex best-subset penalty to enhance model-selection performance. In extensive simulations, we demonstrate that such an approach reliably recovers true covariates under diverse noise scenarios and improves both selection accuracy and stability. We then apply the proposed method to a forestry data set, analyzing the distribution of trees in relation to elevation and soil nutrients in a tropical rain forest. This shows the practical utility of the method, which provides a systematic framework for robust variable selection in spatial point-process models under noise, without requiring additional knowledge of the process.
☆ Error Bounds and Optimal Schedules for Masked Diffusions with Factorized Approximations
Recently proposed generative models for discrete data, such as Masked Diffusion Models (MDMs), exploit conditional independence approximations to reduce the computational cost of popular Auto-Regressive Models (ARMs), at the price of some bias in the sampling distribution. We study the resulting computation-vs-accuracy trade-off, providing general error bounds (in relative entropy) that depend only on the average number of tokens generated per iteration and are independent of the data dimensionality (i.e. sequence length), thus supporting the empirical success of MDMs. We then investigate the gain obtained by using non-constant schedule sizes (i.e. varying the number of unmasked tokens during the generation process) and identify the optimal schedule as a function of a so-called information profile of the data distribution, thus allowing for a principled optimization of schedule sizes. We define methods directly as sampling algorithms and do not use classical derivations as time-reversed diffusion processes, leading us to simple and transparent proofs.
Transformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information
Uncovering hidden graph structures underlying real-world data is a critical challenge with broad applications across scientific domains. Recently, transformer-based models leveraging the attention mechanism have demonstrated strong empirical success in capturing complex dependencies within graphs. However, the theoretical understanding of their training dynamics has been limited to tree-like graphs, where each node depends on a single parent. Extending provable guarantees to more general directed acyclic graphs (DAGs) -- which involve multiple parents per node -- remains challenging, primarily due to the difficulty in designing training objectives that enable different attention heads to separately learn multiple different parent relationships. In this work, we address this problem by introducing a novel information-theoretic metric: the kernel-guided mutual information (KG-MI), based on the $f$-divergence. Our objective combines KG-MI with a multi-head attention framework, where each head is associated with a distinct marginal transition kernel to model diverse parent-child dependencies effectively. We prove that, given sequences generated by a $K$-parent DAG, training a single-layer, multi-head transformer via gradient ascent converges to the global optimum in polynomial time. Furthermore, we characterize the attention score patterns at convergence. In addition, when particularizing the $f$-divergence to the KL divergence, the learned attention scores accurately reflect the ground-truth adjacency matrix, thereby provably recovering the underlying graph structure. Experimental results validate our theoretical findings.
☆ Using latent representations to link disjoint longitudinal data for mixed-effects regression
Many rare diseases offer limited established treatment options, leading patients to switch therapies when new medications emerge. To analyze the impact of such treatment switches within the low sample size limitations of rare disease trials, it is important to use all available data sources. This, however, is complicated when usage of measurement instruments change during the observation period, for example when instruments are adapted to specific age ranges. The resulting disjoint longitudinal data trajectories, complicate the application of traditional modeling approaches like mixed-effects regression. We tackle this by mapping observations of each instrument to a aligned low-dimensional temporal trajectory, enabling longitudinal modeling across instruments. Specifically, we employ a set of variational autoencoder architectures to embed item values into a shared latent space for each time point. Temporal disease dynamics and treatment switch effects are then captured through a mixed-effects regression model applied to latent representations. To enable statistical inference, we present a novel statistical testing approach that accounts for the joint parameter estimation of mixed-effects regression and variational autoencoders. The methodology is applied to quantify the impact of treatment switches for patients with spinal muscular atrophy. Here, our approach aligns motor performance items from different measurement instruments for mixed-effects regression and maps estimated effects back to the observed item level to quantify the treatment switch effect. Our approach allows for model selection as well as for assessing effects of treatment switching. The results highlight the potential of modeling in joint latent representations for addressing small data challenges.
comment: 31 pages, 3 figures, 3 tables
☆ Convergence of off-policy TD(0) with linear function approximation for reversible Markov chains
We study the convergence of off-policy TD(0) with linear function approximation when used to approximate the expected discounted reward in a Markov chain. It is well known that the combination of off-policy learning and function approximation can lead to divergence of the algorithm. Existing results for this setting modify the algorithm, for instance by reweighing the updates using importance sampling. This establishes convergence at the expense of additional complexity. In contrast, our approach is to analyse the standard algorithm, but to restrict our attention to the class of reversible Markov chains. We demonstrate convergence under this mild reversibility condition on the structure of the chain, which in many applications can be assumed using domain knowledge. In particular, we establish a convergence guarantee under an upper bound on the discount factor in terms of the difference between the on-policy and off-policy process. This improves upon known results in the literature that state that convergence holds for a sufficiently small discount factor by establishing an explicit bound. Convergence is with probability one and achieves projected Bellman error equal to zero. To obtain these results, we adapt the stochastic approximation framework that was used by Tsitsiklis and Van Roy [1997 for the on-policy case, to the off-policy case. We illustrate our results using different types of reversible Markov chains, such as one-dimensional random walks and random walks on a weighted graph.
☆ FaCT: Faithful Concept Traces for Explaining Neural Network Decisions NeurIPS 2025
Deep networks have shown remarkable performance across a wide range of tasks, yet getting a global concept-level understanding of how they function remains a key challenge. Many post-hoc concept-based approaches have been introduced to understand their workings, yet they are not always faithful to the model. Further, they make restrictive assumptions on the concepts a model learns, such as class-specificity, small spatial extent, or alignment to human expectations. In this work, we put emphasis on the faithfulness of such concept-based explanations and propose a new model with model-inherent mechanistic concept-explanations. Our concepts are shared across classes and, from any layer, their contribution to the logit and their input-visualization can be faithfully traced. We also leverage foundation models to propose a new concept-consistency metric, C$^2$-Score, that can be used to evaluate concept-based methods. We show that, compared to prior work, our concepts are quantitatively more consistent and users find our concepts to be more interpretable, all while retaining competitive ImageNet performance.
comment: Accepted to NeurIPS 2025; Code is available at https://github.com/m-parchami/FaCT
☆ Support Vector Machine-Based Burnout Risk Prediction with an Interactive Interface for Organizational Use
Burnout is a psychological syndrome marked by emotional exhaustion, depersonalization, and reduced personal accomplishment, with a significant impact on individual well-being and organizational performance. This study proposes a machine learning approach to predict burnout risk using the HackerEarth Employee Burnout Challenge dataset. Three supervised algorithms were evaluated: nearest neighbors (KNN), random forest, and support vector machine (SVM), with model performance evaluated through 30-fold cross-validation using the determination coefficient (R2). Among the models tested, SVM achieved the highest predictive performance (R2 = 0.84) and was statistically superior to KNN and Random Forest based on paired $t$-tests. To ensure practical applicability, an interactive interface was developed using Streamlit, allowing non-technical users to input data and receive burnout risk predictions. The results highlight the potential of machine learning to support early detection of burnout and promote data-driven mental health strategies in organizational settings.
comment: 12 pages, including figures and references. Streamlit app available at: https://employee-burnout-svm.streamlit.app/
☆ TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting
Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on challenging benchmarks. This paper presents TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, eliminating the need for windowing or summarization techniques while maintaining robust temporal state-tracking. Our comprehensive synthetic data pipeline unifies diverse generators, including stochastic differential equations, Gaussian processes, and audio synthesis, with novel augmentations. In zero-shot evaluations on the Gift-Eval benchmark, TempoPFN achieves top-tier competitive performance, outperforming all existing synthetic-only approaches and surpassing the vast majority of models trained on real-world data, while being more efficient than existing baselines by leveraging fully parallelizable training and inference. We open-source our complete data generation pipeline and training code, providing a reproducible foundation for future research.
comment: 30 pages, 18 figures, 13 tables
☆ Right for the Right Reasons: Avoiding Reasoning Shortcuts via Prototypical Neurosymbolic AI
Neurosymbolic AI is growing in popularity thanks to its ability to combine neural perception and symbolic reasoning in end-to-end trainable models. However, recent findings reveal these are prone to shortcut reasoning, i.e., to learning unindented concepts--or neural predicates--which exploit spurious correlations to satisfy the symbolic constraints. In this paper, we address reasoning shortcuts at their root cause and we introduce prototypical neurosymbolic architectures. These models are able to satisfy the symbolic constraints (be right) because they have learnt the correct basic concepts (for the right reasons) and not because of spurious correlations, even in extremely low data regimes. Leveraging the theory of prototypical learning, we demonstrate that we can effectively avoid reasoning shortcuts by training the models to satisfy the background knowledge while taking into account the similarity of the input with respect to the handful of labelled datapoints. We extensively validate our approach on the recently proposed rsbench benchmark suite in a variety of settings and tasks with very scarce supervision: we show significant improvements in learning the right concepts both in synthetic tasks (MNIST-EvenOdd and Kand-Logic) and real-world, high-stake ones (BDD-OIA). Our findings pave the way to prototype grounding as an effective, annotation-efficient strategy for safe and reliable neurosymbolic learning.
☆ Gradient-Weight Alignment as a Train-Time Proxy for Generalization in Classification Tasks NeurIPS 2025
Robust validation metrics remain essential in contemporary deep learning, not only to detect overfitting and poor generalization, but also to monitor training dynamics. In the supervised classification setting, we investigate whether interactions between training data and model weights can yield such a metric that both tracks generalization during training and attributes performance to individual training samples. We introduce Gradient-Weight Alignment (GWA), quantifying the coherence between per-sample gradients and model weights. We show that effective learning corresponds to coherent alignment, while misalignment indicates deteriorating generalization. GWA is efficiently computable during training and reflects both sample-specific contributions and dataset-wide learning dynamics. Extensive experiments show that GWA accurately predicts optimal early stopping, enables principled model comparisons, and identifies influential training samples, providing a validation-set-free approach for model analysis directly from the training data.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ An In-Depth Analysis of Cyber Attacks in Secured Platforms
There is an increase in global malware threats. To address this, an encryption-type ransomware has been introduced on the Android operating system. The challenges associated with malicious threats in phone use have become a pressing issue in mobile communication, disrupting user experiences and posing significant privacy threats. This study surveys commonly used machine learning techniques for detecting malicious threats in phones and examines their performance. The majority of past research focuses on customer feedback and reviews, with concerns that people might create false reviews to promote or devalue products and services for personal gain. Hence, the development of techniques for detecting malicious threats using machine learning has been a key focus. This paper presents a comprehensive comparative study of current research on the issue of malicious threats and methods for tackling these challenges. Nevertheless, a huge amount of information is required by these methods, presenting a challenge for developing robust, specialized automated anti-malware systems. This research describes the Android Applications dataset, and the accuracy of the techniques is measured using the accuracy levels of the metrics employed in this study.
☆ Scalable Utility-Aware Multiclass Calibration
Ensuring that classifiers are well-calibrated, i.e., their predictions align with observed frequencies, is a minimal and fundamental requirement for classifiers to be viewed as trustworthy. Existing methods for assessing multiclass calibration often focus on specific aspects associated with prediction (e.g., top-class confidence, class-wise calibration) or utilize computationally challenging variational formulations. In this work, we study scalable \emph{evaluation} of multiclass calibration. To this end, we propose utility calibration, a general framework that measures the calibration error relative to a specific utility function that encapsulates the goals or decision criteria relevant to the end user. We demonstrate how this framework can unify and re-interpret several existing calibration metrics, particularly allowing for more robust versions of the top-class and class-wise calibration metrics, and, going beyond such binarized approaches, toward assessing calibration for richer classes of downstream utilities.
☆ Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions
Agentic AI represents a transformative shift in artificial intelligence, but its rapid advancement has led to a fragmented understanding, often conflating modern neural systems with outdated symbolic models -- a practice known as conceptual retrofitting. This survey cuts through this confusion by introducing a novel dual-paradigm framework that categorizes agentic systems into two distinct lineages: the Symbolic/Classical (relying on algorithmic planning and persistent state) and the Neural/Generative (leveraging stochastic generation and prompt-driven orchestration). Through a systematic PRISMA-based review of 90 studies (2018--2025), we provide a comprehensive analysis structured around this framework across three dimensions: (1) the theoretical foundations and architectural principles defining each paradigm; (2) domain-specific implementations in healthcare, finance, and robotics, demonstrating how application constraints dictate paradigm selection; and (3) paradigm-specific ethical and governance challenges, revealing divergent risks and mitigation strategies. Our analysis reveals that the choice of paradigm is strategic: symbolic systems dominate safety-critical domains (e.g., healthcare), while neural systems prevail in adaptive, data-rich environments (e.g., finance). Furthermore, we identify critical research gaps, including a significant deficit in governance models for symbolic systems and a pressing need for hybrid neuro-symbolic architectures. The findings culminate in a strategic roadmap arguing that the future of Agentic AI lies not in the dominance of one paradigm, but in their intentional integration to create systems that are both adaptable and reliable. This work provides the essential conceptual toolkit to guide future research, development, and policy toward robust and trustworthy hybrid intelligent systems.
GPTOpt: Towards Efficient LLM-Based Black-Box Optimization
Global optimization of expensive, derivative-free black-box functions demands extreme sample efficiency. Classical methods such as Bayesian Optimization (BO) can be effective, but they often require careful parameter tuning to each application domain. At the same time, Large Language Models (LLMs) have shown broad capabilities, yet state-of-the-art models remain limited in solving continuous black-box optimization tasks. We introduce GPTOpt, an LLM-based optimization method that equips LLMs with continuous black-box optimization capabilities. By fine-tuning large language models on extensive synthetic datasets derived from diverse BO parameterizations, GPTOpt leverages LLM pre-training to generalize across optimization tasks. On a variety of black-box optimization benchmarks, GPTOpt surpasses traditional optimizers, highlighting the capacity of LLMs for advanced numerical reasoning and introducing a flexible framework for global optimization without parameter tuning.
☆ A Deep Learning Framework for Multi-Operator Learning: Architectures and Approximation Theory
While many problems in machine learning focus on learning mappings between finite-dimensional spaces, scientific applications require approximating mappings between function spaces, i.e., operators. We study the problem of learning collections of operators and provide both theoretical and empirical advances. We distinguish between two regimes: (i) multiple operator learning, where a single network represents a continuum of operators parameterized by a parametric function, and (ii) learning several distinct single operators, where each operator is learned independently. For the multiple operator case, we introduce two new architectures, $\mathrm{MNO}$ and $\mathrm{MONet}$, and establish universal approximation results in three settings: continuous, integrable, or Lipschitz operators. For the latter, we further derive explicit scaling laws that quantify how the network size must grow to achieve a target approximation accuracy. For learning several single operators, we develop a framework for balancing architectural complexity across subnetworks and show how approximation order determines computational efficiency. Empirical experiments on parametric PDE benchmarks confirm the strong expressive power and efficiency of the proposed architectures. Overall, this work establishes a unified theoretical and practical foundation for scalable neural operator learning across multiple operators.
Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers
Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP) - based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via traditional federated averaging technique on the same. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.
☆ Position: Biology is the Challenge Physics-Informed ML Needs to Evolve
Physics-Informed Machine Learning (PIML) has successfully integrated mechanistic understanding into machine learning, particularly in domains governed by well-known physical laws. This success has motivated efforts to apply PIML to biology, a field rich in dynamical systems but shaped by different constraints. Biological modeling, however, presents unique challenges: multi-faceted and uncertain prior knowledge, heterogeneous and noisy data, partial observability, and complex, high-dimensional networks. In this position paper, we argue that these challenges should not be seen as obstacles to PIML, but as catalysts for its evolution. We propose Biology-Informed Machine Learning (BIML): a principled extension of PIML that retains its structural grounding while adapting to the practical realities of biology. Rather than replacing PIML, BIML retools its methods to operate under softer, probabilistic forms of prior knowledge. We outline four foundational pillars as a roadmap for this transition: uncertainty quantification, contextualization, constrained latent structure inference, and scalability. Foundation Models and Large Language Models will be key enablers, bridging human expertise with computational modeling. We conclude with concrete recommendations to build the BIML ecosystem and channel PIML-inspired innovation toward challenges of high scientific and societal relevance.
☆ A Convexity-dependent Two-Phase Training Algorithm for Deep Neural Networks
The key task of machine learning is to minimize the loss function that measures the model fit to the training data. The numerical methods to do this efficiently depend on the properties of the loss function. The most decisive among these properties is the convexity or non-convexity of the loss function. The fact that the loss function can have, and frequently has, non-convex regions has led to a widespread commitment to non-convex methods such as Adam. However, a local minimum implies that, in some environment around it, the function is convex. In this environment, second-order minimizing methods such as the Conjugate Gradient (CG) give a guaranteed superlinear convergence. We propose a novel framework grounded in the hypothesis that loss functions in real-world tasks swap from initial non-convexity to convexity towards the optimum. This is a property we leverage to design an innovative two-phase optimization algorithm. The presented algorithm detects the swap point by observing the gradient norm dependence on the loss. In these regions, non-convex (Adam) and convex (CG) algorithms are used, respectively. Computing experiments confirm the hypothesis that this simple convexity structure is frequent enough to be practically exploited to substantially improve convergence and accuracy.
comment: Appeared on KDIR IC3K Conference 2025 (Best Paper Award)
☆ Parameter Averaging in Link Prediction
Ensemble methods are widely employed to improve generalization in machine learning. This has also prompted the adoption of ensemble learning for the knowledge graph embedding (KGE) models in performing link prediction. Typical approaches to this end train multiple models as part of the ensemble, and the diverse predictions are then averaged. However, this approach has some significant drawbacks. For instance, the computational overhead of training multiple models increases latency and memory overhead. In contrast, model merging approaches offer a promising alternative that does not require training multiple models. In this work, we introduce model merging, specifically weighted averaging, in KGE models. Herein, a running average of model parameters from a training epoch onward is maintained and used for predictions. To address this, we additionally propose an approach that selectively updates the running average of the ensemble model parameters only when the generalization performance improves on a validation dataset. We evaluate these two different weighted averaging approaches on link prediction tasks, comparing the state-of-the-art benchmark ensemble approach. Additionally, we evaluate the weighted averaging approach considering literal-augmented KGE models and multi-hop query answering tasks as well. The results demonstrate that the proposed weighted averaging approach consistently improves performance across diverse evaluation settings.
☆ Analysis of Semi-Supervised Learning on Hypergraphs
Hypergraphs provide a natural framework for modeling higher-order interactions, yet their theoretical underpinnings in semi-supervised learning remain limited. We provide an asymptotic consistency analysis of variational learning on random geometric hypergraphs, precisely characterizing the conditions ensuring the well-posedness of hypergraph learning as well as showing convergence to a weighted $p$-Laplacian equation. Motivated by this, we propose Higher-Order Hypergraph Learning (HOHL), which regularizes via powers of Laplacians from skeleton graphs for multiscale smoothness. HOHL converges to a higher-order Sobolev seminorm. Empirically, it performs strongly on standard baselines.
☆ Beyond Leakage and Complexity: Towards Realistic and Efficient Information Cascade Prediction
Information cascade popularity prediction is a key problem in analyzing content diffusion in social networks. However, current related works suffer from three critical limitations: (1) temporal leakage in current evaluation--random cascade-based splits allow models to access future information, yielding unrealistic results; (2) feature-poor datasets that lack downstream conversion signals (e.g., likes, comments, or purchases), which limits more practical applications; (3) computational inefficiency of complex graph-based methods that require days of training for marginal gains. We systematically address these challenges from three perspectives: task setup, dataset construction, and model design. First, we propose a time-ordered splitting strategy that chronologically partitions data into consecutive windows, ensuring models are evaluated on genuine forecasting tasks without future information leakage. Second, we introduce Taoke, a large-scale e-commerce cascade dataset featuring rich promoter/product attributes and ground-truth purchase conversions--capturing the complete diffusion lifecycle from promotion to monetization. Third, we develop CasTemp, a lightweight framework that efficiently models cascade dynamics through temporal walks, Jaccard-based neighbor selection for inter-cascade dependencies, and GRU-based encoding with time-aware attention. Under leak-free evaluation, CasTemp achieves state-of-the-art performance across four datasets with orders-of-magnitude speedup. Notably, it excels at predicting second-stage popularity conversions--a practical task critical for real-world applications.
☆ 3D CT-Based Coronary Calcium Assessment: A Feature-Driven Machine Learning Framework MICCAI
Coronary artery calcium (CAC) scoring plays a crucial role in the early detection and risk stratification of coronary artery disease (CAD). In this study, we focus on non-contrast coronary computed tomography angiography (CCTA) scans, which are commonly used for early calcification detection in clinical settings. To address the challenge of limited annotated data, we propose a radiomics-based pipeline that leverages pseudo-labeling to generate training labels, thereby eliminating the need for expert-defined segmentations. Additionally, we explore the use of pretrained foundation models, specifically CT-FM and RadImageNet, to extract image features, which are then used with traditional classifiers. We compare the performance of these deep learning features with that of radiomics features. Evaluation is conducted on a clinical CCTA dataset comprising 182 patients, where individuals are classified into two groups: zero versus non-zero calcium scores. We further investigate the impact of training on non-contrast datasets versus combined contrast and non-contrast datasets, with testing performed only on non contrast scans. Results show that radiomics-based models significantly outperform CNN-derived embeddings from foundation models (achieving 84% accuracy and p<0.05), despite the unavailability of expert annotations.
comment: 11 pages, 2 Figures, MICCAI AMAI 2025 workshop, to be published in Volume 16206 of the Lecture Notes in Computer Science series
☆ MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
comment: Accepted by SenSys 2026
☆ CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices NeurIPS 2025
Normalizing flows are deep generative models that enable efficient likelihood estimation and sampling through invertible transformations. A key challenge is to design linear layers that enhance expressiveness while maintaining efficient computation of the Jacobian determinant and inverse. We introduce a novel invertible linear layer based on the product of circulant and diagonal matrices. This decomposition reduces parameter complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$ using $m$ diagonal matrices and $m-1$ circulant matrices while still approximating general linear transformations. By leveraging the Fast Fourier Transform, our approach reduces the time complexity of matrix inversion from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn\log n)$ and that of computing the log-determinant from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn)$, where $n$ is the input dimension. We build upon this layer to develop Circulant-Diagonal Flow (CDFlow), which achieves strong density estimation on natural image datasets and effectively models data with inherent periodic structure. Furthermore, CDFlow significantly accelerates key operations in normalizing flows, providing practical benefits for scalable generative modeling.
comment: Accepted at NeurIPS 2025. Camera-ready version. 10 pages, 12 figures, 2 tables
☆ Dense and Diverse Goal Coverage in Multi Goal Reinforcement Learning
Reinforcement Learning algorithms are primarily focused on learning a policy that maximizes expected return. As a result, the learned policy can exploit one or few reward sources. However, in many natural situations, it is desirable to learn a policy that induces a dispersed marginal state distribution over rewarding states, while maximizing the expected return which is typically tied to reaching a goal state. This aspect remains relatively unexplored. Existing techniques based on entropy regularization and intrinsic rewards use stochasticity for encouraging exploration to find an optimal policy which may not necessarily lead to dispersed marginal state distribution over rewarding states. Other RL algorithms which match a target distribution assume the latter to be available apriori. This may be infeasible in large scale systems where enumeration of all states is not possible and a state is determined to be a goal state only upon reaching it. We formalize the problem of maximizing the expected return while uniformly visiting the goal states as Multi Goal RL in which an oracle classifier over the state space determines the goal states. We propose a novel algorithm that learns a high-return policy mixture with marginal state distribution dispersed over the set of goal states. Our algorithm is based on optimizing a custom RL reward which is computed - based on the current policy mixture - at each iteration for a set of sampled trajectories. The latter are used via an offline RL algorithm to update the policy mixture. We prove performance guarantees for our algorithm, showing efficient convergence bounds for optimizing a natural objective which captures the expected return as well as the dispersion of the marginal state distribution over the goal states. We design and perform experiments on synthetic MDPs and standard RL environments to evaluate the effectiveness of our algorithm.
comment: 21 pages, 5 figures
☆ Hierarchical Physics-Embedded Learning for Spatiotemporal Dynamical Systems
Modeling complex spatiotemporal dynamics, particularly in far-from-equilibrium systems, remains a grand challenge in science. The governing partial differential equations (PDEs) for these systems are often intractable to derive from first principles, due to their inherent complexity, characterized by high-order derivatives and strong nonlinearities, coupled with incomplete physical knowledge. This has spurred the development of data-driven methods, yet these approaches face limitations: Purely data-driven models are often physically inconsistent and data-intensive, while existing physics-informed methods lack the structural capacity to represent complex operators or systematically integrate partial physical knowledge. Here, we propose a hierarchical physics-embedded learning framework that fundamentally advances both the forward spatiotemporal prediction and inverse discovery of physical laws from sparse and noisy data. The key innovation is a two-level architecture that mirrors the process of scientific discovery: the first level learns fundamental symbolic components of a PDE, while the second learns their governing combinations. This hierarchical decomposition not only reduces learning complexity but, more importantly, enables a structural integration of prior knowledge. Known physical laws are directly embedded into the models computational graph, guaranteeing physical consistency and improving data efficiency. By building the framework upon adaptive Fourier Neural Operators, we can effectively capture the non-local dependencies and high-order operators characteristic of dynamical systems. Additionally, by structurally decoupling known and unknown terms, the framework further enables interpretable discovery of underlying governing equations through symbolic regression, without presupposing functional forms.
☆ On the Stability of Neural Networks in Deep Learning
Deep learning has achieved remarkable success across a wide range of tasks, but its models often suffer from instability and vulnerability: small changes to the input may drastically affect predictions, while optimization can be hindered by sharp loss landscapes. This thesis addresses these issues through the unifying perspective of sensitivity analysis, which examines how neural networks respond to perturbations at both the input and parameter levels. We study Lipschitz networks as a principled way to constrain sensitivity to input perturbations, thereby improving generalization, adversarial robustness, and training stability. To complement this architectural approach, we introduce regularization techniques based on the curvature of the loss function, promoting smoother optimization landscapes and reducing sensitivity to parameter variations. Randomized smoothing is also explored as a probabilistic method for enhancing robustness at decision boundaries. By combining these perspectives, we develop a unified framework where Lipschitz continuity, randomized smoothing, and curvature regularization interact to address fundamental challenges in stability. The thesis contributes both theoretical analysis and practical methodologies, including efficient spectral norm computation, novel Lipschitz-constrained layers, and improved certification procedures.
☆ IBNorm: Information-Bottleneck Inspired Normalization for Representation Learning
Normalization is fundamental to deep learning, but existing approaches such as BatchNorm, LayerNorm, and RMSNorm are variance-centric by enforcing zero mean and unit variance, stabilizing training without controlling how representations capture task-relevant information. We propose IB-Inspired Normalization (IBNorm), a simple yet powerful family of methods grounded in the Information Bottleneck principle. IBNorm introduces bounded compression operations that encourage embeddings to preserve predictive information while suppressing nuisance variability, yielding more informative representations while retaining the stability and compatibility of standard normalization. Theoretically, we prove that IBNorm achieves a higher IB value and tighter generalization bounds than variance-centric methods. Empirically, IBNorm consistently outperforms BatchNorm, LayerNorm, and RMSNorm across large-scale language models (LLaMA, GPT-2) and vision models (ResNet, ViT), with mutual information analysis confirming superior information bottleneck behavior. Code will be released publicly.
☆ TV-Rec: Time-Variant Convolutional Filter for Sequential Recommendation NeurIPS 2025
Recently, convolutional filters have been increasingly adopted in sequential recommendation for their ability to capture local sequential patterns. However, most of these models complement convolutional filters with self-attention. This is because convolutional filters alone, generally fixed filters, struggle to capture global interactions necessary for accurate recommendation. We propose Time-Variant Convolutional Filters for Sequential Recommendation (TV-Rec), a model inspired by graph signal processing, where time-variant graph filters capture position-dependent temporal variations in user sequences. By replacing both fixed kernels and self-attention with time-variant filters, TV-Rec achieves higher expressive power and better captures complex interaction patterns in user behavior. This design not only eliminates the need for self-attention but also reduces computation while accelerating inference. Extensive experiments on six public benchmarks show that TV-Rec outperforms state-of-the-art baselines by an average of 7.49%.
comment: The 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ Scaling Up Bayesian DAG Sampling
Bayesian inference of Bayesian network structures is often performed by sampling directed acyclic graphs along an appropriately constructed Markov chain. We present two techniques to improve sampling. First, we give an efficient implementation of basic moves, which add, delete, or reverse a single arc. Second, we expedite summing over parent sets, an expensive task required for more sophisticated moves: we devise a preprocessing method to prune possible parent sets so as to approximately preserve the sums. Our empirical study shows that our techniques can yield substantial efficiency gains compared to previous methods.
☆ BSFA: Leveraging the Subspace Dichotomy to Accelerate Neural Network Training
Recent studies \citep{gur2018gradient,song2024does, wen2024understanding} highlight a fundamental dichotomy in deep learning optimization: Although parameter updates along the top eigendirections of the loss Hessian (Dom-space) capture most of the update magnitude, they often contribute minimally to loss reduction. In contrast, updates in the orthogonal component (Bulk-space) have smaller magnitudes but drive most learning progress. In this work, we further advance the understanding of this phenomenon and introduce the \textbf{Bulk-Space-Filtration-Accelerator (BSFA)}, a novel plug-and-play framework. BSFA accelerates training by differentially scaling update components projected onto these distinct subspaces, simultaneously enhancing stability by moderating updates in the dominant subspace and boosting convergence speed by amplifying those in the bulk-space. To ensure BSFA is both practical and scalable for contemporary large models, we introduce two key innovations: an efficient estimator using Principal Component Analysis (PCA) on historical updates for fast subspace estimation, and a block-wise strategy that applies this estimation on a per-parameter-block basis. These designs make BSFA computationally tractable and highly effective. We demonstrate BSFA's acceleration across various tasks, notably achieving approximately 2$\times$ speedup when pre-training LLaMA-72M on WikiText-103 and LLaMA-134M on OpenWebText compared to vanilla AdamW.
comment: 16 pages
☆ Generative Bayesian Optimization: Generative Models as Acquisition Functions
We present a general strategy for turning generative models into candidate solution samplers for batch Bayesian optimization (BO). The use of generative models for BO enables large batch scaling as generative sampling, optimization of non-continuous design spaces, and high-dimensional and combinatorial design. Inspired by the success of direct preference optimization (DPO), we show that one can train a generative model with noisy, simple utility values directly computed from observations to then form proposal distributions whose densities are proportional to the expected utility, i.e., BO's acquisition function values. Furthermore, this approach is generalizable beyond preference-based feedback to general types of reward signals and loss functions. This perspective avoids the construction of surrogate (regression or classification) models, common in previous methods that have used generative models for black-box optimization. Theoretically, we show that the generative models within the BO process approximately follow a sequence of distributions which asymptotically concentrate at the global optima under certain conditions. We also demonstrate this effect through experiments on challenging optimization problems involving large batches in high dimensions.
comment: Under review
☆ Cost-Sensitive Unbiased Risk Estimation for Multi-Class Positive-Unlabeled Learning
Positive--Unlabeled (PU) learning considers settings in which only positive and unlabeled data are available, while negatives are missing or left unlabeled. This situation is common in real applications where annotating reliable negatives is difficult or costly. Despite substantial progress in PU learning, the multi-class case (MPU) remains challenging: many existing approaches do not ensure \emph{unbiased risk estimation}, which limits performance and stability. We propose a cost-sensitive multi-class PU method based on \emph{adaptive loss weighting}. Within the empirical risk minimization framework, we assign distinct, data-dependent weights to the positive and \emph{inferred-negative} (from the unlabeled mixture) loss components so that the resulting empirical objective is an unbiased estimator of the target risk. We formalize the MPU data-generating process and establish a generalization error bound for the proposed estimator. Extensive experiments on \textbf{eight} public datasets, spanning varying class priors and numbers of classes, show consistent gains over strong baselines in both accuracy and stability.
☆ GReF: A Unified Generative Framework for Efficient Reranking via Ordered Multi-token Prediction CIKM 2025
In a multi-stage recommendation system, reranking plays a crucial role in modeling intra-list correlations among items. A key challenge lies in exploring optimal sequences within the combinatorial space of permutations. Recent research follows a two-stage (generator-evaluator) paradigm, where a generator produces multiple feasible sequences, and an evaluator selects the best one. In practice, the generator is typically implemented as an autoregressive model. However, these two-stage methods face two main challenges. First, the separation of the generator and evaluator hinders end-to-end training. Second, autoregressive generators suffer from inference efficiency. In this work, we propose a Unified Generative Efficient Reranking Framework (GReF) to address the two primary challenges. Specifically, we introduce Gen-Reranker, an autoregressive generator featuring a bidirectional encoder and a dynamic autoregressive decoder to generate causal reranking sequences. Subsequently, we pre-train Gen-Reranker on the item exposure order for high-quality parameter initialization. To eliminate the need for the evaluator while integrating sequence-level evaluation during training for end-to-end optimization, we propose post-training the model through Rerank-DPO. Moreover, for efficient autoregressive inference, we introduce ordered multi-token prediction (OMTP), which trains Gen-Reranker to simultaneously generate multiple future items while preserving their order, ensuring practical deployment in real-time recommender systems. Extensive offline experiments demonstrate that GReF outperforms state-of-the-art reranking methods while achieving latency that is nearly comparable to non-autoregressive models. Additionally, GReF has also been deployed in a real-world video app Kuaishou with over 300 million daily active users, significantly improving online recommendation quality.
comment: Accepted by CIKM 2025
☆ Selective Learning for Deep Time Series Forecasting NeurIPS 2025
Benefiting from high capacity for capturing complex temporal patterns, deep learning (DL) has significantly advanced time series forecasting (TSF). However, deep models tend to suffer from severe overfitting due to the inherent vulnerability of time series to noise and anomalies. The prevailing DL paradigm uniformly optimizes all timesteps through the MSE loss and learns those uncertain and anomalous timesteps without difference, ultimately resulting in overfitting. To address this, we propose a novel selective learning strategy for deep TSF. Specifically, selective learning screens a subset of the whole timesteps to calculate the MSE loss in optimization, guiding the model to focus on generalizable timesteps while disregarding non-generalizable ones. Our framework introduces a dual-mask mechanism to target timesteps: (1) an uncertainty mask leveraging residual entropy to filter uncertain timesteps, and (2) an anomaly mask employing residual lower bound estimation to exclude anomalous timesteps. Extensive experiments across eight real-world datasets demonstrate that selective learning can significantly improve the predictive performance for typical state-of-the-art deep models, including 37.4% MSE reduction for Informer, 8.4% for TimesNet, and 6.5% for iTransformer.
comment: Accepted by NeurIPS 2025
☆ RAVR: Reference-Answer-guided Variational Reasoning for Large Language Models
Reinforcement learning (RL) can refine the reasoning abilities of large language models (LLMs), but critically depends on a key prerequisite: the LLM can already generate high-utility reasoning paths with non-negligible probability. For tasks beyond the LLM's current competence, such reasoning path can be hard to sample, and learning risks reinforcing familiar but suboptimal reasoning. We are motivated by the insight from cognitive science that Why is this the answer is often an easier question than What is the answer, as it avoids the heavy cognitive load of open-ended exploration, opting instead for explanatory reconstruction-systematically retracing the reasoning that links a question to its answer. We show that LLMs can similarly leverage answers to derive high-quality reasoning paths. We formalize this phenomenon and prove that conditioning on answer provably increases the expected utility of sampled reasoning paths, thereby transforming intractable problems into learnable ones. Building on this insight, we introduce RAVR (Reference-Answer-guided Variational Reasoning), an end-to-end framework that uses answer-conditioned reasoning as a variational surrogate for question-only reasoning. Experiments in both general and math domains demonstrate consistent improvements over strong baselines. We further analyze the reasoning behavior and find that RAVR reduces hesitation, strengthens conclusion consolidation, and promotes problem-specific strategies in reasoning.
comment: 17 pages, 11 figures
☆ Machine Learning and CPU (Central Processing Unit) Scheduling Co-Optimization over a Network of Computing Centers
In the rapidly evolving research on artificial intelligence (AI) the demand for fast, computationally efficient, and scalable solutions has increased in recent years. The problem of optimizing the computing resources for distributed machine learning (ML) and optimization is considered in this paper. Given a set of data distributed over a network of computing-nodes/servers, the idea is to optimally assign the CPU (central processing unit) usage while simultaneously training each computing node locally via its own share of data. This formulates the problem as a co-optimization setup to (i) optimize the data processing and (ii) optimally allocate the computing resources. The information-sharing network among the nodes might be time-varying, but with balanced weights to ensure consensus-type convergence of the algorithm. The algorithm is all-time feasible, which implies that the computing resource-demand balance constraint holds at all iterations of the proposed solution. Moreover, the solution allows addressing possible log-scale quantization over the information-sharing channels to exchange log-quantized data. For some example applications, distributed support-vector-machine (SVM) and regression are considered as the ML training models. Results from perturbation theory, along with Lyapunov stability and eigen-spectrum analysis, are used to prove the convergence towards the optimal case. As compared to existing CPU scheduling solutions, the proposed algorithm improves the cost optimality gap by more than $50\%$.
comment: EAAI Journal
☆ A Study on Inference Latency for Vision Transformers on Mobile Devices
Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.
comment: To appear in Springer LNICST, volume 663, Proceedings of VALUETOOLS 2024
☆ Machine Learning Guided Optimal Transmission Switching to Mitigate Wildfire Ignition Risk
To mitigate acute wildfire ignition risks, utilities de-energize power lines in high-risk areas. The Optimal Power Shutoff (OPS) problem optimizes line energization statuses to manage wildfire ignition risks through de-energizations while reducing load shedding. OPS problems are computationally challenging Mixed-Integer Linear Programs (MILPs) that must be solved rapidly and frequently in operational settings. For a particular power system, OPS instances share a common structure with varying parameters related to wildfire risks, loads, and renewable generation. This motivates the use of Machine Learning (ML) for solving OPS problems by exploiting shared patterns across instances. In this paper, we develop an ML-guided framework that quickly produces high-quality de-energization decisions by extending existing ML-guided MILP solution methods while integrating domain knowledge on the number of energized and de-energized lines. Results on a large-scale realistic California-based synthetic test system show that the proposed ML-guided method produces high-quality solutions faster than traditional optimization methods.
☆ Conditional neural field for spatial dimension reduction of turbulence data: a comparison study
We investigate conditional neural fields (CNFs), mesh-agnostic, coordinate-based decoders conditioned on a low-dimensional latent, for spatial dimensionality reduction of turbulent flows. CNFs are benchmarked against Proper Orthogonal Decomposition and a convolutional autoencoder within a unified encoding-decoding framework and a common evaluation protocol that explicitly separates in-range (interpolative) from out-of-range (strict extrapolative) testing beyond the training horizon, with identical preprocessing, metrics, and fixed splits across all baselines. We examine three conditioning mechanisms: (i) activation-only modulation (often termed FiLM), (ii) low-rank weight and bias modulation (termed FP), and (iii) last-layer inner-product coupling, and introduce a novel domain-decomposed CNF that localizes complexities. Across representative turbulence datasets (WMLES channel inflow, DNS channel inflow, and wall pressure fluctuations over turbulent boundary layers), CNF-FP achieves the lowest training and in-range testing errors, while CNF-FiLM generalizes best for out-of-range scenarios once moderate latent capacity is available. Domain decomposition significantly improves out-of-range accuracy, especially for the more demanding datasets. The study provides a rigorous, physics-aware basis for selecting conditioning, capacity, and domain decomposition when using CNFs for turbulence compression and reconstruction.
☆ EnzyControl: Adding Functional and Substrate-Specific Control for Enzyme Backbone Generation
Designing enzyme backbones with substrate-specific functionality is a critical challenge in computational protein engineering. Current generative models excel in protein design but face limitations in binding data, substrate-specific control, and flexibility for de novo enzyme backbone generation. To address this, we introduce EnzyBind, a dataset with 11,100 experimentally validated enzyme-substrate pairs specifically curated from PDBbind. Building on this, we propose EnzyControl, a method that enables functional and substrate-specific control in enzyme backbone generation. Our approach generates enzyme backbones conditioned on MSA-annotated catalytic sites and their corresponding substrates, which are automatically extracted from curated enzyme-substrate data. At the core of EnzyControl is EnzyAdapter, a lightweight, modular component integrated into a pretrained motif-scaffolding model, allowing it to become substrate-aware. A two-stage training paradigm further refines the model's ability to generate accurate and functional enzyme structures. Experiments show that our EnzyControl achieves the best performance across structural and functional metrics on EnzyBind and EnzyBench benchmarks, with particularly notable improvements of 13\% in designability and 13\% in catalytic efficiency compared to the baseline models. The code is released at https://github.com/Vecteur-libre/EnzyControl.
☆ Lipschitz-aware Linearity Grafting for Certified Robustness
Lipschitz constant is a fundamental property in certified robustness, as smaller values imply robustness to adversarial examples when a model is confident in its prediction. However, identifying the worst-case adversarial examples is known to be an NP-complete problem. Although over-approximation methods have shown success in neural network verification to address this challenge, reducing approximation errors remains a significant obstacle. Furthermore, these approximation errors hinder the ability to obtain tight local Lipschitz constants, which are crucial for certified robustness. Originally, grafting linearity into non-linear activation functions was proposed to reduce the number of unstable neurons, enabling scalable and complete verification. However, no prior theoretical analysis has explained how linearity grafting improves certified robustness. We instead consider linearity grafting primarily as a means of eliminating approximation errors rather than reducing the number of unstable neurons, since linear functions do not require relaxation. In this paper, we provide two theoretical contributions: 1) why linearity grafting improves certified robustness through the lens of the $l_\infty$ local Lipschitz constant, and 2) grafting linearity into non-linear activation functions, the dominant source of approximation errors, yields a tighter local Lipschitz constant. Based on these theoretical contributions, we propose a Lipschitz-aware linearity grafting method that removes dominant approximation errors, which are crucial for tightening the local Lipschitz constant, thereby improving certified robustness, even without certified training. Our extensive experiments demonstrate that grafting linearity into these influential activations tightens the $l_\infty$ local Lipschitz constant and enhances certified robustness.
☆ An Analysis of Causal Effect Estimation using Outcome Invariant Data Augmentation NeurIPS 2025
The technique of data augmentation (DA) is often used in machine learning for regularization purposes to better generalize under i.i.d. settings. In this work, we present a unifying framework with topics in causal inference to make a case for the use of DA beyond just the i.i.d. setting, but for generalization across interventions as well. Specifically, we argue that when the outcome generating mechanism is invariant to our choice of DA, then such augmentations can effectively be thought of as interventions on the treatment generating mechanism itself. This can potentially help to reduce bias in causal effect estimation arising from hidden confounders. In the presence of such unobserved confounding we typically make use of instrumental variables (IVs) -- sources of treatment randomization that are conditionally independent of the outcome. However, IVs may not be as readily available as DA for many applications, which is the main motivation behind this work. By appropriately regularizing IV based estimators, we introduce the concept of IV-like (IVL) regression for mitigating confounding bias and improving predictive performance across interventions even when certain IV properties are relaxed. Finally, we cast parameterized DA as an IVL regression problem and show that when used in composition can simulate a worst-case application of such DA, further improving performance on causal estimation and generalization tasks beyond what simple DA may offer. This is shown both theoretically for the population case and via simulation experiments for the finite sample case using a simple linear example. We also present real data experiments to support our case.
comment: Accepted at NeurIPS 2025
☆ Bridging the Divide: End-to-End Sequence-Graph Learning
Many real-world datasets are both sequential and relational: each node carries an event sequence while edges encode interactions. Existing methods in sequence modeling and graph modeling often neglect one modality or the other. We argue that sequences and graphs are not separate problems but complementary facets of the same dataset, and should be learned jointly. We introduce BRIDGE, a unified end-to-end architecture that couples a sequence encoder with a GNN under a single objective, allowing gradients to flow across both modules and learning task-aligned representations. To enable fine-grained token-level message passing among neighbors, we add TOKENXATTN, a token-level cross-attention layer that passes messages between events in neighboring sequences. Across two settings, friendship prediction (Brightkite) and fraud detection (Amazon), BRIDGE consistently outperforms static GNNs, temporal graph methods, and sequence-only baselines on ranking and classification metrics.
☆ Learning Low Rank Neural Representations of Hyperbolic Wave Dynamics from Data
We present a data-driven dimensionality reduction method that is well-suited for physics-based data representing hyperbolic wave propagation. The method utilizes a specialized neural network architecture called low rank neural representation (LRNR) inside a hypernetwork framework. The architecture is motivated by theoretical results that rigorously prove the existence of efficient representations for this wave class. We illustrate through archetypal examples that such an efficient low-dimensional representation of propagating waves can be learned directly from data through a combination of deep learning techniques. We observe that a low rank tensor representation arises naturally in the trained LRNRs, and that this reveals a new decomposition of wave propagation where each decomposed mode corresponds to interpretable physical features. Furthermore, we demonstrate that the LRNR architecture enables efficient inference via a compression scheme, which is a potentially important feature when deploying LRNRs in demanding performance regimes.
comment: 41 pages, 18 figures
☆ A Unified Bilevel Model for Adversarial Learning and A Case Study
Adversarial learning has been attracting more and more attention thanks to the fast development of machine learning and artificial intelligence. However, due to the complicated structure of most machine learning models, the mechanism of adversarial attacks is not well interpreted. How to measure the effect of attack is still not quite clear. In this paper, we propose a unified bilevel model for adversarial learning. We further investigate the adversarial attack in clustering models and interpret it from data perturbation point of view. We reveal that when the data perturbation is relatively small, the clustering model is robust, whereas if it is relatively large, the clustering result changes, which leads to an attack. To measure the effect of attacks for clustering models, we analyse the well-definedness of the so-called $\delta$-measure, which can be used in the proposed bilevel model for adversarial learning of clustering models.
☆ Energy Approach from $\varepsilon$-Graph to Continuum Diffusion Model with Connectivity Functional
We derive an energy-based continuum limit for $\varepsilon$-graphs endowed with a general connectivity functional. We prove that the discrete energy and its continuum counterpart differ by at most $O(\varepsilon)$; the prefactor involves only the $W^{1,1}$-norm of the connectivity density as $\varepsilon\to0$, so the error bound remains valid even when that density has strong local fluctuations. As an application, we introduce a neural-network procedure that reconstructs the connectivity density from edge-weight data and then embeds the resulting continuum model into a brain-dynamics framework. In this setting, the usual constant diffusion coefficient is replaced by the spatially varying coefficient produced by the learned density, yielding dynamics that differ significantly from those obtained with conventional constant-diffusion models.
☆ The Neural Differential Manifold: An Architecture with Explicit Geometric Structure
This paper introduces the Neural Differential Manifold (NDM), a novel neural network architecture that explicitly incorporates geometric structure into its fundamental design. Departing from conventional Euclidean parameter spaces, the NDM re-conceptualizes a neural network as a differentiable manifold where each layer functions as a local coordinate chart, and the network parameters directly parameterize a Riemannian metric tensor at every point. The architecture is organized into three synergistic layers: a Coordinate Layer implementing smooth chart transitions via invertible transformations inspired by normalizing flows, a Geometric Layer that dynamically generates the manifold's metric through auxiliary sub-networks, and an Evolution Layer that optimizes both task performance and geometric simplicity through a dual-objective loss function. This geometric regularization penalizes excessive curvature and volume distortion, providing intrinsic regularization that enhances generalization and robustness. The framework enables natural gradient descent optimization aligned with the learned manifold geometry and offers unprecedented interpretability by endowing internal representations with clear geometric meaning. We analyze the theoretical advantages of this approach, including its potential for more efficient optimization, enhanced continual learning, and applications in scientific discovery and controllable generative modeling. While significant computational challenges remain, the Neural Differential Manifold represents a fundamental shift towards geometrically structured, interpretable, and efficient deep learning systems.
comment: 9 pages
☆ Shift is Good: Mismatched Data Mixing Improves Test Performance
We consider training and testing on mixture distributions with different training and test proportions. We show that in many settings, and in some sense generically, distribution shift can be beneficial, and test performance can improve due to mismatched training proportions, even if the components are unrelated and with no transfer between components. In a variety of scenarios, we identify the optimal training proportions and the extent to which such distribution shift can be beneficial. We show how the same analysis applies also to a compositional setting with differing distribution of component "skills'' at training and test.
☆ Learning Fair Graph Representations with Multi-view Information Bottleneck
Graph neural networks (GNNs) excel on relational data by passing messages over node features and structure, but they can amplify training data biases, propagating discriminatory attributes and structural imbalances into unfair outcomes. Many fairness methods treat bias as a single source, ignoring distinct attribute and structure effects and leading to suboptimal fairness and utility trade-offs. To overcome this challenge, we propose FairMIB, a multi-view information bottleneck framework designed to decompose graphs into feature, structural, and diffusion views for mitigating complexity biases in GNNs. Especially, the proposed FairMIB employs contrastive learning to maximize cross-view mutual information for bias-free representation learning. It further integrates multi-perspective conditional information bottleneck objectives to balance task utility and fairness by minimizing mutual information with sensitive attributes. Additionally, FairMIB introduces an inverse probability-weighted (IPW) adjacency correction in the diffusion view, which reduces the spread of bias propagation during message passing. Experiments on five real-world benchmark datasets demonstrate that FairMIB achieves state-of-the-art performance across both utility and fairness metrics.
☆ Continual Low-Rank Adapters for LLM-based Generative Recommender Systems
While large language models (LLMs) achieve strong performance in recommendation, they face challenges in continual learning as users, items, and user preferences evolve over time. Existing LoRA-based continual methods primarily focus on preserving performance on previous tasks, but this overlooks the unique nature of recommendation: the goal is not to predict past preferences, and outdated preferences can even harm performance when current interests shift significantly. To address this, we propose PESO (Proximally rEgularized Single evolving lOra, a continual adaptation method for LoRA in recommendation. PESO introduces a proximal regularizer that anchors the current adapter to its most recent frozen state, enabling the model to flexibly balance adaptation and preservation, and to better capture recent user behaviors. Theoretically, we show that this proximal design provides data-aware, direction-wise guidance in the LoRA subspace. Empirically, PESO consistently outperforms existing LoRA-based continual learning methods.
☆ BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs
Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs' performance with four prompting experiments that vary in their use of local, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Our results demonstrate that while LLMs exhibit strong surface-level coreference capabilities, especially when supplemented with domain-grounding prompts, their performance remains sensitive to long-range context and mentions ambiguity. Notably, the LLaMA 8B and 17B models show superior precision and F1 scores under entity-augmented prompting, highlighting the potential of lightweight prompt engineering for enhancing LLM utility in biomedical NLP tasks.
☆ Monopoly Deal: A Benchmark Environment for Bounded One-Sided Response Games
Card games are widely used to study sequential decision-making under uncertainty, with real-world analogues in negotiation, finance, and cybersecurity. Typically, these games fall into three categories based on the flow of control: strictly-sequential (where players alternate single actions), deterministic-response (where some actions trigger a fixed outcome), and unbounded reciprocal-response (where alternating counterplays are permitted). A less-explored but strategically rich structure exists: the bounded one-sided response. This dynamic occurs when a player's action briefly transfers control to the opponent, who must satisfy a fixed condition through one or more sequential moves before the turn resolves. We term games featuring this mechanism Bounded One-Sided Response Games (BORGs). We introduce a modified version of Monopoly Deal as a benchmark environment that specifically isolates the BORG dynamic, where a Rent action forces the opponent to sequentially choose payment assets. We demonstrate that the gold-standard algorithm, Counterfactual Regret Minimization (CFR), successfully converges on effective strategies for this domain without requiring novel algorithmic extensions. To support efficient, reproducible experimentation, we present a lightweight, full-stack research platform that unifies the environment, a parallelized CFR runtime, and a human-playable web interface, all runnable on a single workstation. This system provides a practical foundation for exploring state representation and policy learning in bounded one-sided response settings. The trained CFR agent and source code are available at https://monopolydeal.ai.
comment: 24 pages, 7 figures
☆ Training Across Reservoirs: Using Numerical Differentiation To Couple Trainable Networks With Black-Box Reservoirs
We introduce Bounded Numerical Differentiation (BOND), a perturbative method for estimating partial derivatives across network structures with inaccessible computational graphs. BOND demonstrates improved accuracy and scalability from existing perturbative methods, enabling new explorations of trainable architectures that integrate black-box functions. We observe that these black-box functions, realized in our experiments as fixed, untrained networks, can enhance model performance without increasing the number of trainable parameters. This improvement is achieved without extensive optimization of the architecture or properties of the black-box function itself. Our findings highlight the potential of leveraging fixed, non-trainable modules to expand model capacity, suggesting a path toward combining analogue and digital devices as a mechanism for scaling networks.
comment: 12 pages main, Appendix 10 pages, 6 figures in main body, 10 overall
☆ Nonlinear Dynamics In Optimization Landscape of Shallow Neural Networks with Tunable Leaky ReLU
In this work, we study the nonlinear dynamics of a shallow neural network trained with mean-squared loss and leaky ReLU activation. Under Gaussian inputs and equal layer width k, (1) we establish, based on the equivariant gradient degree, a theoretical framework, applicable to any number of neurons k>= 4, to detect bifurcation of critical points with associated symmetries from global minimum as leaky parameter $\alpha$ varies. Typically, our analysis reveals that a multi-mode degeneracy consistently occurs at the critical number 0, independent of k. (2) As a by-product, we further show that such bifurcations are width-independent, arise only for nonnegative $\alpha$ and that the global minimum undergoes no further symmetry-breaking instability throughout the engineering regime $\alpha$ in range (0,1). An explicit example with k=5 is presented to illustrate the framework and exhibit the resulting bifurcation together with their symmetries.
☆ GAPMAP: Mapping Scientific Knowledge Gaps in Biomedical Literature Using Large Language Models
Scientific progress is driven by the deliberate articulation of what remains unknown. This study investigates the ability of large language models (LLMs) to identify research knowledge gaps in the biomedical literature. We define two categories of knowledge gaps: explicit gaps, clear declarations of missing knowledge; and implicit gaps, context-inferred missing knowledge. While prior work has focused mainly on explicit gap detection, we extend this line of research by addressing the novel task of inferring implicit gaps. We conducted two experiments on almost 1500 documents across four datasets, including a manually annotated corpus of biomedical articles. We benchmarked both closed-weight models (from OpenAI) and open-weight models (Llama and Gemma 2) under paragraph-level and full-paper settings. To address the reasoning of implicit gaps inference, we introduce \textbf{\small TABI}, a Toulmin-Abductive Bucketed Inference scheme that structures reasoning and buckets inferred conclusion candidates for validation. Our results highlight the robust capability of LLMs in identifying both explicit and implicit knowledge gaps. This is true for both open- and closed-weight models, with larger variants often performing better. This suggests a strong ability of LLMs for systematically identifying candidate knowledge gaps, which can support early-stage research formulation, policymakers, and funding decisions. We also report observed failure modes and outline directions for robust deployment, including domain adaptation, human-in-the-loop verification, and benchmarking across open- and closed-weight models.
☆ Scalable predictive processing framework for multitask caregiving robots
The rapid aging of societies is intensifying demand for autonomous care robots; however, most existing systems are task-specific and rely on handcrafted preprocessing, limiting their ability to generalize across diverse scenarios. A prevailing theory in cognitive neuroscience proposes that the human brain operates through hierarchical predictive processing, which underlies flexible cognition and behavior by integrating multimodal sensory signals. Inspired by this principle, we introduce a hierarchical multimodal recurrent neural network grounded in predictive processing under the free-energy principle, capable of directly integrating over 30,000-dimensional visuo-proprioceptive inputs without dimensionality reduction. The model was able to learn two representative caregiving tasks, rigid-body repositioning and flexible-towel wiping, without task-specific feature engineering. We demonstrate three key properties: (i) self-organization of hierarchical latent dynamics that regulate task transitions, capture variability in uncertainty, and infer occluded states; (ii) robustness to degraded vision through visuo-proprioceptive integration; and (iii) asymmetric interference in multitask learning, where the more variable wiping task had little influence on repositioning, whereas learning the repositioning task led to a modest reduction in wiping performance, while the model maintained overall robustness. Although the evaluation was limited to simulation, these results establish predictive processing as a universal and scalable computational principle, pointing toward robust, flexible, and autonomous caregiving robots while offering theoretical insight into the human brain's ability to achieve flexible adaptation in uncertain real-world environments.
☆ Breast Cancer VLMs: Clinically Practical Vision-Language Train-Inference Models ICCV 2025
Breast cancer remains the most commonly diagnosed malignancy among women in the developed world. Early detection through mammography screening plays a pivotal role in reducing mortality rates. While computer-aided diagnosis (CAD) systems have shown promise in assisting radiologists, existing approaches face critical limitations in clinical deployment - particularly in handling the nuanced interpretation of multi-modal data and feasibility due to the requirement of prior clinical history. This study introduces a novel framework that synergistically combines visual features from 2D mammograms with structured textual descriptors derived from easily accessible clinical metadata and synthesized radiological reports through innovative tokenization modules. Our proposed methods in this study demonstrate that strategic integration of convolutional neural networks (ConvNets) with language representations achieves superior performance to vision transformer-based models while handling high-resolution images and enabling practical deployment across diverse populations. By evaluating it on multi-national cohort screening mammograms, our multi-modal approach achieves superior performance in cancer detection and calcification identification compared to unimodal baselines, with particular improvements. The proposed method establishes a new paradigm for developing clinically viable VLM-based CAD systems that effectively leverage imaging data and contextual patient information through effective fusion mechanisms.
comment: Accepted to Computer Vision for Automated Medical Diagnosis (CVAMD) Workshop at ICCV 2025
☆ Dynamically Weighted Momentum with Adaptive Step Sizes for Efficient Deep Network Training
Within the current sphere of deep learning research, despite the extensive application of optimization algorithms such as Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam), there remains a pronounced inadequacy in their capability to address fluctuations in learning efficiency, meet the demands of complex models, and tackle non-convex optimization issues. These challenges primarily arise from the algorithms' limitations in handling complex data structures and models, for instance, difficulties in selecting an appropriate learning rate, avoiding local optima, and navigating through high-dimensional spaces. To address these issues, this paper introduces a novel optimization algorithm named DWMGrad. This algorithm, building on the foundations of traditional methods, incorporates a dynamic guidance mechanism reliant on historical data to dynamically update momentum and learning rates. This allows the optimizer to flexibly adjust its reliance on historical information, adapting to various training scenarios. This strategy not only enables the optimizer to better adapt to changing environments and task complexities but also, as validated through extensive experimentation, demonstrates DWMGrad's ability to achieve faster convergence rates and higher accuracies under a multitude of scenarios.
comment: 45 pages, 12 figures
☆ Conformal Prediction Beyond the Horizon: Distribution-Free Inference for Policy Evaluation
Reliable uncertainty quantification is crucial for reinforcement learning (RL) in high-stakes settings. We propose a unified conformal prediction framework for infinite-horizon policy evaluation that constructs distribution-free prediction intervals {for returns} in both on-policy and off-policy settings. Our method integrates distributional RL with conformal calibration, addressing challenges such as unobserved returns, temporal dependencies, and distributional shifts. We propose a modular pseudo-return construction based on truncated rollouts and a time-aware calibration strategy using experience replay and weighted subsampling. These innovations mitigate model bias and restore approximate exchangeability, enabling uncertainty quantification even under policy shifts. Our theoretical analysis provides coverage guarantees that account for model misspecification and importance weight estimation. Empirical results, including experiments in synthetic and benchmark environments like Mountain Car, show that our method significantly improves coverage and reliability over standard distributional RL baselines.
☆ Exploring Human-AI Conceptual Alignment through the Prism of Chess
Do AI systems truly understand human concepts or merely mimic surface patterns? We investigate this through chess, where human creativity meets precise strategic concepts. Analyzing a 270M-parameter transformer that achieves grandmaster-level play, we uncover a striking paradox: while early layers encode human concepts like center control and knight outposts with up to 85\% accuracy, deeper layers, despite driving superior performance, drift toward alien representations, dropping to 50-65\% accuracy. To test conceptual robustness beyond memorization, we introduce the first Chess960 dataset: 240 expert-annotated positions across 6 strategic concepts. When opening theory is eliminated through randomized starting positions, concept recognition drops 10-20\% across all methods, revealing the model's reliance on memorized patterns rather than abstract understanding. Our layer-wise analysis exposes a fundamental tension in current architectures: the representations that win games diverge from those that align with human thinking. These findings suggest that as AI systems optimize for performance, they develop increasingly alien intelligence, a critical challenge for creative AI applications requiring genuine human-AI collaboration. Dataset and code are available at: https://github.com/slomasov/ChessConceptsLLM.
☆ PORTool: Tool-Use LLM Training with Rewarded Tree
Current tool-use large language models (LLMs) are trained on static datasets, enabling them to interact with external tools and perform multi-step, tool-integrated reasoning, which produces tool-call trajectories. However, these models imitate how a query is resolved in a generic tool-call routine, thereby failing to explore possible solutions and demonstrating limited performance in an evolved, dynamic tool-call environment. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories yielding the correct answer. Specifically, this method starts with generating multiple rollouts for a given query, and some of them share the first few tool-call steps, thereby forming a tree-like structure. Next, we assign rewards to each step, based on its ability to produce a correct answer and make successful tool calls. A shared step across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to calculate fork-relative advantages, blended with trajectory-relative advantages, to train the LLM for tool use. The experiments utilize 17 tools to address user queries, covering both time-sensitive and time-invariant topics. We conduct ablation studies to systematically justify the necessity and the design robustness of step-wise rewards. Furthermore, we compare the proposed PORTool with other training approaches and demonstrate significant improvements in final accuracy and the number of tool-call steps.
☆ Dual Mixture-of-Experts Framework for Discrete-Time Survival Analysis NeurIPS 2025
Survival analysis is a task to model the time until an event of interest occurs, widely used in clinical and biomedical research. A key challenge is to model patient heterogeneity while also adapting risk predictions to both individual characteristics and temporal dynamics. We propose a dual mixture-of-experts (MoE) framework for discrete-time survival analysis. Our approach combines a feature-encoder MoE for subgroup-aware representation learning with a hazard MoE that leverages patient features and time embeddings to capture temporal dynamics. This dual-MoE design flexibly integrates with existing deep learning based survival pipelines. On METABRIC and GBSG breast cancer datasets, our method consistently improves performance, boosting the time-dependent C-index up to 0.04 on the test sets, and yields further gains when incorporated into the Consurv framework.
comment: Accepted to NeurIPS 2025 workshop Learning from Time Series for Health (TS4H)
☆ Detecting Anomalies in Machine Learning Infrastructure via Hardware Telemetry
Modern machine learning (ML) has grown into a tightly coupled, full-stack ecosystem that combines hardware, software, network, and applications. Many users rely on cloud providers for elastic, isolated, and cost-efficient resources. Unfortunately, these platforms as a service use virtualization, which means operators have little insight into the users' workloads. This hinders resource optimizations by the operator, which is essential to ensure cost efficiency and minimize execution time. In this paper, we argue that workload knowledge is unnecessary for system-level optimization. We propose System-X, which takes a \emph{hardware-centric} approach, relying only on hardware signals -- fully accessible by operators. Using low-level signals collected from the system, System-X detects anomalies through an unsupervised learning pipeline. The pipeline is developed by analyzing over 30 popular ML models on various hardware platforms, ensuring adaptability to emerging workloads and unknown deployment patterns. Using System-X, we successfully identified both network and system configuration issues, accelerating the DeepSeek model by 5.97%.
comment: 12 pages, 9 figures, submitted to nsdi 26
☆ The Quest for Reliable Metrics of Responsible AI
The development of Artificial Intelligence (AI), including AI in Science (AIS), should be done following the principles of responsible AI. Progress in responsible AI is often quantified through evaluation metrics, yet there has been less work on assessing the robustness and reliability of the metrics themselves. We reflect on prior work that examines the robustness of fairness metrics for recommender systems as a type of AI application and summarise their key takeaways into a set of non-exhaustive guidelines for developing reliable metrics of responsible AI. Our guidelines apply to a broad spectrum of AI applications, including AIS.
comment: Accepted for presentation at the AI in Science Summit 2025
☆ Infrequent Exploration in Linear Bandits NeurIPS 2025
We study the problem of infrequent exploration in linear bandits, addressing a significant yet overlooked gap between fully adaptive exploratory methods (e.g., UCB and Thompson Sampling), which explore potentially at every time step, and purely greedy approaches, which require stringent diversity assumptions to succeed. Continuous exploration can be impractical or unethical in safety-critical or costly domains, while purely greedy strategies typically fail without adequate contextual diversity. To bridge these extremes, we introduce a simple and practical framework, INFEX, explicitly designed for infrequent exploration. INFEX executes a base exploratory policy according to a given schedule while predominantly choosing greedy actions in between. Despite its simplicity, our theoretical analysis demonstrates that INFEX achieves instance-dependent regret matching standard provably efficient algorithms, provided the exploration frequency exceeds a logarithmic threshold. Additionally, INFEX is a general, modular framework that allows seamless integration of any fully adaptive exploration method, enabling wide applicability and ease of adoption. By restricting intensive exploratory computations to infrequent intervals, our approach can also enhance computational efficiency. Empirical evaluations confirm our theoretical findings, showing state-of-the-art regret performance and runtime improvements over existing methods.
comment: NeurIPS 2025 camera-ready version
☆ Efficient Online Learning with Predictive Coding Networks: Exploiting Temporal Correlations IROS
Robotic systems operating at the edge require efficient online learning algorithms that can continuously adapt to changing environments while processing streaming sensory data. Traditional backpropagation, while effective, conflicts with biological plausibility principles and may be suboptimal for continuous adaptation scenarios. The Predictive Coding (PC) framework offers a biologically plausible alternative with local, Hebbian-like update rules, making it suitable for neuromorphic hardware implementation. However, PC's main limitation is its computational overhead due to multiple inference iterations during training. We present Predictive Coding Network with Temporal Amortization (PCN-TA), which preserves latent states across temporal frames. By leveraging temporal correlations, PCN-TA significantly reduces computational demands while maintaining learning performance. Our experiments on the COIL-20 robotic perception dataset demonstrate that PCN-TA achieves 10% fewer weight updates compared to backpropagation and requires 50% fewer inference steps than baseline PC networks. These efficiency gains directly translate to reduced computational overhead for moving another step toward edge deployment and real-time adaptation support in resource-constrained robotic systems. The biologically-inspired nature of our approach also makes it a promising candidate for future neuromorphic hardware implementations, enabling efficient online learning at the edge.
comment: Accepted at EdgeAI4R Workshop, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025
☆ Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
☆ A General and Streamlined Differentiable Optimization Framework
Differentiating through constrained optimization problems is increasingly central to learning, control, and large-scale decision-making systems, yet practical integration remains challenging due to solver specialization and interface mismatches. This paper presents a general and streamlined framework-an updated DiffOpt.jl-that unifies modeling and differentiation within the Julia optimization stack. The framework computes forward - and reverse-mode solution and objective sensitivities for smooth, potentially nonconvex programs by differentiating the KKT system under standard regularity assumptions. A first-class, JuMP-native parameter-centric API allows users to declare named parameters and obtain derivatives directly with respect to them - even when a parameter appears in multiple constraints and objectives - eliminating brittle bookkeeping from coefficient-level interfaces. We illustrate these capabilities on convex and nonconvex models, including economic dispatch, mean-variance portfolio selection with conic risk constraints, and nonlinear robot inverse kinematics. Two companion studies further demonstrate impact at scale: gradient-based iterative methods for strategic bidding in energy markets and Sobolev-style training of end-to-end optimization proxies using solver-accurate sensitivities. Together, these results demonstrate that differentiable optimization can be deployed as a routine tool for experimentation, learning, calibration, and design-without deviating from standard JuMP modeling practices and while retaining access to a broad ecosystem of solvers.
comment: 17 pages, 4 figures
☆ Contrastive Predictive Coding Done Right for Mutual Information Estimation
The InfoNCE objective, originally introduced for contrastive representation learning, has become a popular choice for mutual information (MI) estimation, despite its indirect connection to MI. In this paper, we demonstrate why InfoNCE should not be regarded as a valid MI estimator, and we introduce a simple modification, which we refer to as InfoNCE-anchor, for accurate MI estimation. Our modification introduces an auxiliary anchor class, enabling consistent density ratio estimation and yielding a plug-in MI estimator with significantly reduced bias. Beyond this, we generalize our framework using proper scoring rules, which recover InfoNCE-anchor as a special case when the log score is employed. This formulation unifies a broad spectrum of contrastive objectives, including NCE, InfoNCE, and $f$-divergence variants, under a single principled framework. Empirically, we find that InfoNCE-anchor with the log score achieves the most accurate MI estimates; however, in self-supervised representation learning experiments, we find that the anchor does not improve the downstream task performance. These findings corroborate that contrastive representation learning benefits not from accurate MI estimation per se, but from the learning of structured density ratios.
comment: 26 pages, 5 figures
☆ Enabling Fast and Accurate Neutral Atom Readout through Image Denoising
Neutral atom quantum computers hold promise for scaling up to hundreds of thousands of qubits, but their progress is constrained by slow qubit readout. Measuring qubits currently takes milliseconds-much longer than the underlying quantum gate operations-making readout the primary bottleneck in deploying quantum error correction. Because each round of QEC depends on measurement, long readout times increase cycle duration and slow down program execution. Reducing the readout duration speeds up cycles and reduces decoherence errors that accumulate while qubits idle, but it also lowers the number of collected photons, making measurements noisier and more error-prone. This tradeoff leaves neutral atom systems stuck between slow but accurate readout and fast but unreliable readout. We show that image denoising can resolve this tension. Our framework, GANDALF, uses explicit denoising using image translation to reconstruct clear signals from short, low-photon measurements, enabling reliable classification at up to 1.6x shorter readout times. Combined with lightweight classifiers and a pipelined readout design, our approach both reduces logical error rate by up to 35x and overall QEC cycle time up to 1.77x compared to state-of-the-art CNN-based readout for Cesium (Cs) Neutral Atom arrays.
comment: 12 pages, 15 figures
☆ AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many realworld workloads such as classification, question answering, recommendation, and text embedding rely solely on the prefill stage of inference, where the model encodes input sequences without performing autoregressive decoding. In these prefill only scenarios, the self-attention computation becomes the primary performance bottleneck due to its quadratic complexity with respect to sequence length. In this paper, we observe that semantically different sentences often produce similar attention maps across layers and heads. Building on this insight, we propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. Based on an attention map memorization database, AttnCache employs efficient caching and similarity search techniques to identify and reuse pre-cached attention maps during inference, thereby reducing the computational overhead of self-attention. Experimental results show that AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.
comment: 10 pages, 6 figures, submitted to Ninth Annual Conference on Machine Learning and Systems (MLSys'26)
☆ Risks and Opportunities in Human-Machine Teaming in Operationalizing Machine Learning Target Variables
Predictive modeling has the potential to enhance human decision-making. However, many predictive models fail in practice due to problematic problem formulation in cases where the prediction target is an abstract concept or construct and practitioners need to define an appropriate target variable as a proxy to operationalize the construct of interest. The choice of an appropriate proxy target variable is rarely self-evident in practice, requiring both domain knowledge and iterative data modeling. This process is inherently collaborative, involving both domain experts and data scientists. In this work, we explore how human-machine teaming can support this process by accelerating iterations while preserving human judgment. We study the impact of two human-machine teaming strategies on proxy construction: 1) relevance-first: humans leading the process by selecting relevant proxies, and 2) performance-first: machines leading the process by recommending proxies based on predictive performance. Based on a controlled user study of a proxy construction task (N = 20), we show that the performance-first strategy facilitated faster iterations and decision-making, but also biased users towards well-performing proxies that are misaligned with the application goal. Our study highlights the opportunities and risks of human-machine teaming in operationalizing machine learning target variables, yielding insights for future research to explore the opportunities and mitigate the risks.
comment: 23 pages, 6 figures
☆ On the Dataless Training of Neural Networks
This paper surveys studies on the use of neural networks for optimization in the training-data-free setting. Specifically, we examine the dataless application of neural network architectures in optimization by re-parameterizing problems using fully connected (or MLP), convolutional, graph, and quadratic neural networks. Although MLPs have been used to solve linear programs a few decades ago, this approach has recently gained increasing attention due to its promising results across diverse applications, including those based on combinatorial optimization, inverse problems, and partial differential equations. The motivation for this setting stems from two key (possibly over-lapping) factors: (i) data-driven learning approaches are still underdeveloped and have yet to demonstrate strong results, as seen in combinatorial optimization, and (ii) the availability of training data is inherently limited, such as in medical image reconstruction and other scientific applications. In this paper, we define the dataless setting and categorize it into two variants based on how a problem instance -- defined by a single datum -- is encoded onto the neural network: (i) architecture-agnostic methods and (ii) architecture-specific methods. Additionally, we discuss similarities and clarify distinctions between the dataless neural network (dNN) settings and related concepts such as zero-shot learning, one-shot learning, lifting in optimization, and over-parameterization.
☆ Application and Validation of Geospatial Foundation Model Data for the Prediction of Health Facility Programmatic Outputs -- A Case Study in Malawi
The reliability of routine health data in low and middle-income countries (LMICs) is often constrained by reporting delays and incomplete coverage, necessitating the exploration of novel data sources and analytics. Geospatial Foundation Models (GeoFMs) offer a promising avenue by synthesizing diverse spatial, temporal, and behavioral data into mathematical embeddings that can be efficiently used for downstream prediction tasks. This study evaluated the predictive performance of three GeoFM embedding sources - Google Population Dynamics Foundation Model (PDFM), Google AlphaEarth (derived from satellite imagery), and mobile phone call detail records (CDR) - for modeling 15 routine health programmatic outputs in Malawi, and compared their utility to traditional geospatial interpolation methods. We used XGBoost models on data from 552 health catchment areas (January 2021-May 2023), assessing performance with R2, and using an 80/20 training and test data split with 5-fold cross-validation used in training. While predictive performance was mixed, the embedding-based approaches improved upon baseline geostatistical methods in 13 of 15 (87%) indicators tested. A Multi-GeoFM model integrating all three embedding sources produced the most robust predictions, achieving average 5-fold cross validated R2 values for indicators like population density (0.63), new HIV cases (0.57), and child vaccinations (0.47) and test set R2 of 0.64, 0.68, and 0.55, respectively. Prediction was poor for prediction targets with low primary data availability, such as TB and malnutrition cases. These results demonstrate that GeoFM embeddings imbue a modest predictive improvement for select health and demographic outcomes in an LMIC context. We conclude that the integration of multiple GeoFM sources is an efficient and valuable tool for supplementing and strengthening constrained routine health information systems.
comment: 13 pages, 3010 words, 2 tables, 2 figures
☆ Modular Linear Tokenization (MLT)
This paper introduces Modular Linear Tokenization (MLT), a reversible and deterministic technique for encoding high-cardinality categorical identifiers into compact numerical vectors. Unlike traditional hashing or one-hot encodings, MLT preserves bijective mappings by leveraging modular arithmetic over finite fields and invertible linear transformations. The method offers explicit control of dimensionality and computational scalability while maintaining full reversibility, even for millions of identifiers. Experimental results on the MovieLens 20M dataset show that MLT achieves comparable predictive performance to supervised embeddings while requiring significantly fewer parameters and lower training cost. An open-source implementation of MLT is available on PyPI (https://pypi.org/project/light-mlt/) and GitHub (https://github.com/tcharliesschmitz/light-mlt).
☆ Revisiting Multilingual Data Mixtures in Language Model Pretraining
The impact of different multilingual data mixtures in pretraining large language models (LLMs) has been a topic of ongoing debate, often raising concerns about potential trade-offs between language coverage and model performance (i.e., the curse of multilinguality). In this work, we investigate these assumptions by training 1.1B and 3B parameter LLMs on diverse multilingual corpora, varying the number of languages from 25 to 400. Our study challenges common beliefs surrounding multilingual training. First, we find that combining English and multilingual data does not necessarily degrade the in-language performance of either group, provided that languages have a sufficient number of tokens included in the pretraining corpus. Second, we observe that using English as a pivot language (i.e., a high-resource language that serves as a catalyst for multilingual generalization) yields benefits across language families, and contrary to expectations, selecting a pivot language from within a specific family does not consistently improve performance for languages within that family. Lastly, we do not observe a significant "curse of multilinguality" as the number of training languages increases in models at this scale. Our findings suggest that multilingual data, when balanced appropriately, can enhance language model capabilities without compromising performance, even in low-resource settings
comment: Under Review
☆ InputDSA: Demixing then Comparing Recurrent and Externally Driven Dynamics
In control problems and basic scientific modeling, it is important to compare observations with dynamical simulations. For example, comparing two neural systems can shed light on the nature of emergent computations in the brain and deep neural networks. Recently, Ostrow et al. (2023) introduced Dynamical Similarity Analysis (DSA), a method to measure the similarity of two systems based on their recurrent dynamics rather than geometry or topology. However, DSA does not consider how inputs affect the dynamics, meaning that two similar systems, if driven differently, may be classified as different. Because real-world dynamical systems are rarely autonomous, it is important to account for the effects of input drive. To this end, we introduce a novel metric for comparing both intrinsic (recurrent) and input-driven dynamics, called InputDSA (iDSA). InputDSA extends the DSA framework by estimating and comparing both input and intrinsic dynamic operators using a variant of Dynamic Mode Decomposition with control (DMDc) based on subspace identification. We demonstrate that InputDSA can successfully compare partially observed, input-driven systems from noisy data. We show that when the true inputs are unknown, surrogate inputs can be substituted without a major deterioration in similarity estimates. We apply InputDSA on Recurrent Neural Networks (RNNs) trained with Deep Reinforcement Learning, identifying that high-performing networks are dynamically similar to one another, while low-performing networks are more diverse. Lastly, we apply InputDSA to neural data recorded from rats performing a cognitive task, demonstrating that it identifies a transition from input-driven evidence accumulation to intrinsically-driven decision-making. Our work demonstrates that InputDSA is a robust and efficient method for comparing intrinsic dynamics and the effect of external input on dynamical systems.
comment: 36 pages, 14 figures
☆ Robust GNN Watermarking via Implicit Perception of Topological Invariants
Graph Neural Networks (GNNs) are valuable intellectual property, yet many watermarks rely on backdoor triggers that break under common model edits and create ownership ambiguity. We present InvGNN-WM, which ties ownership to a model's implicit perception of a graph invariant, enabling trigger-free, black-box verification with negligible task impact. A lightweight head predicts normalized algebraic connectivity on an owner-private carrier set; a sign-sensitive decoder outputs bits, and a calibrated threshold controls the false-positive rate. Across diverse node and graph classification datasets and backbones, InvGNN-WM matches clean accuracy while yielding higher watermark accuracy than trigger- and compression-based baselines. It remains strong under unstructured pruning, fine-tuning, and post-training quantization; plain knowledge distillation (KD) weakens the mark, while KD with a watermark loss (KD+WM) restores it. We provide guarantees for imperceptibility and robustness, and we prove that exact removal is NP-complete.
☆ Humains-Junior: A 3.8B Language Model Achieving GPT-4o-Level Factual Accuracy by Directed Exoskeleton Reasoning
We introduce Humans-Junior, a 3.8B model that matches GPT-4o on the FACTS Grounding public subset within a $\pm 5$ pp equivalence margin. Results. On Q1--Q500 under identical judges, GPT-4o scores 73.5% (95% CI 69.5--77.2) and Humans-Junior 72.7% (95% CI 68.7--76.5); the paired difference is 0.8 pp (bootstrap 95% CI $-3.1$ to $+4.7$; permutation $p = 0.72$; Cohen's $d = 0.023$). TOST establishes equivalence at $\pm 5$ pp (not at $\pm 3$ pp). When purchased as managed APIs, Humans-Junior's base model (Phi-3.5-mini-instruct) is $\approx 19\times$ less expensive than GPT-4o on Microsoft AI Foundry pricing; self-hosted or edge deployments can drive incremental inference cost toward zero. Measured vs estimated pricing sources are tabulated in Appendix E. Method. Our approach combines minimal directed "Exoskeleton Reasoning" scaffolds with behavioral fine-tuning that teaches protocol compliance (epistemic discipline) rather than domain answers. Fine-tuning alone adds little; combined, they synergize (+17.7 pp, $p < 0.001$) and reduce variance ($\approx 25\%$). In prompt-only settings on frontier models (Q1--Q100; non-comparable), directed reasoning improved GPT-4o by +11.8 pp to 85.3% and Gemini-2.5-Pro by +5.0 pp to 93.3% (baseline 88.3%, $n = 100$); see Section~5. TL;DR. A 3.8B model achieves GPT-4o-level FACTS accuracy (equivalent within $\pm 5$ pp on Q1--Q500). Cloud pricing shows $\approx 19\times$ lower cost versus GPT-4o, and self-hosted/edge deployments can approach zero marginal cost. Pricing sources are listed in Appendix E. Frontier prompt-only gains (Q1--Q100; non-comparable) and optimized-prompt exploratory results under earlier judges are summarized in Appendix F. Keywords: Small Language Models, Factual Grounding, Directed Reasoning, Fine-Tuning, Model Alignment, Cost-Efficient AI
☆ Active Learning with Task-Driven Representations for Messy Pools
Active learning has the potential to be especially useful for messy, uncurated pools where datapoints vary in relevance to the target task. However, state-of-the-art approaches to this problem currently rely on using fixed, unsupervised representations of the pool, focusing on modifying the acquisition function instead. We show that this model setup can undermine their effectiveness at dealing with messy pools, as such representations can fail to capture important information relevant to the task. To address this, we propose using task-driven representations that are periodically updated during the active learning process using the previously collected labels. We introduce two specific strategies for learning these representations, one based on directly learning semi-supervised representations and the other based on supervised fine-tuning of an initial unsupervised representation. We find that both significantly improve empirical performance over using unsupervised or pretrained representations.
☆ Transferring Causal Effects using Proxies NeurIPS 2025
We consider the problem of estimating a causal effect in a multi-domain setting. The causal effect of interest is confounded by an unobserved confounder and can change between the different domains. We assume that we have access to a proxy of the hidden confounder and that all variables are discrete or categorical. We propose methodology to estimate the causal effect in the target domain, where we assume to observe only the proxy variable. Under these conditions, we prove identifiability (even when treatment and response variables are continuous). We introduce two estimation techniques, prove consistency, and derive confidence intervals. The theoretical results are supported by simulation studies and a real-world example studying the causal effect of website rankings on consumer choices.
comment: Advances in Neural Information Processing Systems (NeurIPS 2025) camera-ready version
☆ MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency
Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. This discarding of informative data together with the optimizing for a single reward tend to harm diversity, semantic fidelity and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but it also significantly speeds up the training. Our proposed method, called MIRO, achieves state-of-the-art performances on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).
comment: Project page: https://nicolas-dufour.github.io/miro
☆ Topology-Aware Active Learning on Graphs
We propose a graph-topological approach to active learning that directly targets the core challenge of exploration versus exploitation under scarce label budgets. To guide exploration, we introduce a coreset construction algorithm based on Balanced Forman Curvature (BFC), which selects representative initial labels that reflect the graph's cluster structure. This method includes a data-driven stopping criterion that signals when the graph has been sufficiently explored. We further use BFC to dynamically trigger the shift from exploration to exploitation within active learning routines, replacing hand-tuned heuristics. To improve exploitation, we introduce a localized graph rewiring strategy that efficiently incorporates multiscale information around labeled nodes, enhancing label propagation while preserving sparsity. Experiments on benchmark classification tasks show that our methods consistently outperform existing graph-based semi-supervised baselines at low label rates.
☆ $π_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models
Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., $\pi_0$, $\pi_{0.5}$) remains challenging due to intractable action log-likelihoods from iterative denoising. We address this challenge with $\pi_{\text{RL}}$, an open-source framework for training flow-based VLAs in parallel simulation. $\pi_{\text{RL}}$ implements two RL algorithms: (1) {Flow-Noise} models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) {Flow-SDE} integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate $\pi_{\text{RL}}$ on LIBERO and ManiSkill benchmarks. On LIBERO, $\pi_{\text{RL}}$ boosts few-shot SFT models $\pi_0$ and $\pi_{0.5}$ from 57.6% to 97.6% and from 77.1% to 98.3%, respectively. In ManiSkill, we train $\pi_{\text{RL}}$ in 320 parallel environments, improving $\pi_0$ from 41.6% to 85.7% and $\pi_{0.5}$ from 40.0% to 84.8% across 4352 pick-and-place tasks, demonstrating scalable multitask RL under heterogeneous simulation. Overall, $\pi_{\text{RL}}$ achieves significant performance gains and stronger generalization over SFT-models, validating the effectiveness of online RL for flow-based VLAs.
comment: Preprint, work in progress. 24 pages
☆ Approximating Human Preferences Using a Multi-Judge Learned System
Aligning LLM-based judges with human preferences is a significant challenge, as they are difficult to calibrate and often suffer from rubric sensitivity, bias, and instability. Overcoming this challenge advances key applications, such as creating reliable reward models for Reinforcement Learning from Human Feedback (RLHF) and building effective routing systems that select the best-suited model for a given user query. In this work, we propose a framework for modeling diverse, persona-based preferences by learning to aggregate outputs from multiple rubric-conditioned judges. We investigate the performance of this approach against naive baselines and assess its robustness through case studies on both human and LLM-judges biases. Our primary contributions include a persona-based method for synthesizing preference labels at scale and two distinct implementations of our aggregator: Generalized Additive Model (GAM) and a Multi-Layer Perceptron (MLP).
☆ MedVLSynther: Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs
Large Multimodal Models (LMMs) are increasingly capable of answering medical questions that require joint reasoning over images and text, yet training general medical VQA systems is impeded by the lack of large, openly usable, high-quality corpora. We present MedVLSynther, a rubric-guided generator-verifier framework that synthesizes high-quality multiple-choice VQA items directly from open biomedical literature by conditioning on figures, captions, and in-text references. The generator produces self-contained stems and parallel, mutually exclusive options under a machine-checkable JSON schema; a multi-stage verifier enforces essential gates (self-containment, single correct answer, clinical validity, image-text consistency), awards fine-grained positive points, and penalizes common failure modes before acceptance. Applying this pipeline to PubMed Central yields MedSynVQA: 13,087 audited questions over 14,803 images spanning 13 imaging modalities and 28 anatomical regions. Training open-weight LMMs with reinforcement learning using verifiable rewards improves accuracy across six medical VQA benchmarks, achieving averages of 55.85 (3B) and 58.15 (7B), with up to 77.57 on VQA-RAD and 67.76 on PathVQA, outperforming strong medical LMMs. A Ablations verify that both generation and verification are necessary and that more verified data consistently helps, and a targeted contamination analysis detects no leakage from evaluation suites. By operating entirely on open literature and open-weight models, MedVLSynther offers an auditable, reproducible, and privacy-preserving path to scalable medical VQA training data.
comment: Project page, code, data, and models: https://ucsc-vlaa.github.io/MedVLSynther/
☆ Debate2Create: Robot Co-design via Large Language Model Debates
Automating the co-design of a robot's morphology and control is a long-standing challenge due to the vast design space and the tight coupling between body and behavior. We introduce Debate2Create (D2C), a framework in which large language model (LLM) agents engage in a structured dialectical debate to jointly optimize a robot's design and its reward function. In each round, a design agent proposes targeted morphological modifications, and a control agent devises a reward function tailored to exploit the new design. A panel of pluralistic judges then evaluates the design-control pair in simulation and provides feedback that guides the next round of debate. Through iterative debates, the agents progressively refine their proposals, producing increasingly effective robot designs. Notably, D2C yields diverse and specialized morphologies despite no explicit diversity objective. On a quadruped locomotion benchmark, D2C discovers designs that travel 73% farther than the default, demonstrating that structured LLM-based debate can serve as a powerful mechanism for emergent robot co-design. Our results suggest that multi-agent debate, when coupled with physics-grounded feedback, is a promising new paradigm for automated robot design.
☆ ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion NeurIPS 2025
Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. In this paper, we propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training. A core component of our framework is Neighborhood Patch Attention (NPA), an efficient mechanism that reduces computational redundancy in the self-attention layer with non-overlapping patches. We integrate NPA into an SDEdit pipeline and introduce Latent Frequency Mixing (LFM) to better generate fine details. Furthermore, we apply Structure Guidance to enhance global structure during the denoising process. Experimental results demonstrate that ScaleDiff achieves state-of-the-art performance among training-free methods in terms of both image quality and inference speed on both U-Net and Diffusion Transformer architectures.
comment: NeurIPS 2025. Code: https://github.com/KSH00906/ScaleDiff
☆ Beyond Long Context: When Semantics Matter More than Tokens
Electronic Health Records (EHR) store clinical documentation as base64 encoded attachments in FHIR DocumentReference resources, which makes semantic question answering difficult. Traditional vector database methods often miss nuanced clinical relationships. The Clinical Entity Augmented Retrieval (CLEAR) method, introduced by Lopez et al. 2025, uses entity aware retrieval and achieved improved performance with an F1 score of 0.90 versus 0.86 for embedding based retrieval, while using over 70 percent fewer tokens. We developed a Clinical Notes QA Evaluation Platform to validate CLEAR against zero shot large context inference and traditional chunk based retrieval augmented generation. The platform was tested on 12 clinical notes ranging from 10,000 to 65,000 tokens representing realistic EHR content. CLEAR achieved a 58.3 percent win rate, an average semantic similarity of 0.878, and used 78 percent fewer tokens than wide context processing. The largest performance gains occurred on long notes, with a 75 percent win rate for documents exceeding 65,000 tokens. These findings confirm that entity aware retrieval improves both efficiency and accuracy in clinical natural language processing. The evaluation framework provides a reusable and transparent benchmark for assessing clinical question answering systems where semantic precision and computational efficiency are critical.
comment: 12 pages, 5 figures
♻ ☆ 3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency
AI inference scaling is often tuned through 1D heuristics (a fixed reasoning passes) or 2D bivariate trade-offs (e.g., performance vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraints-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods to address the 3D multi-objective optimization (MOO) problem. Framing inference scaling in MOO shapes a feasible space that 1D and 2D optimizations fail to capture, enabling environmentadaptive selection of the inference scaling k. Results show that knee-point optimization achieves the best balance, while accuracy-maximization remains favorable when precision is prioritized. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational contexts.
♻ ☆ SpecCLIP: Aligning and Translating Spectroscopic Measurements for Stars
In recent years, large language models (LLMs) have transformed natural language understanding through vast datasets and large-scale parameterization. Inspired by this success, we present SpecCLIP, a foundation model framework that extends LLM-inspired methodologies to stellar spectral analysis. Stellar spectra, akin to structured language, encode rich physical and chemical information about stars. By training foundation models on large-scale spectral datasets, our goal is to learn robust and informative embeddings that support diverse downstream applications. As a proof of concept, SpecCLIP involves pre-training on two spectral types--LAMOST low-resolution and Gaia XP--followed by contrastive alignment using the CLIP (Contrastive Language-Image Pre-training) framework, adapted to associate spectra from different instruments. This alignment is complemented by auxiliary decoders that preserve spectrum-specific information and enable translation (prediction) between spectral types, with the former achieved by maximizing mutual information between embeddings and input spectra. The result is a cross-spectrum framework enabling intrinsic calibration and flexible applications across instruments. We demonstrate that fine-tuning these models on moderate-sized labeled datasets improves adaptability to tasks such as stellar-parameter estimation and chemical-abundance determination. SpecCLIP also enhances the accuracy and precision of parameter estimates benchmarked against external survey data. Additionally, its similarity search and cross-spectrum prediction capabilities offer potential for anomaly detection. Our results suggest that contrastively trained foundation models enriched with spectrum-aware decoders can advance precision stellar spectroscopy.
comment: 27 pages, 8 figures, 5 tables. Minor update: added corrected acknowledgments and corrected a misstated hyperparameter value (noted in footnote) for reproducibility. Submitted to AAS Journals. Comments welcome
♻ ☆ Curiosity-driven RL for symbolic equation solving NeurIPS 2025
We explore if RL can be useful for symbolic mathematics. Previous work showed contrastive learning can solve linear equations in one variable. We show model-free PPO \cite{schulman2017proximal} augmented with curiosity-based exploration and graph-based actions can solve nonlinear equations such as those involving radicals, exponentials, and trig functions. Our work suggests curiosity-based exploration may be useful for general symbolic reasoning tasks.
comment: Accepted at the NeurIPS 2025 MATH-AI Workshop
♻ ☆ The Price equation reveals a universal force-metric-bias law of algorithmic learning and natural selection
Diverse learning algorithms, optimization methods, and natural selection share a common mathematical structure, despite their apparent differences. Here I show that a simple notational partitioning of change by the Price equation reveals a universal force-metric-bias (FMB) law: $\Delta\mathbf{\theta} = \mathbf{M}\,\mathbf{f} + \mathbf{b} + \mathbf{\xi}$. The force $\mathbf{f}$ drives improvement in parameters, $\Delta\mathbf{\theta}$, in proportion to the slope of performance with respect to the parameters. The metric $\mathbf{M}$ rescales movement by inverse curvature. The bias $\mathbf{b}$ adds momentum or changes in the frame of reference. The noise $\mathbf{\xi}$ enables exploration. This framework unifies natural selection, Bayesian updating, Newton's method, stochastic gradient descent, stochastic Langevin dynamics, Adam optimization, and most other algorithms as special cases of the same underlying process. The Price equation also reveals why Fisher information, Kullback-Leibler divergence, and d'Alembert's principle arise naturally in learning dynamics. By exposing this common structure, the FMB law provides a principled foundation for understanding, comparing, and designing learning algorithms across disciplines.
comment: Version 2: fixed definition of force in abstract; Version 3: added citations and some minor editing
♻ ☆ ASGO: Adaptive Structured Gradient Optimization
Training deep neural networks is a structured optimization problem, because the parameters are naturally represented by matrices and tensors rather than by vectors. Under this structural representation, it has been widely observed that gradients are low-rank and Hessians are approximately block diagonal. These structured properties are crucial for designing efficient optimization algorithms, but are not utilized by many current popular optimizers like Adam. In this paper, we present a novel optimization algorithm ASGO that capitalizes on these properties by employing a preconditioner that is adaptively updated using structured gradients. By a fine-grained theoretical analysis, ASGO is proven to achieve superior convergence rates compared to existing structured gradient methods. Based on this convergence theory, we further demonstrate that ASGO can benefit from low-rank gradients and block diagonal Hessians. We also discuss practical modifications of ASGO and empirically verify ASGO's effectiveness on language model tasks. Code is available at https://github.com/infinity-stars/ASGO.
comment: 39 pages
♻ ☆ Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks
Understanding the inductive bias and generalization properties of large overparametrized machine learning models requires to characterize the dynamics of the training algorithm. We study the learning dynamics of large two-layer neural networks via dynamical mean field theory, a well established technique of non-equilibrium statistical physics. We show that, for large network width $m$, and large number of samples per input dimension $n/d$, the training dynamics exhibits a separation of timescales which implies: $(i)$~The emergence of a slow time scale associated with the growth in Gaussian/Rademacher complexity of the network; $(ii)$~Inductive bias towards small complexity if the initialization has small enough complexity; $(iii)$~A dynamical decoupling between feature learning and overfitting regimes; $(iv)$~A non-monotone behavior of the test error, associated `feature unlearning' regime at large times.
comment: 88 pages; 63 pdf figures
♻ ☆ SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning NeurIPS 2025
Multimodal in-context learning (ICL) remains underexplored despite significant potential for domains such as medicine. Clinicians routinely encounter diverse, specialized tasks requiring adaptation from limited examples, such as drawing insights from a few relevant prior cases or considering a constrained set of differential diagnoses. While multimodal large language models (MLLMs) have shown advances in medical visual question answering (VQA), their ability to learn multimodal tasks from context is largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL benchmark for medical tasks. Eleven medical experts curated problems, each including a multimodal query and multimodal in-context examples as task demonstrations. SMMILE encompasses 111 problems (517 question-image-answer triplets) covering 6 medical specialties and 13 imaging modalities. We further introduce SMMILE++, an augmented variant with 1038 permuted problems. A comprehensive evaluation of 15 MLLMs demonstrates that most models exhibit moderate to poor multimodal ICL ability in medical tasks. In open-ended evaluations, ICL contributes only an 8% average improvement over zero-shot on SMMILE and 9.4% on SMMILE++. We observe a susceptibility for irrelevant in-context examples: even a single noisy or irrelevant example can degrade performance by up to 9.5%. Moreover, we observe that MLLMs are affected by a recency bias, where placing the most relevant example last can lead to substantial performance improvements of up to 71%. Our findings highlight critical limitations and biases in current MLLMs when learning multimodal medical tasks from context. SMMILE is available at https://smmile-benchmark.github.io.
comment: NeurIPS 2025 (Datasets & Benchmarks Track)
♻ ☆ Exact Sequence Interpolation with Transformers SC
We prove that transformers can exactly interpolate datasets of finite input sequences in $\mathbb{R}^d$, $d\geq 2$, with corresponding output sequences of smaller or equal length. Specifically, given $N$ sequences of arbitrary but finite lengths in $\mathbb{R}^d$ and output sequences of lengths $m^1, \dots, m^N \in \mathbb{N}$, we construct a transformer with $\mathcal{O}(\sum_{j=1}^N m^j)$ blocks and $\mathcal{O}(d \sum_{j=1}^N m^j)$ parameters that exactly interpolates the dataset. Our construction provides complexity estimates that are independent of the input sequence length, by alternating feed-forward and self-attention layers and by capitalizing on the clustering effect inherent to the latter. Our novel constructive method also uses low-rank parameter matrices in the self-attention mechanism, a common feature of practical transformer implementations. These results are first established in the hardmax self-attention setting, where the geometric structure permits an explicit and quantitative analysis, and are then extended to the softmax setting. Finally, we demonstrate the applicability of our exact interpolation construction to learning problems, in particular by providing convergence guarantees to a global minimizer under regularized training strategies. Our analysis contributes to the theoretical understanding of transformer models, offering an explanation for their excellent performance in exact sequence-to-sequence interpolation tasks.
comment: 27 pages, 9 figures. Funded by the European Union (Horizon Europe MSCA project ModConFlex, grant number 101073558)
♻ ☆ MP-FVM: Enhancing Finite Volume Method for Water Infiltration Modeling in Unsaturated Soils via Message-passing Encoder-decoder Network
The spatiotemporal water flow dynamics in unsaturated soils can generally be modeled by the Richards equation. To overcome the computational challenges associated with solving this highly nonlinear partial differential equation (PDE), we present a novel solution algorithm, which we name as the MP-FVM (Message Passing-Finite Volume Method), to holistically integrate adaptive fixed-point iteration scheme, encoder-decoder neural network architecture, Sobolev training, and message passing mechanism in a finite volume discretization framework. We thoroughly discuss the need and benefits of introducing these components to achieve synergistic improvements in accuracy and stability of the solution. We also show that our MP-FVM algorithm can accurately solve the mixed-form $n$-dimensional Richards equation with guaranteed convergence under reasonable assumptions. Through several illustrative examples, we demonstrate that our MP-FVM algorithm not only achieves superior accuracy, but also better preserves the underlying physical laws and mass conservation of the Richards equation compared to state-of-the-art solution algorithms and the commercial HYDRUS solver.
comment: 36 pages, 14 figures, Accepted by Computers and Geotechnics
♻ ☆ Towards Scaling Deep Neural Networks with Predictive Coding: Theory and Practice
Backpropagation (BP) is the standard algorithm for training the deep neural networks that power modern artificial intelligence including large language models. However, BP is energy inefficient and unlikely to be implemented by the brain. This thesis studies an alternative, potentially more efficient brain-inspired algorithm called predictive coding (PC). Unlike BP, PC networks (PCNs) perform inference by iterative equilibration of neuron activities before learning or weight updates. Recent work has suggested that this iterative inference procedure provides a range of benefits over BP, such as faster training. However, these advantages have not been consistently observed, the inference and learning dynamics of PCNs are still poorly understood, and deep PCNs remain practically untrainable. Here, we make significant progress towards scaling PCNs by taking a theoretical approach grounded in optimisation theory. First, we show that the learning dynamics of PC can be understood as an approximate trust-region method using second-order information, despite explicitly using only first-order local updates. Second, going beyond this approximation, we show that PC can in principle make use of arbitrarily higher-order information, such that for feedforward networks the effective landscape on which PC learns is far more benign and robust to vanishing gradients than the (mean squared error) loss landscape. Third, motivated by a study of the inference dynamics of PCNs, we propose a new parameterisation called "$\mu$PC", which for the first time allows stable training of 100+ layer networks with little tuning and competitive performance on simple tasks. Overall, this thesis significantly advances our fundamental understanding of the inference and learning dynamics of PCNs, while highlighting the need for future research to focus on hardware co-design if PC is to compete with BP at scale.
comment: PhD thesis
♻ ☆ Score-Aware Policy-Gradient and Performance Guarantees using Local Lyapunov Stability
In this paper, we introduce a policy-gradient method for model-based reinforcement learning (RL) that exploits a type of stationary distributions commonly obtained from Markov decision processes (MDPs) in stochastic networks, queueing systems, and statistical mechanics. Specifically, when the stationary distribution of the MDP belongs to an exponential family that is parametrized by policy parameters, we can improve existing policy gradient methods for average-reward RL. Our key identification is a family of gradient estimators, called score-aware gradient estimators (SAGEs), that enable policy gradient estimation without relying on value-function estimation in the aforementioned setting. We show that SAGE-based policy-gradient locally converges, and we obtain its regret. This includes cases when the state space of the MDP is countable and unstable policies can exist. Under appropriate assumptions such as starting sufficiently close to a maximizer and the existence of a local Lyapunov function, the policy under SAGE-based stochastic gradient ascent has an overwhelming probability of converging to the associated optimal policy. Furthermore, we conduct a numerical comparison between a SAGE-based policy-gradient method and an actor-critic method on several examples inspired from stochastic networks, queueing systems, and models derived from statistical physics. Our results demonstrate that a SAGE-based method finds close-to-optimal policies faster than an actor-critic method.
♻ ☆ Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling
We present LrcSSM, a $\textit{non-linear}$ recurrent model that processes long sequences as fast as today's linear state-space layers. By forcing its Jacobian matrix to be diagonal, the full sequence can be solved in parallel, giving $\mathcal{O}(TD)$ time and memory and only $\mathcal{O}(\log T)$ sequential depth, for input-sequence length $T$ and a state dimension $D$. Moreover, LrcSSM offers a formal gradient-stability guarantee that other input-varying systems such as Liquid-S4 and Mamba do not provide. Importantly, the diagonal Jacobian structure of our model results in no performance loss compared to the original model with dense Jacobian, and the approach can be generalized to other non-linear recurrent models, demonstrating broader applicability. On a suite of long-range forecasting tasks, we demonstrate that LrcSSM outperforms Transformers, LRU, S5, and Mamba.
♻ ☆ Exploring the In-Context Learning Capabilities of LLMs for Money Laundering Detection in Financial Graphs ICDM 2025
The complexity and interconnectivity of entities involved in money laundering demand investigative reasoning over graph-structured data. This paper explores the use of large language models (LLMs) as reasoning engines over localized subgraphs extracted from a financial knowledge graph. We propose a lightweight pipeline that retrieves k-hop neighborhoods around entities of interest, serializes them into structured text, and prompts an LLM via few-shot in-context learning to assess suspiciousness and generate justifications. Using synthetic anti-money laundering (AML) scenarios that reflect common laundering behaviors, we show that LLMs can emulate analyst-style logic, highlight red flags, and provide coherent explanations. While this study is exploratory, it illustrates the potential of LLM-based graph reasoning in AML and lays groundwork for explainable, language-driven financial crime analytics.
comment: Accepted at AI4FCF-ICDM 2025
♻ ☆ The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity
Traditional LLM alignment methods are vulnerable to heterogeneity in human preferences. Fitting a na\"ive probabilistic model to pairwise comparison data (say over prompt-completion pairs) yields an inconsistent estimate of the population-average utility -a canonical measure of social welfare. We propose a new method, dubbed the sign estimator, that provides a simple, provably consistent, and efficient estimator by replacing cross-entropy with binary classification loss in the aggregation step. This simple modification recovers consistent ordinal alignment under mild assumptions and achieves the first polynomial finite-sample error bounds in this setting. In realistic simulations of LLM alignment using digital twins, the sign estimator substantially reduces preference distortion over a panel of simulated personas, cutting (angular) estimation error by nearly 35% and decreasing disagreement with true population preferences from 12% to 8% compared to standard RLHF. Our method also compares favorably to panel data heuristics that explicitly model user heterogeneity and require tracking individual-level preference data-all while maintaining the implementation simplicity of existing LLM alignment pipelines.
♻ ☆ Handling Label Noise via Instance-Level Difficulty Modeling and Dynamic Optimization
Recent studies indicate that deep neural networks degrade in generalization performance under noisy supervision. Existing methods focus on isolating clean subsets or correcting noisy labels, facing limitations such as high computational costs, heavy hyperparameter tuning process, and coarse-grained optimization. To address these challenges, we propose a novel two-stage noisy learning framework that enables instance-level optimization through a dynamically weighted loss function, avoiding hyperparameter tuning. To obtain stable and accurate information about noise modeling, we introduce a simple yet effective metric, termed wrong event, which dynamically models the cleanliness and difficulty of individual samples while maintaining computational costs. Our framework first collects wrong event information and builds a strong base model. Then we perform noise-robust training on the base model, using a probabilistic model to handle the wrong event information of samples. Experiments on five synthetic and real-world LNL benchmarks demonstrate our method surpasses state-of-the-art methods in performance, achieves a nearly 75% reduction in computational time and improves model scalability.
♻ ☆ Decom-Renorm-Merge: Model Merging on the Right Space Improves Multitasking
In the era of large-scale training, model merging has evolved into a tool for creating multitasking models efficiently. It enables the knowledge of models to be fused, without the need for heavy computation as required in traditional multitask learning. Existing merging methods often assume that entries at identical positions in weight matrices serve the same function, enabling straightforward entry-wise comparison and merging. However, this assumption overlooks the complexity of finetuned neural networks, where neurons may develop distinct feature compositions, making direct entry-wise merging problematic. We present Decom-Renorm-Merge (DRM), a simple yet effective approach that leverages Singular Value Decomposition to decompose and coordinate weight matrices into an aligned joint space, where entry-wise merging becomes possible. We showcase the effectiveness of DRM across various settings ranging from smaller encoder-based such as ViT and DeBERTa, encoder-decoder-based such as T5, and larger decoder-based such as Llama3.1-8B. Our experimental results show that DRM outperforms several state-of-the-art merging techniques across full finetuning and low-rank adaptation settings. Moreover, our analysis reveals renormalization as the crucial component for creating a robust and even joint space for merging, significantly contributing to the method's performance.
comment: Code and models are available at https://github.com/yophis/decom-renorm-merge
♻ ☆ NGGAN: Noise Generation GAN Based on the Practical Measurement Dataset for Narrowband Powerline Communications
To effectively process impulse noise for narrowband powerline communications (NB-PLCs) transceivers, capturing comprehensive statistics of nonperiodic asynchronous impulsive noise (APIN) is a critical task. However, existing mathematical noise generative models only capture part of the characteristics of noise. In this study, we propose a novel generative adversarial network (GAN) called noise generation GAN (NGGAN) that learns the complicated characteristics of practically measured noise samples for data synthesis. To closely match the statistics of complicated noise over the NB-PLC systems, we measured the NB-PLC noise via the analog coupling and bandpass filtering circuits of a commercial NB-PLC modem to build a realistic dataset. To train NGGAN, we adhere to the following principles: 1) we design the length of input signals that the NGGAN model can fit to facilitate cyclostationary noise generation; 2) the Wasserstein distance is used as a loss function to enhance the similarity between the generated noise and training data; and 3) to measure the similarity performances of GAN-based models based on the mathematical and practically measured datasets, we conduct both quantitative and qualitative analyses. The training datasets include: 1) a piecewise spectral cyclostationary Gaussian model (PSCGM); 2) a frequency-shift (FRESH) filter; and 3) practical measurements from NB-PLC systems. Simulation results demonstrate that the generated noise samples from the proposed NGGAN are highly close to the real noise samples. The principal component analysis (PCA) scatter plots and Fr\'echet inception distance (FID) analysis have shown that NGGAN outperforms other GAN-based models by generating noise samples with superior fidelity and higher diversity.
comment: 16 pages, 15 figures, 11 tables, and published in IEEE Transactions on Instrumentation and Measurement, 2025
♻ ☆ A method for the systematic generation of graph XAI benchmarks via Weisfeiler-Leman coloring
Graph neural networks have become the de facto model for learning from structured data. However, the decision-making process of GNNs remains opaque to the end user, which undermines their use in safety-critical applications. Several explainable AI techniques for graphs have been developed to address this major issue. Focusing on graph classification, these explainers identify subgraph motifs that explain predictions. Therefore, a robust benchmarking of graph explainers is required to ensure that the produced explanations are of high quality, i.e., aligned with the GNN's decision process. However, current graph-XAI benchmarks are limited to simplistic synthetic datasets or a few real-world tasks curated by domain experts, hindering rigorous and reproducible evaluation, and consequently stalling progress in the field. To overcome these limitations, we propose a method to automate the construction of graph XAI benchmarks from generic graph classification datasets. Our approach leverages the Weisfeiler-Leman color refinement algorithm to efficiently perform approximate subgraph matching and mine class-discriminating motifs, which serve as proxy ground-truth class explanations. At the same time, we ensure that these motifs can be learned by GNNs because their discriminating power aligns with WL expressiveness. This work also introduces the OpenGraphXAI benchmark suite, which consists of 15 ready-made graph-XAI datasets derived by applying our method to real-world molecular classification datasets. The suite is available to the public along with a codebase to generate over 2,000 additional graph-XAI benchmarks. Finally, we present a use case that illustrates how the suite can be used to assess the effectiveness of a selection of popular graph explainers, demonstrating the critical role of a sufficiently large benchmark collection for improving the significance of experimental results.
♻ ☆ Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization
Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify ones crucial for irreversible unlearning. We introduce Disruption Masking, a technique in which we only allow updating weights, where the signs of the unlearning gradient and the retaining gradient are the same. This ensures all updates are non-disruptive. Additionally, we identify the need for normalizing the unlearning gradients, and also confirm the usefulness of meta-learning. We combine these insights into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and validate its effectiveness at preventing the recovery of dangerous capabilities. MUDMAN outperforms the prior TAR method by 40%, setting a new state-of-the-art for robust unlearning.
♻ ☆ Pearl: A Foundation Model for Placing Every Atom in the Right Location
Accurately predicting the three-dimensional structures of protein-ligand complexes remains a fundamental challenge in computational drug discovery that limits the pace and success of therapeutic design. Deep learning methods have recently shown strong potential as structural prediction tools, achieving promising accuracy across diverse biomolecular systems. However, their performance and utility are constrained by scarce experimental data, inefficient architectures, physically invalid poses, and the limited ability to exploit auxiliary information available at inference. To address these issues, we introduce Pearl (Placing Every Atom in the Right Location), a foundation model for protein-ligand cofolding at scale. Pearl addresses these challenges with three key innovations: (1) training recipes that include large-scale synthetic data to overcome data scarcity; (2) architectures that incorporate an SO(3)-equivariant diffusion module to inherently respect 3D rotational symmetries, improving generalization and sample efficiency, and (3) controllable inference, including a generalized multi-chain templating system supporting both protein and non-polymeric components as well as dual unconditional/conditional modes. Pearl establishes a new state-of-the-art performance in protein-ligand cofolding. On the key metric of generating accurate (RMSD < 2 \r{A}) and physically valid poses, Pearl surpasses AlphaFold 3 and other open source baselines on the public Runs N' Poses and PoseBusters benchmarks, delivering 14.5% and 14.2% improvements, respectively, over the next best model. In the pocket-conditional cofolding regime, Pearl delivers $3.6\times$ improvement on a proprietary set of challenging, real-world drug targets at the more rigorous RMSD < 1 \r{A} threshold. Finally, we demonstrate that model performance correlates directly with synthetic dataset size used in training.
comment: technical report
♻ ☆ Auto-Adaptive PINNs with Applications to Phase Transitions
We propose an adaptive sampling method for the training of Physics Informed Neural Networks (PINNs) which allows for sampling based on an arbitrary problem-specific heuristic which may depend on the network and its gradients. In particular we focus our analysis on the Allen-Cahn equations, attempting to accurately resolve the characteristic interfacial regions using a PINN without any post-hoc resampling. In experiments, we show the effectiveness of these methods over residual-adaptive frameworks.
♻ ☆ Continuous Domain Generalization NeurIPS25
Real-world data distributions often shift continuously across multiple latent factors such as time, geography, and socioeconomic contexts. However, existing domain generalization approaches typically treat domains as discrete or as evolving along a single axis (e.g., time). This oversimplification fails to capture the complex, multidimensional nature of real-world variation. This paper introduces the task of Continuous Domain Generalization (CDG), which aims to generalize predictive models to unseen domains defined by arbitrary combinations of continuous variations. We present a principled framework grounded in geometric and algebraic theories, showing that optimal model parameters across domains lie on a low-dimensional manifold. To model this structure, we propose a Neural Lie Transport Operator (NeuralLio), which enables structure-preserving parameter transitions by enforcing geometric continuity and algebraic consistency. To handle noisy or incomplete domain variation descriptors, we introduce a gating mechanism to suppress irrelevant dimensions and a local chart-based strategy for robust generalization. Extensive experiments on synthetic and real-world datasets, including remote sensing, scientific documents, and traffic forecasting, demonstrate that our method significantly outperforms existing baselines in both generalization accuracy and robustness.
comment: 23 pages, 9 figures. Accepted by NeurIPS25
♻ ☆ SGFusion: Stochastic Geographic Gradient Fusion in Federated Learning
This paper proposes Stochastic Geographic Gradient Fusion (SGFusion), a novel training algorithm to leverage the geographic information of mobile users in Federated Learning (FL). SGFusion maps the data collected by mobile devices onto geographical zones and trains one FL model per zone, which adapts well to the data and behaviors of users in that zone. SGFusion models the local data-based correlation among geographical zones as a hierarchical random graph (HRG) optimized by Markov Chain Monte Carlo sampling. At each training step, every zone fuses its local gradient with gradients derived from a small set of other zones sampled from the HRG. This approach enables knowledge fusion and sharing among geographical zones in a probabilistic and stochastic gradient fusion process with self-attention weights, such that "more similar" zones have "higher probabilities" of sharing gradients with "larger attention weights." SGFusion remarkably improves model utility without introducing undue computational cost. Extensive theoretical and empirical results using a heart-rate prediction dataset collected across 6 countries show that models trained with SGFusion converge with upper-bounded expected errors and significantly improve utility in all countries compared to existing approaches without notable cost in system scalability.
♻ ☆ Tracking the Median of Gradients with a Stochastic Proximal Point Method
There are several applications of stochastic optimization where one can benefit from a robust estimate of the gradient. For example, domains such as distributed learning with corrupted nodes, the presence of large outliers in the training data, learning under privacy constraints, or even heavy-tailed noise due to the dynamics of the algorithm itself. Here we study SGD with robust gradient estimators based on estimating the median. We first derive iterative methods based on the stochastic proximal point method for computing the median gradient and generalizations thereof. Then we propose an algorithm estimating the median gradient across iterations, and find that several well known methods are particular cases of this framework. For instance, we observe that different forms of clipping allow to compute online estimators of the median of gradients, in contrast to (heavy-ball) momentum, which corresponds to an online estimator of the mean. Finally, we provide a theoretical framework for an algorithm computing the median gradient across samples, and show that the resulting method can converge even under heavy-tailed, state-dependent noise.
♻ ☆ TuneNSearch: a hybrid transfer learning and local search approach for solving vehicle routing problems
This paper introduces TuneNSearch, a hybrid transfer learning and local search approach for addressing diverse variants of the vehicle routing problem (VRP). Our method uses reinforcement learning to generate high-quality solutions, which are subsequently refined by an efficient local search procedure. To ensure broad adaptability across VRP variants, TuneNSearch begins with a pre-training phase on the multi-depot VRP (MDVRP), followed by a fine-tuning phase to adapt it to other problem formulations. The learning phase utilizes a Transformer-based architecture enhanced with edge-aware attention, which integrates edge distances directly into the attention mechanism to better capture spatial relationships inherent to routing problems. We show that the pre-trained model generalizes effectively to single-depot variants, achieving performance comparable to models trained specifically on single-depot instances. Simultaneously, it maintains strong performance on multi-depot variants, an ability that models pre-trained solely on single-depot problems lack. For example, on 100-node instances of multi-depot variants, TuneNSearch outperforms a model pre-trained on the CVRP by 44%. In contrast, on 100-node instances of single-depot variants, TuneNSearch performs similar to the CVRP model. To validate the effectiveness of our method, we conduct extensive computational experiments on public benchmark and randomly generated instances. Across multiple CVRPLIB datasets, TuneNSearch consistently achieves performance deviations of less than 3% from the best-known solutions in the literature, compared to 6-25% for other neural-based models, depending on problem complexity. Overall, our approach demonstrates strong generalization to different problem sizes, instance distributions, and VRP formulations, while maintaining polynomial runtime complexity despite the integration of the local search algorithm.
♻ ☆ Lift What You Can: Green Online Learning with Heterogeneous Ensembles
Ensemble methods for stream mining necessitate managing multiple models and updating them as data distributions evolve. Considering the calls for more sustainability, established methods are however not sufficiently considerate of ensemble members' computational expenses and instead overly focus on predictive capabilities. To address these challenges and enable green online learning, we propose heterogeneous online ensembles (HEROS). For every training step, HEROS chooses a subset of models from a pool of models initialized with diverse hyperparameter choices under resource constraints to train. We introduce a Markov decision process to theoretically capture the trade-offs between predictive performance and sustainability constraints. Based on this framework, we present different policies for choosing which models to train on incoming data. Most notably, we propose the novel $\zeta$-policy, which focuses on training near-optimal models at reduced costs. Using a stochastic model, we theoretically prove that our $\zeta$-policy achieves near optimal performance while using fewer resources compared to the best performing policy. In our experiments across 11 benchmark datasets, we find empiric evidence that our $\zeta$-policy is a strong contribution to the state-of-the-art, demonstrating highly accurate performance, in some cases even outperforming competitors, and simultaneously being much more resource-friendly.
♻ ☆ Reinforcement Learning Teachers of Test Time Scaling NeurIPS 2025
Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework. Code available at: https://github.com/SakanaAI/RLT
comment: Accepted at NeurIPS 2025
♻ ☆ Group Interventions on Deep Networks for Causal Discovery in Subsystems
Causal discovery uncovers complex relationships between variables, enhancing predictions, decision-making, and insights into real-world systems, especially in nonlinear multivariate time series. However, most existing methods primarily focus on pairwise cause-effect relationships, overlooking interactions among groups of variables, i.e., subsystems and their collective causal influence. In this study, we introduce gCDMI, a novel multi-group causal discovery method that leverages group-level interventions on trained deep neural networks and employs model invariance testing to infer causal relationships. Our approach involves three key steps. First, we use deep learning to jointly model the structural relationships among groups of all time series. Second, we apply group-wise interventions to the trained model. Finally, we conduct model invariance testing to determine the presence of causal links among variable groups. We evaluate our method on simulated datasets, demonstrating its superior performance in identifying group-level causal relationships compared to existing methods. Additionally, we validate our approach on real-world datasets, including brain networks and climate ecosystems. Our results highlight that applying group-level interventions to deep learning models, combined with invariance testing, can effectively reveal complex causal structures, offering valuable insights for domains such as neuroscience and climate science.
comment: Submitted to IEEE Access. We are working on the revised version
♻ ☆ Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025)
Accurately predicting distributed cortical responses to naturalistic stimuli requires models that integrate visual, auditory and semantic information over time. We present a hierarchical multimodal recurrent ensemble that maps pretrained video, audio, and language embeddings to fMRI time series recorded while four subjects watched almost 80 hours of movies provided by the Algonauts 2025 challenge. Modality-specific bidirectional RNNs encode temporal dynamics; their hidden states are fused and passed to a second recurrent layer, and lightweight subject-specific heads output responses for 1000 cortical parcels. Training relies on a composite MSE-correlation loss and a curriculum that gradually shifts emphasis from early sensory to late association regions. Averaging 100 model variants further boosts robustness. The resulting system ranked third on the competition leaderboard, achieving an overall Pearson r = 0.2094 and the highest single-parcel peak score (mean r = 0.63) among all participants, with particularly strong gains for the most challenging subject (Subject 5). The approach establishes a simple, extensible baseline for future multimodal brain-encoding benchmarks.
comment: 8 pages, 2 figures, 1 table. Invited report, CCN 2025 Algonauts Project session (3rd-place team). Code: https://github.com/erensemih/Algonauts2025_ModalityRNN v3: Added equal contribution footnote to author list. Corrected reference list
♻ ☆ SATURN: SAT-based Reinforcement Learning to Unleash Language Model Reasoning
How to design reinforcement learning (RL) tasks that effectively unleash the reasoning capability of large language models (LLMs) remains an open question. Existing RL tasks (e.g., math, programming, and constructing reasoning tasks) suffer from three key limitations: (1) Scalability. They rely heavily on human annotation or expensive LLM synthesis to generate sufficient training data. (2) Verifiability. LLMs' outputs are hard to verify automatically and reliably. (3) Controllable Difficulty. Most tasks lack fine-grained difficulty control, making it hard to train LLMs to develop reasoning ability from easy to hard. To address these limitations, we propose Saturn, a SAT-based RL framework that uses Boolean Satisfiability (SAT) problems to train and evaluate LLMs reasoning. Saturn enables scalable task construction, rule-based verification, and precise difficulty control. Saturn designs a curriculum learning pipeline that continuously improves LLMs' reasoning capability by constructing SAT tasks of increasing difficulty and training LLMs from easy to hard. To ensure stable training, we design a principled mechanism to control difficulty transitions. We introduce Saturn-2.6k, a dataset of 2,660 SAT problems with varying difficulty. It supports the evaluation of how LLM reasoning changes with problem difficulty. We apply Saturn to DeepSeek-R1-Distill-Qwen and obtain Saturn-1.5B and Saturn-7B. We achieve several notable results: (1) On SAT problems, Saturn-1.5B and Saturn-7B achieve average pass@3 improvements of +14.0 and +28.1, respectively. (2) On math and programming tasks, Saturn-1.5B and Saturn-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g., AIME, LiveCodeBench). (3) Compared to the state-of-the-art (SOTA) approach in constructing RL tasks, Saturn achieves further improvements of +8.8%. We release the source code, data, and models to support future research.
♻ ☆ Brain-inspired Computational Intelligence via Predictive Coding
Artificial intelligence (AI) is rapidly becoming one of the key technologies of this century. The majority of results in AI thus far have been achieved using deep neural networks trained with a learning algorithm called error backpropagation, always considered biologically implausible. To this end, recent works have studied learning algorithms for deep neural networks inspired by the neurosciences. One such theory, called predictive coding (PC), has shown promising properties that make it potentially valuable for the machine learning community: it can model information processing in different areas of the brain, can be used in control and robotics, has a solid mathematical foundation in variational inference, and performs its computations asynchronously. Inspired by such properties, works that propose novel PC-like algorithms are starting to be present in multiple sub-fields of machine learning and AI at large. Here, we survey such efforts by first providing a broad overview of the history of PC to provide common ground for the understanding of the recent developments, then by describing current efforts and results, and concluding with a large discussion of possible implications and ways forward.
comment: 26 Pages, 9 Figures
♻ ☆ HyperMARL: Adaptive Hypernetworks for Multi-Agent RL NeurIPS 2025
Adaptive cooperation in multi-agent reinforcement learning (MARL) requires policies to express homogeneous, specialised, or mixed behaviours, yet achieving this adaptivity remains a critical challenge. While parameter sharing (PS) is standard for efficient learning, it notoriously suppresses the behavioural diversity required for specialisation. This failure is largely due to cross-agent gradient interference, a problem we find is surprisingly exacerbated by the common practice of coupling agent IDs with observations. Existing remedies typically add complexity through altered objectives, manual preset diversity levels, or sequential updates -- raising a fundamental question: can shared policies adapt without these intricacies? We propose a solution built on a key insight: an agent-conditioned hypernetwork can generate agent-specific parameters and decouple observation- and agent-conditioned gradients, directly countering the interference from coupling agent IDs with observations. Our resulting method, HyperMARL, avoids the complexities of prior work and empirically reduces policy gradient variance. Across diverse MARL benchmarks (22 scenarios, up to 30 agents), HyperMARL achieves performance competitive with six key baselines while preserving behavioural diversity comparable to non-parameter sharing methods, establishing it as a versatile and principled approach for adaptive MARL. The code is publicly available at https://github.com/KaleabTessera/HyperMARL.
comment: To appear at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025). A preliminary version of this work was presented at the CoCoMARL workshop, RLC 2025
♻ ☆ Reliable Evaluation and Benchmarks for Statement Autoformalization EMNLP 2025
Evaluating statement autoformalization, translating natural language mathematics into formal languages like Lean 4, remains a significant challenge, with few metrics, datasets, and standards to robustly measure progress. In this work, we present a comprehensive approach combining improved metrics, robust benchmarks, and systematic evaluation, to fill this gap. First, we introduce BEq+, an automated metric that correlates strongly with human judgment, along with ProofNetVerif, a new dataset for assessing the quality of evaluation metrics, containing 3,752 annotated examples. Second, we develop two new autoformalization benchmarks: ProofNet#, a corrected version of ProofNet, and RLM25, with 619 new pairs of research-level mathematics from six formalization projects. Through systematic experimentation across these benchmarks, we find that current techniques can achieve up to 45.1% accuracy on undergraduate mathematics but struggle with research-level content without proper context. Our work establishes a reliable foundation for evaluating and advancing autoformalization systems.
comment: Accepted to EMNLP 2025. New benchmarks released, see https://github.com/augustepoiroux/RLMEval , https://huggingface.co/datasets/PAug/ProofNetSharp , and https://huggingface.co/datasets/PAug/ProofNetVerif . For code, see https://github.com/augustepoiroux/LeanInteract
♻ ☆ OmegAMP: Targeted AMP Discovery through Biologically Informed Generation
Deep learning-based antimicrobial peptide (AMP) discovery faces critical challenges such as limited controllability, lack of representations that efficiently model antimicrobial properties, and low experimental hit rates. To address these challenges, we introduce OmegAMP, a framework designed for reliable AMP generation with increased controllability. Its diffusion-based generative model leverages a novel conditioning mechanism to achieve fine-grained control over desired physicochemical properties and to direct generation towards specific activity profiles, including species-specific effectiveness. This is further enhanced by a biologically informed encoding space that significantly improves overall generative performance. Complementing these generative capabilities, OmegAMP leverages a novel synthetic data augmentation strategy to train classifiers for AMP filtering, drastically reducing false positive rates and thereby increasing the likelihood of experimental success. Our in silico experiments demonstrate that OmegAMP delivers state-of-the-art performance across key stages of the AMP discovery pipeline, enabling us to achieve an unprecedented success rate in wet lab experiments. We tested 25 candidate peptides, 24 of them (96%) demonstrated antimicrobial activity, proving effective even against multi-drug resistant strains. Our findings underscore OmegAMP's potential to significantly advance computational frameworks in the fight against antimicrobial resistance.
♻ ☆ Transformers from Compressed Representations
Compressed file formats are the corner stone of efficient data storage and transmission, yet their potential for representation learning remains largely underexplored. We introduce TEMPEST (TransformErs froM comPressed rEpreSenTations), a method that exploits the inherent byte-stream structure of compressed files to design an effective tokenization and encoding strategy. By leveraging this compact encoding, a standard transformer can directly learn semantic representations from compressed data streams, bypassing the need for raw byte-level processing or full media decoding. Our proposal substantially reduces the number of tokens required for semantic classification, thereby lowering both computational complexity and memory usage. Through extensive experiments across diverse datasets, coding schemes, and modalities, we show that TEMPEST achieves accuracy competitive wit the state-of-the-art while delivering efficiency gains in memory and compute.
♻ ☆ Differential Privacy as a Perk: Federated Learning over Multiple-Access Fading Channels with a Multi-Antenna Base Station
Federated Learning (FL) is a distributed learning paradigm that preserves privacy by eliminating the need to exchange raw data during training. In its prototypical edge instantiation with underlying wireless transmissions enabled by analog over-the-air computing (AirComp), referred to as \emph{over-the-air FL (AirFL)}, the inherent channel noise plays a unique role of \emph{frenemy} in the sense that it degrades training due to noisy global aggregation while providing a natural source of randomness for privacy-preserving mechanisms, formally quantified by \emph{differential privacy (DP)}. It remains, nevertheless, challenging to effectively harness such channel impairments, as prior arts, under assumptions of either simple channel models or restricted types of loss functions, mostly considering (local) DP enhancement with a single-round or non-convergent bound on privacy loss. In this paper, we study AirFL over multiple-access fading channels with a multi-antenna base station (BS) subject to user-level DP requirements. Despite a recent study, which claimed in similar settings that artificial noise (AN) must be injected to ensure DP in general, we demonstrate, on the contrary, that DP can be gained as a \emph{perk} even \emph{without} employing any AN. Specifically, we derive a novel bound on DP that converges under general bounded-domain assumptions on model parameters, along with a convergence bound with general smooth and non-convex loss functions. Next, we optimize over receive beamforming and power allocations to characterize the optimal convergence-privacy trade-offs, which also reveal explicit conditions in which DP is achievable without compromising training. Finally, our theoretical findings are validated by extensive numerical results.
comment: 15 pages, 5 figures, submitted for possible publication
♻ ☆ Improving Robustness of AlphaZero Algorithms to Test-Time Environment Changes
The AlphaZero framework provides a standard way of combining Monte Carlo planning with prior knowledge provided by a previously trained policy-value neural network. AlphaZero usually assumes that the environment on which the neural network was trained will not change at test time, which constrains its applicability. In this paper, we analyze the problem of deploying AlphaZero agents in potentially changed test environments and demonstrate how the combination of simple modifications to the standard framework can significantly boost performance, even in settings with a low planning budget available. The code is publicly available on GitHub.
comment: Presented at the 37th Benelux Conference on Artificial Intelligence and the 34th Belgian Dutch Conference on Machine Learning (BNAIC/BeNeLearn 2025)
♻ ☆ Exploring End-to-end Differentiable Neural Charged Particle Tracking -- A Loss Landscape Perspective
Measurement and analysis of high energetic particles for scientific, medical or industrial applications is a complex procedure, requiring the design of sophisticated detector and data processing systems. The development of adaptive and differentiable software pipelines using a combination of conventional and machine learning algorithms is therefore getting ever more important to optimize and operate the system efficiently while maintaining end-to-end (E2E) differentiability. We propose for the application of charged particle tracking an E2E differentiable decision-focused learning scheme using graph neural networks with combinatorial components solving a linear assignment problem for each detector layer. We demonstrate empirically that including differentiable variations of discrete assignment operations allows for efficient network optimization, working better or on par with approaches that lack E2E differentiability. In additional studies, we dive deeper into the optimization process and provide further insights from a loss landscape perspective. We demonstrate that while both methods converge into similar performing, globally well-connected regions, they suffer under substantial predictive instability across initialization and optimization methods, which can have unpredictable consequences on the performance of downstream tasks such as image reconstruction. We also point out a dependency between the interpolation factor of the gradient estimator and the prediction stability of the model, suggesting the choice of sufficiently small values. Given the strong global connectivity of learned solutions and the excellent training performance, we argue that E2E differentiability provides, besides the general availability of gradient information, an important tool for robust particle tracking to mitigate prediction instabilities by favoring solutions that perform well on downstream tasks.
comment: Published in Transactions on Machine Learning Research (TMLR), 2025
♻ ☆ Taxonomy and Trends in Reinforcement Learning for Robotics and Control Systems: A Structured Review
Reinforcement learning (RL) has become a foundational approach for enabling intelligent robotic behavior in dynamic and uncertain environments. This work presents an in-depth review of RL principles, advanced deep reinforcement learning (DRL) algorithms, and their integration into robotic and control systems. Beginning with the formalism of Markov Decision Processes (MDPs), the study outlines essential elements of the agent-environment interaction and explores core algorithmic strategies including actor-critic methods, value-based learning, and policy gradients. Emphasis is placed on modern DRL techniques such as DDPG, TD3, PPO, and SAC, which have shown promise in solving high-dimensional, continuous control tasks. A structured taxonomy is introduced to categorize RL applications across domains such as locomotion, manipulation, multi-agent coordination, and human-robot interaction, along with training methodologies and deployment readiness levels. The review synthesizes recent research efforts, highlighting technical trends, design patterns, and the growing maturity of RL in real-world robotics. Overall, this work aims to bridge theoretical advances with practical implementations, providing a consolidated perspective on the evolving role of RL in autonomous robotic systems.
♻ ☆ TabArena: A Living Benchmark for Machine Learning on Tabular Data NeurIPS 2025
With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.
comment: Accepted (spotlight) at NeurIPS 2025 Datasets and Benchmarks Track. v3: NeurIPS camera-ready version. v2: fixed author list. 51 pages. Code available at https://tabarena.ai/code; examples at https://tabarena.ai/code-examples; dataset curation at https://tabarena.ai/data-tabular-ml-iid-study and https://tabarena.ai/dataset-curation
♻ ☆ GnnXemplar: Exemplars to Explanations -- Natural Language Rules for Global GNN Interpretability NeurIPS 2025
Graph Neural Networks (GNNs) are widely used for node classification, yet their opaque decision-making limits trust and adoption. While local explanations offer insights into individual predictions, global explanation methods, those that characterize an entire class, remain underdeveloped. Existing global explainers rely on motif discovery in small graphs, an approach that breaks down in large, real-world settings where subgraph repetition is rare, node attributes are high-dimensional, and predictions arise from complex structure-attribute interactions. We propose GnnXemplar, a novel global explainer inspired from Exemplar Theory from cognitive science. GnnXemplar identifies representative nodes in the GNN embedding space, exemplars, and explains predictions using natural language rules derived from their neighborhoods. Exemplar selection is framed as a coverage maximization problem over reverse k-nearest neighbors, for which we provide an efficient greedy approximation. To derive interpretable rules, we employ a self-refining prompt strategy using large language models (LLMs). Experiments across diverse benchmarks show that GnnXemplar significantly outperforms existing methods in fidelity, scalability, and human interpretability, as validated by a user study with 60 participants.
comment: 38 pages, 20 figures, NeurIPS 2025 (Oral)
♻ ☆ Probabilistic Kernel Function for Fast Angle Testing
In this paper, we study the angle testing problem in the context of similarity search in high-dimensional Euclidean spaces and propose two projection-based probabilistic kernel functions, one designed for angle comparison and the other for angle thresholding. Unlike existing approaches that rely on random projection vectors drawn from Gaussian distributions, our approach leverages reference angles and employs a deterministic structure for the projection vectors. Notably, our kernel functions do not require asymptotic assumptions, such as the number of projection vectors tending to infinity, and can be both theoretically and experimentally shown to outperform Gaussian-distribution-based kernel functions. We apply the proposed kernel function to Approximate Nearest Neighbor Search (ANNS) and demonstrate that our approach achieves a 2.5X ~ 3X higher query-per-second (QPS) throughput compared to the widely-used graph-based search algorithm HNSW.
♻ ☆ OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents NeurIPS 2025
Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score). We evaluate computer use agents based on a range of frontier models - such as o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro - and provide insights into their safety. In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm.
comment: NeurIPS 2025 Datasets & Benchmarks Track (Spotlight)
♻ ☆ SNN-Based Online Learning of Concepts and Action Laws in an Open World
We present the architecture of a fully autonomous, bio-inspired cognitive agent built around a spiking neural network (SNN) implementing the agent's semantic memory. This agent explores its universe and learns concepts of objects/situations and of its own actions in a one-shot manner. While object/situation concepts are unary, action concepts are triples made up of an initial situation, a motor activity, and an outcome. They embody the agent's knowledge of its universe's action laws. Both kinds of concepts have different degrees of generality. To make decisions the agent queries its semantic memory for the expected outcomes of envisaged actions and chooses the action to take on the basis of these predictions. Our experiments show that the agent handles new situations by appealing to previously learned general concepts and rapidly modifies its concepts to adapt to environment changes.
♻ ☆ Learning-Augmented Online Bipartite Fractional Matching NeurIPS 2025
Online bipartite matching is a fundamental problem in online optimization, extensively studied both in its integral and fractional forms due to its theoretical significance and practical applications, such as online advertising and resource allocation. Motivated by recent progress in learning-augmented algorithms, we study online bipartite fractional matching when the algorithm is given advice in the form of a suggested matching in each iteration. We develop algorithms for both the vertex-weighted and unweighted variants that provably dominate the naive "coin flip" strategy of randomly choosing between the advice-following and advice-free algorithms. Moreover, our algorithm for the vertex-weighted setting extends to the AdWords problem under the small bids assumption, yielding a significant improvement over the seminal work of Mahdian, Nazerzadeh, and Saberi (EC 2007, TALG 2012). Complementing our positive results, we establish a hardness bound on the robustness-consistency tradeoff that is attainable by any algorithm. We empirically validate our algorithms through experiments on synthetic and real-world data.
comment: To appear in NeurIPS 2025. Full version
♻ ☆ Differential Mamba AACL 2025
Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available: https://github.com/NadavSc/Diff-Mamba
comment: AACL 2025. We provide the code at https://github.com/NadavSc/Diff-Mamba
♻ ☆ Redistributing Rewards Across Time and Agents for Multi-Agent Reinforcement Learning
Credit assignmen, disentangling each agent's contribution to a shared reward, is a critical challenge in cooperative multi-agent reinforcement learning (MARL). To be effective, credit assignment methods must preserve the environment's optimal policy. Some recent approaches attempt this by enforcing return equivalence, where the sum of distributed rewards must equal the team reward. However, their guarantees are conditional on a learned model's regression accuracy, making them unreliable in practice. We introduce Temporal-Agent Reward Redistribution (TAR$^2$), an approach that decouples credit modeling from this constraint. A neural network learns unnormalized contribution scores, while a separate, deterministic normalization step enforces return equivalence by construction. We demonstrate that this method is equivalent to a valid Potential-Based Reward Shaping (PBRS), which guarantees the optimal policy is preserved regardless of model accuracy. Empirically, on challenging SMACLite and Google Research Football (GRF) benchmarks, TAR$^2$ accelerates learning and achieves higher final performance than strong baselines. These results establish our method as an effective solution for the agent-temporal credit assignment problem.
comment: 16 pages, 4 figures, 4 tables
♻ ☆ Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner NeurIPS 2025
The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such as learning rate grafting and stale preconditioning to achieve performance at-scale. These heuristics increase algorithmic complexity, necessitate further hyperparameter tuning, and lack theoretical justification. This paper investigates these heuristics from the angle of Frobenius norm approximation to full-matrix Adam and decouples the preconditioner's eigenvalues and eigenbasis updates. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner's eigenvalues and how correcting the eigenvalues directly eliminates the need for learning rate grafting. To manage the error induced by infrequent eigenbasis computations, we propose an adaptive criterion for determining the eigenbasis computation frequency motivated by terminating a warm-started QR algorithm. This criterion decouples the update frequency of different preconditioner matrices and enables us to investigate the impact of approximation error on convergence. These practical techniques offer a principled angle towards removing Shampoo's heuristics and developing improved Kronecker-factorization-based training algorithms.
comment: NeurIPS 2025
♻ ☆ Activation Matching for Explanation Generation
In this paper we introduce an activation-matching--based approach to generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image. Given an input image $x$ and a frozen model $f$, we train a lightweight autoencoder to output a binary mask $m$ such that the explanation $e = m \odot x$ preserves both the model's prediction and the intermediate activations of \(x\). Our objective combines: (i) multi-layer activation matching with KL divergence to align distributions and cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors -- L1 area for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Together, these objectives yield small, human-interpretable masks that retain classifier behavior while discarding irrelevant input regions, providing practical and faithful minimalist explanations for the decision making of the underlying model.
♻ ☆ Expand and Compress: Exploring Tuning Principles for Continual Spatio-Temporal Graph Forecasting ICLR 2025
The widespread deployment of sensing devices leads to a surge in data for spatio-temporal forecasting applications such as traffic flow, air quality, and wind energy. Although spatio-temporal graph neural networks have achieved success in modeling various static spatio-temporal forecasting scenarios, real-world spatio-temporal data are typically received in a streaming manner, and the network continuously expands with the installation of new sensors. Thus, spatio-temporal forecasting in streaming scenarios faces dual challenges: the inefficiency of retraining models over newly arrived data and the detrimental effects of catastrophic forgetting over long-term history. To address these challenges, we propose a novel prompt tuning-based continuous forecasting method, following two fundamental tuning principles guided by empirical and theoretical analysis: expand and compress, which effectively resolve the aforementioned problems with lightweight tuning parameters. Specifically, we integrate the base spatio-temporal graph neural network with a continuous prompt pool, utilizing stored prompts (i.e., few learnable parameters) in memory, and jointly optimize them with the base spatio-temporal graph neural network. This method ensures that the model sequentially learns from the spatio-temporal data stream to accomplish tasks for corresponding periods. Extensive experimental results on multiple real-world datasets demonstrate the multi-faceted superiority of our method over the state-of-the-art baselines, including effectiveness, efficiency, universality, etc.
comment: Accepted by ICLR 2025
♻ ☆ Learning with Calibration: Exploring Test-Time Computing of Spatio-Temporal Forecasting NeurIPS 2025
Spatio-temporal forecasting is crucial in many domains, such as transportation, meteorology, and energy. However, real-world scenarios frequently present challenges such as signal anomalies, noise, and distributional shifts. Existing solutions primarily enhance robustness by modifying network architectures or training procedures. Nevertheless, these approaches are computationally intensive and resource-demanding, especially for large-scale applications. In this paper, we explore a novel test-time computing paradigm, namely learning with calibration, ST-TTC, for spatio-temporal forecasting. Through learning with calibration, we aim to capture periodic structural biases arising from non-stationarity during the testing phase and perform real-time bias correction on predictions to improve accuracy. Specifically, we first introduce a spectral-domain calibrator with phase-amplitude modulation to mitigate periodic shift and then propose a flash updating mechanism with a streaming memory queue for efficient test-time computation. ST-TTC effectively bypasses complex training-stage techniques, offering an efficient and generalizable paradigm. Extensive experiments on real-world datasets demonstrate the effectiveness, universality, flexibility and efficiency of our proposed method.
comment: Accepted by NeurIPS 2025 (Spotlight)
♻ ☆ Revisiting Service Level Objectives and System Level Metrics in Large Language Model Serving
User experience is a critical factor Large Language Model (LLM) serving systems must consider, where service level objectives (SLOs) considering the experience of individual requests and system level metrics (SLMs) considering the overall system performance are two key performance measures. However, we observe two notable issues in existing metrics: 1) manually delaying the delivery of some tokens can improve SLOs, and 2) actively abandoning requests that do not meet SLOs can improve SLMs, both of which are counterintuitive. In this paper, we revisit SLOs and SLMs in LLM serving, and propose a new SLO that aligns with user experience. Based on the SLO, we propose a comprehensive metric framework called smooth goodput, which integrates SLOs and SLMs to reflect the nature of user experience in LLM serving. Through this unified framework, we reassess the performance of different LLM serving systems under multiple workloads. Evaluation results show that our metric framework provides a more comprehensive view of token delivery and request processing, and effectively captures the optimal point of user experience and system performance with different serving strategies.
♻ ☆ Statistical physics of deep learning: Optimal learning of a multi-layer perceptron near interpolation
For three decades statistical physics has been providing a framework to analyse neural networks. A long-standing question remained on its capacity to tackle deep learning models capturing rich feature learning effects, thus going beyond the narrow networks or kernel methods analysed until now. We positively answer through the study of the supervised learning of a multi-layer perceptron. Importantly, (i) its width scales as the input dimension, making it more prone to feature learning than ultra wide networks, and more expressive than narrow ones or with fixed embedding layers; and (ii) we focus on the challenging interpolation regime where the number of trainable parameters and data are comparable, which forces the model to adapt to the task. We consider the matched teacher-student setting. It provides the fundamental limits of learning random deep neural network targets and helps in identifying the sufficient statistics describing what is learnt by an optimally trained network as the data budget increases. A rich phenomenology emerges with various learning transitions. With enough data optimal performance is attained through model's "specialisation" towards the target, but it can be hard to reach for training algorithms which get attracted by sub-optimal solutions predicted by the theory. Specialisation occurs inhomogeneously across layers, propagating from shallow towards deep ones, but also across neurons in each layer. Furthermore, deeper targets are harder to learn. Despite its simplicity, the Bayesian-optimal setting provides insights on how the depth, non-linearity and finite (proportional) width influence neural networks in the feature learning regime that are potentially relevant way beyond it.
comment: 30 pages, 19 figures + appendix. This submission supersedes both arXiv:2505.24849 and arXiv:2501.18530. v2 fixes figures
♻ ☆ Explainable Framework for Swarm Intelligence Based on Fitness Landscape Features and Machine Learning
Swarm based optimization algorithms have demonstrated remarkable success in solving complex optimization problems. However, their widespread adoption remains sceptical due to limited transparency in how different algorithmic components influence the overall performance of the algorithm. This work presents a multi-faceted interpretability related investigations of one of the popular swarm algorithms, Particle Swarm Optimization. Through this work, we provide a framework that makes the role of different topologies and parameters in PSO interpretable and explainable using novel machine learning approach. We first developed a comprehensive landscape characterization framework using Exploratory Landscape Analysis to quantify problem difficulty and identify critical features in the problem that affects the optimization performance of PSO. Secondly, we rigorously compare three topologies -- Ring, Star, and Von Neumann -- analyzing their distinct impacts on exploration-exploitation balance, convergence behavior, and solution quality and eventually develop an explainable benchmarking framework for PSO. The work successfully decodes how swarm topologies affect information flow, diversity, and convergence. Through systematic experimentation across 24 benchmark functions in multiple dimensions, we establish practical guidelines for topology selection and parameter configuration. These findings uncover the black-box nature of PSO, providing more transparency and interpretability to swarm intelligence systems. The source code is available at https://github.com/GitNitin02/ioh_pso.
comment: Upated: 29-10-25
♻ ☆ LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities
Generative models are spearheading recent progress in deep learning, showcasing strong promise for trajectory sampling in dynamical systems as well. However, whereas latent space modeling paradigms have transformed image and video generation, similar approaches are more difficult for most dynamical systems. Such systems -- from chemical molecule structures to collective human behavior -- are described by interactions of entities, making them inherently linked to connectivity patterns, entity conservation, and the traceability of entities over time. Our approach, LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked Entities), bridges the gap between: (1) keeping the traceability of individual entities in a latent system representation, and (2) leveraging the efficiency and scalability of recent advances in image and video generation, where pre-trained encoder and decoder enable generative modeling directly in latent space. The core idea of LaM-SLidE is the introduction of identifier representations (IDs) that enable the retrieval of entity properties and entity composition from latent system representations, thus fostering traceability. Experimentally, across different domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy, and generalizability. Code is available at https://github.com/ml-jku/LaM-SLidE .
comment: Project page: https://ml-jku.github.io/LaM-SLidE/
♻ ☆ MDPs with a State Sensing Cost SP
In many practical sequential decision-making problems, tracking the state of the environment incurs a sensing/communication/computation cost. In these settings, the agent's interaction with its environment includes the additional component of deciding when to sense the state, in a manner that balances the value associated with optimal (state-specific) actions and the cost of sensing. We formulate this as an expected discounted cost Markov Decision Process (MDP), wherein the agent incurs an additional cost for sensing its next state, but has the option to take actions while remaining `blind' to the system state. We pose this problem as a classical discounted cost MDP with an expanded (countably infinite) state space. While computing the optimal policy for this MDP is intractable in general, we derive lower bounds on the optimal value function, which allow us to bound the suboptimality gap of any policy. We also propose a computationally efficient algorithm SPI, based on policy improvement, which in practice performs close to the optimal policy. Finally, we benchmark against the state-of-the-art via a numerical case study.
comment: This revision adds a real-world healthcare benchmark and provides a more detailed explanation of SPI
♻ ☆ Hypergraph clustering using Ricci curvature: an edge transport perspective
In this paper, we introduce a novel method for extending Ricci flow to hypergraphs by defining probability measures on the edges and transporting them on the line expansion. This approach yields a new weighting on the edges, which proves particularly effective for community detection. We extensively compare this method with a similar notion of Ricci flow defined on the clique expansion, demonstrating its enhanced sensitivity to the hypergraph structure, especially in the presence of large hyperedges. The two methods are complementary and together form a powerful and highly interpretable framework for community detection in hypergraphs.
♻ ☆ Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both the pass@1 and pass@k . Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.
♻ ☆ Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning
The heterogeneity of multimodal data leads to inconsistencies and imbalance, allowing a dominant modality to steer gradient updates. Existing solutions mainly focus on optimization- or data-based strategies but rarely exploit the information inherent in multimodal imbalance or conduct its quantitative analysis. To address this gap, we propose a novel quantitative analysis framework for Multimodal Imbalance and design a sample-level adaptive loss function. We define the Modality Gap as the Softmax score difference between modalities for the correct class and model its distribution using a bimodal Gaussian Mixture Model(GMM), representing balanced and imbalanced samples. Using Bayes' theorem, we estimate each sample's posterior probability of belonging to these two groups. Based on this, our adaptive loss (1) minimizes the overall Modality Gap, (2) aligns imbalanced samples with balanced ones, and (3) adaptively penalizes each according to its imbalance degree. A two-stage training strategy-warm-up and adaptive phases,yields state-of-the-art performance on CREMA-D (80.65%), AVE (70.40%), and KineticSound (72.42%). Fine-tuning with high-quality samples identified by the GMM further improves results, highlighting their value for effective multimodal fusion.
♻ ☆ Enlightenment Period Improving DNN Performance
The start of deep neural network training is characterized by a brief yet critical phase that lasts from the beginning of the training until the accuracy reaches approximately 50\%. During this phase, disordered representations rapidly transition toward ordered structure, and we term this phase the Enlightenment Period. Through theoretical modeling based on phase transition theory and experimental validation, we reveal that applying Mixup data augmentation during this phase has a dual effect: it introduces a Gradient Interference Effect that hinders performance, while also providing a beneficial Activation Revival Effect to restore gradient updates for saturated neurons. We further demonstrate that this negative interference diminishes as the sample set size or the model parameter size increases, thereby shifting the balance between these two effects. Based on these findings, we propose three strategies that improve performance by solely adjusting the training data distribution within this brief period: the Mixup Pause Strategy for small-scale scenarios, the Alpha Boost Strategy for large-scale scenarios with underfitting, and the High-Loss Removal Strategy for tasks where Mixup is inapplicable (e.g., time series and large language models). Extensive experiments show that these strategies achieve superior performance across diverse architectures such as ViT and ResNet on datasets including CIFAR and ImageNet-1K. Ultimately, this work offers a novel perspective on enhancing model performance by strategically capitalizing on the dynamics of the brief and crucial early stages of training. Code is available at https://anonymous.4open.science/r/code-A5F1/.
♻ ☆ NeuroPathNet: Dynamic Path Trajectory Learning for Brain Functional Connectivity Analysis
Understanding the evolution of brain functional networks over time is of great significance for the analysis of cognitive mechanisms and the diagnosis of neurological diseases. Existing methods often have difficulty in capturing the temporal evolution characteristics of connections between specific functional communities. To this end, this paper proposes a new path-level trajectory modeling framework (NeuroPathNet) to characterize the dynamic behavior of connection pathways between brain functional partitions. Based on medically supported static partitioning schemes (such as Yeo and Smith ICA), we extract the time series of connection strengths between each pair of functional partitions and model them using a temporal neural network. We validate the model performance on three public functional Magnetic Resonance Imaging (fMRI) datasets, and the results show that it outperforms existing mainstream methods in multiple indicators. This study can promote the development of dynamic graph learning methods for brain network analysis, and provide possible clinical applications for the diagnosis of neurological diseases.
♻ ☆ Artificial Intelligence for Direct Prediction of Molecular Dynamics Across Chemical Space
Molecular dynamics (MD) is a powerful tool for exploring the behavior of atomistic systems, but its reliance on sequential numerical integration limits simulation efficiency. We present a novel neural network architecture, MDtrajNet, and a pre-trained foundational model, MDtrajNet-1, that directly generates MD trajectories across chemical space, bypassing force calculations and integration. This approach accelerates simulations by up to two orders of magnitude compared to traditional MD, even those enhanced by machine-learning interatomic potentials. MDtrajNet combines equivariant neural networks with a transformer-based architecture to achieve strong accuracy and transferability in predicting long-time trajectories. Remarkably, the errors of the trajectories generated by MDtrajNet-1 for various known and unseen molecular systems are close to those of the conventional ab initio MD. The architecture's flexible design supports diverse application scenarios, including different statistical ensembles, boundary conditions, and interaction types. By overcoming the intrinsic speed barrier of conventional MD, MDtrajNet opens new frontiers in efficient and scalable atomistic simulations.
♻ ☆ DeepRTE: Pre-trained Attention-based Neural Network for Radiative Transfer
In this paper, we propose a novel neural network approach, termed DeepRTE, to address the steady-state Radiative Transfer Equation (RTE). The RTE is a differential-integral equation that governs the propagation of radiation through a participating medium, with applications spanning diverse domains such as neutron transport, atmospheric radiative transfer, heat transfer, and optical imaging. Our DeepRTE framework demonstrates superior computational efficiency for solving the steady-state RTE, surpassing traditional methods and existing neural network approaches. This efficiency is achieved by embedding physical information through derivation of the RTE and mathematically-informed network architecture. Concurrently, DeepRTE achieves high accuracy with significantly fewer parameters, largely due to its incorporation of mechanisms such as multi-head attention. Furthermore, DeepRTE is a mesh-free neural operator framework with inherent zero-shot capability. This is achieved by incorporating Green's function theory and pre-training with delta-function inflow boundary conditions into both its architecture design and training data construction. The efficacy of the proposed approach is substantiated through comprehensive numerical experiments.
♻ ☆ Pentest-R1: Towards Autonomous Penetration Testing Reasoning Optimized via Two-Stage Reinforcement Learning
Automating penetration testing is crucial for enhancing cybersecurity, yet current Large Language Models (LLMs) face significant limitations in this domain, including poor error handling, inefficient reasoning, and an inability to perform complex end-to-end tasks autonomously. To address these challenges, we introduce Pentest-R1, a novel framework designed to optimize LLM reasoning capabilities for this task through a two-stage reinforcement learning pipeline. We first construct a dataset of over 500 real-world, multi-step walkthroughs, which Pentest-R1 leverages for offline reinforcement learning (RL) to instill foundational attack logic. Subsequently, the LLM is fine-tuned via online RL in an interactive Capture The Flag (CTF) environment, where it learns directly from environmental feedback to develop robust error self-correction and adaptive strategies. Our extensive experiments on the Cybench and AutoPenBench benchmarks demonstrate the framework's effectiveness. On AutoPenBench, Pentest-R1 achieves a 24.2\% success rate, surpassing most state-of-the-art models and ranking second only to Gemini 2.5 Flash. On Cybench, it attains a 15.0\% success rate in unguided tasks, establishing a new state-of-the-art for open-source LLMs and matching the performance of top proprietary models. Ablation studies confirm that the synergy of both training stages is critical to its success.
♻ ☆ S'MoRE: Structural Mixture of Residual Experts for Parameter-Efficient LLM Fine-tuning NeurIPS 2025
Fine-tuning pre-trained large language models (LLMs) presents a dual challenge of balancing parameter efficiency and model capacity. Existing methods like low-rank adaptations (LoRA) are efficient but lack flexibility, while Mixture-of-Experts (MoE) enhance model capacity at the cost of more & under-utilized parameters. To address these limitations, we propose Structural Mixture of Residual Experts (S'MoRE), a novel framework that seamlessly integrates the efficiency of LoRA with the flexibility of MoE. Conceptually, S'MoRE employs hierarchical low-rank decomposition of expert weights, yielding residuals of varying orders interconnected in a multi-layer structure. By routing input tokens through sub-trees of residuals, S'MoRE emulates the capacity of numerous experts by instantiating and assembling just a few low-rank matrices. We craft the inter-layer propagation of S'MoRE's residuals as a special type of Graph Neural Network (GNN), and prove that under similar parameter budget, S'MoRE improves structural flexibility of traditional MoE (or Mixture-of-LoRA) by exponential order. Comprehensive theoretical analysis and empirical results demonstrate that S'MoRE achieves superior fine-tuning performance, offering a transformative approach for efficient LLM adaptation. Our implementation is available at: https://github.com/ZimpleX/SMoRE-LLM.
comment: NeurIPS 2025
♻ ☆ Learning single-index models via harmonic decomposition NeurIPS 2025
We study the problem of learning single-index models, where the label $y \in \mathbb{R}$ depends on the input $\boldsymbol{x} \in \mathbb{R}^d$ only through an unknown one-dimensional projection $\langle \boldsymbol{w}_*,\boldsymbol{x}\rangle$. Prior work has shown that under Gaussian inputs, the statistical and computational complexity of recovering $\boldsymbol{w}_*$ is governed by the Hermite expansion of the link function. In this paper, we propose a new perspective: we argue that $spherical$ $harmonics$ -- rather than $Hermite$ $polynomials$ -- provide the natural basis for this problem, as they capture its intrinsic $rotational$ $symmetry$. Building on this insight, we characterize the complexity of learning single-index models under arbitrary spherically symmetric input distributions. We introduce two families of estimators -- based on tensor unfolding and online SGD -- that respectively achieve either optimal sample complexity or optimal runtime, and argue that estimators achieving both may not exist in general. When specialized to Gaussian inputs, our theory not only recovers and clarifies existing results but also reveals new phenomena that had previously been overlooked.
comment: 84 pages. Comments are welcome. To appear at NeurIPS 2025
♻ ☆ Plexus: Taming Billion-edge Graphs with 3D Parallel Full-graph GNN Training
Graph neural networks (GNNs) leverage the connectivity and structure of real-world graphs to learn intricate properties and relationships between nodes. Many real-world graphs exceed the memory capacity of a GPU due to their sheer size, and training GNNs on such graphs requires techniques such as mini-batch sampling to scale. The alternative approach of distributed full-graph training suffers from high communication overheads and load imbalance due to the irregular structure of graphs. We propose a three-dimensional (3D) parallel approach for full-graph training that tackles these issues and scales to billion-edge graphs. In addition, we introduce optimizations such as a double permutation scheme for load balancing, and a performance model to predict the optimal 3D configuration of our parallel implementation -- Plexus. We evaluate Plexus on six different graph datasets and show scaling results on up to 2048 GPUs of Perlmutter, and 1024 GPUs of Frontier. Plexus achieves unprecedented speedups of 2.3-12.5x over prior state of the art, and a reduction in time-to-solution by 5.2-8.7x on Perlmutter and 7.0-54.2x on Frontier.
♻ ☆ Perturbing the Derivative: Wild Refitting for Model-Free Evaluation of Machine Learning Models under Bregman Losses
We study the excess risk evaluation of classical penalized empirical risk minimization (ERM) with Bregman losses. We show that by leveraging the idea of wild refitting, one can efficiently upper bound the excess risk through the so-called "wild optimism," without relying on the global structure of the underlying function class. This property makes our approach inherently model-free. Unlike conventional analysis, our framework operates with just one dataset and black-box access to the training procedure. The method involves randomized Rademacher symmetrization and constructing artificially modified outputs by perturbation in the derivative space with appropriate scaling, upon which we retrain a second predictor for excess risk estimation. We establish high-probability performance guarantees both under the fixed design setting and the random design setting, demonstrating that wild refitting under Bregman losses, with an appropriately chosen wild noise scale, yields a valid upper bound on the excess risk. Thus, our work is promising for theoretically evaluating modern opaque ML models, such as deep neural networks and generative models, where the function class is too complex for classical learning theory and empirical process techniques.
♻ ☆ Stochastic Momentum Methods for Non-smooth Non-Convex Finite-Sum Coupled Compositional Optimization
Finite-sum Coupled Compositional Optimization (FCCO), characterized by its coupled compositional objective structure, emerges as an important optimization paradigm for addressing a wide range of machine learning problems. In this paper, we focus on a challenging class of non-convex non-smooth FCCO, where the outer functions are non-smooth weakly convex or convex and the inner functions are smooth or weakly convex. Existing state-of-the-art result face two key limitations: (1) a high iteration complexity of $O(1/\epsilon^6)$ under the assumption that the stochastic inner functions are Lipschitz continuous in expectation; (2) reliance on vanilla SGD-type updates, which are not suitable for deep learning applications. Our main contributions are two fold: (i) We propose stochastic momentum methods tailored for non-smooth FCCO that come with provable convergence guarantees; (ii) We establish a new state-of-the-art iteration complexity of $O(1/\epsilon^5)$. Moreover, we apply our algorithms to multiple inequality constrained non-convex optimization problems involving smooth or weakly convex functional inequality constraints. By optimizing a smoothed hinge penalty based formulation, we achieve a new state-of-the-art complexity of $O(1/\epsilon^5)$ for finding an (nearly) $\epsilon$-level KKT solution. Experiments on three tasks demonstrate the effectiveness of the proposed algorithms.
comment: Accepted to Neurips 2025
♻ ☆ Tensor Decomposition Networks for Fast Machine Learning Interatomic Potential Computations
$\rm{SO}(3)$-equivariant networks are the dominant models for machine learning interatomic potentials (MLIPs). The key operation of such networks is the Clebsch-Gordan (CG) tensor product, which is computationally expensive. To accelerate the computation, we develop tensor decomposition networks (TDNs) as a class of approximately equivariant networks in which CG tensor products are replaced by low-rank tensor decompositions, such as the CANDECOMP/PARAFAC (CP) decomposition. With the CP decomposition, we prove (i) a uniform bound on the induced error of $\rm{SO}(3)$-equivariance, and (ii) the universality of approximating any equivariant bilinear map. To further reduce the number of parameters, we propose path-weight sharing that ties all multiplicity-space weights across the $\mathcal{O}(L^3)$ CG paths into a single path without compromising equivariance, where $L$ is the maximum angular degree. The resulting layer acts as a plug-and-play replacement for tensor products in existing networks, and the computational complexity of tensor products is reduced from $\mathcal{O}(L^6)$ to $\mathcal{O}(L^4)$. We evaluate TDNs on PubChemQCR, a newly curated molecular relaxation dataset containing 105 million DFT-calculated snapshots. We also use existing datasets, including OC20, and OC22. Results show that TDNs achieve competitive performance with dramatic speedup in computations. Our code is publicly available as part of the AIRS library (\href{https://github.com/divelab/AIRS/tree/main/OpenMol/TDN}{https://github.com/divelab/AIRS/}).
♻ ☆ Path-specific effects for pulse-oximetry guided decisions in critical care
Identifying and measuring biases associated with sensitive attributes is a crucial consideration in healthcare to prevent treatment disparities. One prominent issue is inaccurate pulse oximeter readings, which tend to overestimate oxygen saturation for dark-skinned patients and misrepresent supplemental oxygen needs. Most existing research has revealed statistical disparities linking device measurement errors to patient outcomes in intensive care units (ICUs) without causal formalization. This study causally investigates how racial discrepancies in oximetry measurements affect invasive ventilation in ICU settings. We employ a causal inference-based approach using path-specific effects to isolate the impact of bias by race on clinical decision-making. To estimate these effects, we leverage a doubly robust estimator, propose its self-normalized variant for improved sample efficiency, and provide novel finite-sample guarantees. Our methodology is validated on semi-synthetic data and applied to two large real-world health datasets: MIMIC-IV and eICU. Contrary to prior work, our analysis reveals minimal impact of racial discrepancies on invasive ventilation rates. However, path-specific effects mediated by oxygen saturation disparity are more pronounced on ventilation duration, and the severity differs by dataset. Our work provides a novel pipeline for investigating potential disparities in clinical decision-making and, more importantly, highlights the necessity of causal methods to robustly assess fairness in healthcare.
♻ ☆ FedCLF -- Towards Efficient Participant Selection for Federated Learning in Heterogeneous IoV Networks
Federated Learning (FL) is a distributed machine learning technique that preserves data privacy by sharing only the trained parameters instead of the client data. This makes FL ideal for highly dynamic, heterogeneous, and time-critical applications, in particular, the Internet of Vehicles (IoV) networks. However, FL encounters considerable challenges in such networks owing to the high data and device heterogeneity. To address these challenges, we propose FedCLF, i.e., FL with Calibrated Loss and Feedback control, which introduces calibrated loss as a utility in the participant selection process and a feedback control mechanism to dynamically adjust the sampling frequency of the clients. The envisaged approach (a) enhances the overall model accuracy in case of highly heterogeneous data and (b) optimizes the resource utilization for resource constrained IoV networks, thereby leading to increased efficiency in the FL process. We evaluated FedCLF vis-\`a-vis baseline models, i.e., FedAvg, Newt, and Oort, using CIFAR-10 dataset with varying data heterogeneity. Our results depict that FedCLF significantly outperforms the baseline models by up to a 16% improvement in high data heterogeneity-related scenarios with improved efficiency via reduced sampling frequency.
comment: Already published in ADMA 2024 on 13th December 2024 Wijethilake, K.E., Mahmood, A., Sheng, Q.Z. (2025). FedCLF - Towards Efficient Participant Selection for Federated Learning in Heterogeneous IoV Networks. In: Sheng, Q.Z., et al. Advanced Data Mining and Applications. ADMA 2024. Lecture Notes in Computer Science(), vol 15388. Springer, Singapore. https://doi.org/10.1007/978-981-96-0814-0_15
♻ ☆ Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration NeurIPS2025
Heterogeneous Large Language Model (LLM) fusion integrates the strengths of multiple source LLMs with different architectures into a target LLM with low computational overhead. While promising, existing methods suffer from two major limitations: 1) reliance on real data from limited domain for knowledge fusion, preventing the target LLM from fully acquiring knowledge across diverse domains, and 2) fixed data allocation proportions across domains, failing to dynamically adjust according to the target LLM's varying capabilities across domains, leading to a capability imbalance. To overcome these limitations, we propose Bohdi, a synthetic-data-only heterogeneous LLM fusion framework. Through the organization of knowledge domains into a hierarchical tree structure, Bohdi enables automatic domain exploration and multi-domain data generation through multi-model collaboration, thereby comprehensively extracting knowledge from source LLMs. By formalizing domain expansion and data sampling proportion allocation on the knowledge tree as a Hierarchical Multi-Armed Bandit problem, Bohdi leverages the designed DynaBranches mechanism to adaptively adjust sampling proportions based on the target LLM's performance feedback across domains. Integrated with our proposed Introspection-Rebirth (IR) mechanism, DynaBranches dynamically tracks capability shifts during target LLM's updates via Sliding Window Binomial Likelihood Ratio Testing (SWBLRT), further enhancing its online adaptation capability. Comparative experimental results on a comprehensive suite of benchmarks demonstrate that Bohdi significantly outperforms existing baselines on multiple target LLMs, exhibits higher data efficiency, and virtually eliminates the imbalance in the target LLM's capabilities. Our code is available at https://github.com/gjq100/Bohdi.git.
comment: Accepted by NeurIPS2025
♻ ☆ Symplectic Generative Networks (SGNs): A Hamiltonian Framework for Invertible Deep Generative Modeling
We introduce the \emph{Symplectic Generative Network (SGN)}, a deep generative model that leverages Hamiltonian mechanics to construct an invertible, volume-preserving mapping between a latent space and the data space. By endowing the latent space with a symplectic structure and modeling data generation as the time evolution of a Hamiltonian system, SGN achieves exact likelihood evaluation without incurring the computational overhead of Jacobian determinant calculations. In this work, we provide a rigorous mathematical foundation for SGNs through a comprehensive theoretical framework that includes: (i) complete proofs of invertibility and volume preservation, (ii) a formal complexity analysis with theoretical comparisons to Variational Autoencoders and Normalizing Flows, (iii) strengthened universal approximation results with quantitative error bounds, (iv) an information-theoretic analysis based on the geometry of statistical manifolds, and (v) an extensive stability analysis with adaptive integration guarantees. These contributions highlight the fundamental advantages of SGNs and establish a solid foundation for future empirical investigations and applications to complex, high-dimensional data.
comment: Submitted
♻ ☆ AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets
Background: Development of artificial intelligence (AI) models for lung cancer screening requires large, well-annotated low-dose computed tomography (CT) datasets and rigorous performance benchmarks. Purpose: To create a reproducible benchmarking resource leveraging the Duke Lung Cancer Screening (DLCS) and multiple public datasets to develop and evaluate models for nodule detection and classification. Materials & Methods: This retrospective study uses the DLCS dataset (1,613 patients; 2,487 nodules) and external datasets including LUNA16, LUNA25, and NLST-3D. For detection, MONAI RetinaNet models were trained on DLCS (DLCS-De) and LUNA16 (LUNA16-De) and evaluated using the Competition Performance Metric (CPM). For nodule-level classification, we compare five strategies: pretrained models (Models Genesis, Med3D), a self-supervised foundation model (FMCB), and ResNet50 with random initialization versus Strategic Warm-Start (ResNet50-SWS) pretrained with detection-derived candidate patches stratified by confidence. Results: For detection on the DLCS test set, DLCS-De achieved sensitivity 0.82 at 2 false positives/scan (CPM 0.63) versus LUNA16-De (0.62, CPM 0.45). For external validation on NLST-3D, DLCS-De (sensitivity 0.72, CPM 0.58) also outperformed LUNA16-De (sensitivity 0.64, CPM 0.49). For classification across multiple datasets, ResNet50-SWS attained AUCs of 0.71 (DLCS; 95% CI, 0.61-0.81), 0.90 (LUNA16; 0.87-0.93), 0.81 (NLST-3D; 0.79-0.82), and 0.80 (LUNA25; 0.78-0.82), matching or exceeding pretrained/self-supervised baselines. Performance differences reflected dataset label standards. Conclusion: This work establishes a standardized benchmarking resource for lung cancer AI research, supporting model development, validation, and translation. All code, models, and data are publicly released to promote reproducibility.
comment: 2 tables, 5 figures
♻ ☆ When Truthful Representations Flip Under Deceptive Instructions?
Large language models (LLMs) tend to follow maliciously crafted instructions to generate deceptive responses, posing safety challenges. How deceptive instructions alter the internal representations of LLM compared to truthful ones remains poorly understood beyond output analysis. To bridge this gap, we investigate when and how these representations ``flip'', such as from truthful to deceptive, under deceptive versus truthful/neutral instructions. Analyzing the internal representations of Llama-3.1-8B-Instruct and Gemma-2-9B-Instruct on a factual verification task, we find the model's instructed True/False output is predictable via linear probes across all conditions based on the internal representation. Further, we use Sparse Autoencoders (SAEs) to show that the Deceptive instructions induce significant representational shifts compared to Truthful/Neutral representations (which are similar), concentrated in early-to-mid layers and detectable even on complex datasets. We also identify specific SAE features highly sensitive to deceptive instruction and use targeted visualizations to confirm distinct truthful/deceptive representational subspaces. % Our analysis pinpoints layer-wise and feature-level correlates of instructed dishonesty, offering insights for LLM detection and control. Our findings expose feature- and layer-level signatures of deception, offering new insights for detecting and mitigating instructed dishonesty in LLMs.
♻ ☆ Transfer Learning for Kernel-based Regression
In recent years, transfer learning has garnered significant attention. Its ability to leverage knowledge from related studies to improve generalization performance in a target study has made it highly appealing. This paper focuses on investigating the transfer learning problem within the context of nonparametric regression over a reproducing kernel Hilbert space. The aim is to bridge the gap between practical effectiveness and theoretical guarantees. We specifically consider two scenarios: one where the transferable sources are known and another where they are unknown. For the known transferable source case, we propose a two-step kernel-based estimator by solely using kernel ridge regression. For the unknown case, we develop a novel method based on an efficient aggregation algorithm, which can automatically detect and alleviate the effects of negative sources. This paper provides the statistical properties of the desired estimators and establishes the minimax rate. Through extensive numerical experiments on synthetic data and real examples, we validate our theoretical findings and demonstrate the effectiveness of our proposed method.
comment: revised version of "Minimax optimal transfer learning for kernel-based nonparametric regression"
♻ ☆ Kolmogorov-Arnold Energy Models: Fast and Interpretable Generative Modeling
Learning an energy-based model (EBM) in the latent space of a top-down generative model offers a powerful framework for generation across many data modalities. However, it remains unclear how its interpretability can be used to guide model design, improve generative quality, and reduce training time. Moreover, the reliance on Langevin Monte Carlo (LMC) sampling presents challenges in efficiency and sampling multimodal latent distributions. We propose a novel adaptation of the Kolmogorov-Arnold representation theorem for generative modeling and introduce the Kolmogorov-Arnold Energy Model (KAEM) to take advantage of structural and inductive biases. By constraining the prior to univariate relationships, KAEM enables fast and exact inference via the inverse transform method. With the low dimensionality of the latent space and suitable inductive biases encoded, we demonstrate that importance sampling (IS) becomes a viable, unbiased, and highly efficient posterior sampler. For domains where IS fails, we introduce a strategy based on population-based LMC, decomposing the posterior into a sequence of annealed distributions to improve LMC mixing. KAEM balances common generative modeling trade-offs, offering fast inference, interpretability, and stable training, while being naturally suited to Zettascale Computing hardware.
Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models
A long-standing goal in AI is to develop agents capable of solving diverse tasks across a range of environments, including those never seen during training. Two dominant paradigms address this challenge: (i) reinforcement learning (RL), which learns policies via trial and error, and (ii) optimal control, which plans actions using a known or learned dynamics model. However, their comparative strengths in the offline setting - where agents must learn from reward-free trajectories - remain underexplored. In this work, we systematically evaluate RL and control-based methods on a suite of navigation tasks, using offline datasets of varying quality. On the RL side, we consider goal-conditioned and zero-shot methods. On the control side, we train a latent dynamics model using the Joint Embedding Predictive Architecture (JEPA) and employ it for planning. We investigate how factors such as data diversity, trajectory quality, and environment variability influence the performance of these approaches. Our results show that model-free RL benefits most from large amounts of high-quality data, whereas model-based planning generalizes better to unseen layouts and is more data-efficient, while achieving trajectory stitching performance comparable to leading model-free methods. Notably, planning with a latent dynamics model proves to be a strong approach for handling suboptimal offline data and adapting to diverse environments.
comment: Project web page: https://latent-planning.github.io/
♻ ☆ Thompson Sampling in Function Spaces via Neural Operators NeurIPS 2025
We propose an extension of Thompson sampling to optimization problems over function spaces where the objective is a known functional of an unknown operator's output. We assume that queries to the operator (such as running a high-fidelity simulator or physical experiment) are costly, while functional evaluations on the operator's output are inexpensive. Our algorithm employs a sample-then-optimize approach using neural operator surrogates. This strategy avoids explicit uncertainty quantification by treating trained neural operators as approximate samples from a Gaussian process (GP) posterior. We derive regret bounds and theoretical results connecting neural operators with GPs in infinite-dimensional settings. Experiments benchmark our method against other Bayesian optimization baselines on functional optimization tasks involving partial differential equations of physical systems, demonstrating better sample efficiency and significant performance gains.
comment: Revised, camera-ready version, accepted at the 39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Linearized Optimal Transport for Analysis of High-Dimensional Point-Cloud and Single-Cell Data
Single-cell technologies generate high-dimensional point clouds of cells, enabling detailed characterization of complex patient states and treatment responses. Yet each patient is represented by an irregular point cloud rather than a simple vector, making it difficult to directly quantify and compare biological differences between individuals. Nonlinear methods such as kernels and neural networks achieve predictive accuracy but act as black boxes, offering little biological interpretability. To address these limitations, we adapt the Linear Optimal Transport (LOT) framework to this setting, embedding irregular point clouds into a fixed-dimensional Euclidean space while preserving distributional structure. This embedding provides a principled linear representation that preserves optimal transport geometry while enabling downstream analysis. It also forms a registration between any two patients, enabling direct comparison of their cellular distributions. Within this space, LOT enables: (i) \textbf{accurate and interpretable classification} of COVID-19 patient states, where classifier weights map back to specific markers and spatial regions driving predictions; and (ii) \textbf{synthetic data generation} for patient-derived organoids, exploiting the linearity of the LOT embedding. LOT barycenters yield averaged cellular profiles representing combined conditions or samples, supporting drug interaction testing. Together, these results establish LOT as a unified framework that bridges predictive performance, interpretability, and generative modeling. By transforming heterogeneous point clouds into structured embeddings directly traceable to the original data, LOT opens new opportunities for understanding immune variation and treatment effects in high-dimensional biological systems.
comment: 11 pages, 5 figures
♻ ☆ Think Then Embed: Generative Context Improves Multimodal Embedding
There is a growing interest in Universal Multimodal Embeddings (UME), where models are required to generate task-specific representations. While recent studies show that Multimodal Large Language Models (MLLMs) perform well on such tasks, they treat MLLMs solely as encoders, overlooking their generative capacity. However, such an encoding paradigm becomes less effective as instructions become more complex and require compositional reasoning. Inspired by the proven effectiveness of chain-of-thought reasoning, we propose a general Think-Then-Embed (TTE) framework for UME, composed of a reasoner and an embedder. The reasoner MLLM first generates reasoning traces that explain complex queries, followed by an embedder that produces representations conditioned on both the original query and the intermediate reasoning. This explicit reasoning step enables more nuanced understanding of complex multimodal instructions. Our contributions are threefold. First, by leveraging a powerful MLLM reasoner, we achieve state-of-the-art performance on the MMEB-V2 benchmark, surpassing proprietary models trained on massive in-house datasets. Second, to reduce the dependency on large MLLM reasoners, we finetune a smaller MLLM reasoner using high-quality embedding-centric reasoning traces, achieving the best performance among open-source models with a 7% absolute gain over recently proposed models. Third, we investigate strategies for integrating the reasoner and embedder into a unified model for improved efficiency without sacrificing performance.
♻ ☆ AutoLibra: Agent Metric Induction from Open-Ended Human Feedback
Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose **AutoLibra**, a framework for agent evaluation, that transforms open-ended human feedback *e.g.* "If you find that the button is disabled, don't click it again", or "This agent has too much autonomy to decide what to do on its own" into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge as evaluators. We further propose two meta metrics to evaluate the alignment of a set of (induced) metrics with open feedback: "coverage" and "redundancy". Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and discover new metrics to analyze agents. We also present two applications of AutoLibra in agent improvement: First, we show that AutoLibra serve human prompt engineers for diagonalize agent failures and improve prompts iterative. Moreover, we find that AutoLibra can induce metrics for automatic optimization for agents, which makes agents improve through self-regulation. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
comment: https://github.com/Open-Social-World/autolibra
♻ ☆ Time Weaver: A Conditional Time Series Generation Model
Imagine generating a city's electricity demand pattern based on weather, the presence of an electric vehicle, and location, which could be used for capacity planning during a winter freeze. Such real-world time series are often enriched with paired heterogeneous contextual metadata (e.g., weather and location). Current approaches to time series generation often ignore this paired metadata. Additionally, the heterogeneity in metadata poses several practical challenges in adapting existing conditional generation approaches from the image, audio, and video domains to the time series domain. To address this gap, we introduce TIME WEAVER, a novel diffusion-based model that leverages the heterogeneous metadata in the form of categorical, continuous, and even time-variant variables to significantly improve time series generation. Additionally, we show that naive extensions of standard evaluation metrics from the image to the time series domain are insufficient. These metrics do not penalize conditional generation approaches for their poor specificity in reproducing the metadata-specific features in the generated time series. Thus, we innovate a novel evaluation metric that accurately captures the specificity of conditional generation and the realism of the generated time series. We show that TIME WEAVER outperforms state-of-the-art benchmarks, such as Generative Adversarial Networks (GANs), by up to 30% in downstream classification tasks on real-world energy, medical, air quality, and traffic datasets.
♻ ☆ Random pairing MLE for estimation of item parameters in Rasch model
The Rasch model, a classical model in the item response theory, is widely used in psychometrics to model the relationship between individuals' latent traits and their binary responses to assessments or questionnaires. In this paper, we introduce a new likelihood-based estimator -- random pairing maximum likelihood estimator ($\mathrm{RP\text{-}MLE}$) and its bootstrapped variant multiple random pairing MLE ($\mathrm{MRP\text{-}MLE}$) which faithfully estimate the item parameters in the Rasch model. The new estimators have several appealing features compared to existing ones. First, both work for sparse observations, an increasingly important scenario in the big data era. Second, both estimators are provably minimax optimal in terms of finite sample $\ell_{\infty}$ estimation error. Lastly, both admit precise distributional characterization that allows uncertainty quantification on the item parameters, e.g., construction of confidence intervals for the item parameters. The main idea underlying $\mathrm{RP\text{-}MLE}$ and $\mathrm{MRP\text{-}MLE}$ is to randomly pair user-item responses to form item-item comparisons. This is carefully designed to reduce the problem size while retaining statistical independence. We also provide empirical evidence of the efficacy of the two new estimators using both simulated and real data.
♻ ☆ Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum NeurIPS 2025
Self-Supervised Learning (SSL) has become a powerful solution to extract rich representations from unlabeled data. Yet, SSL research is mostly focused on clean, curated and high-quality datasets. As a result, applying SSL on noisy data remains a challenge, despite being crucial to applications such as astrophysics, medical imaging, geophysics or finance. In this work, we present a fully self-supervised framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning. Our method first trains an SSL denoiser on noisy data, then uses it to construct a denoised-to-noisy data curriculum (i.e., training first on denoised, then noisy samples) for pretraining a SSL backbone (e.g., DINOv2), combined with a teacher-guided regularization that anchors noisy embeddings to their denoised counterparts. This process encourages the model to internalize noise robustness. Notably, the denoiser can be discarded after pretraining, simplifying deployment. On ImageNet-1k with ViT-B under extreme Gaussian noise ($\sigma=255$, SNR = 0.72 dB), our method improves linear probing accuracy by 4.8% over DINOv2, demonstrating that denoiser-free robustness can emerge from noise-aware pretraining. The code is available at https://github.com/wenquanlu/noisy_dinov2.
comment: NeurIPS 2025
♻ ☆ Integrating Genomics into Multimodal EHR Foundation Models
This paper introduces an innovative Electronic Health Record (EHR) foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality, moving beyond traditional EHR-only approaches to build more holistic health profiles. Leveraging the extensive and diverse data from the All of Us (AoU) Research Program, this multimodal framework aims to learn complex relationships between clinical data and genetic predispositions. The methodology extends advancements in generative AI to the EHR foundation model space, enhancing predictive capabilities and interpretability. Evaluation on AoU data demonstrates the model's predictive value for the onset of various conditions, particularly Type 2 Diabetes (T2D), and illustrates the interplay between PRS and EHR data. The work also explores transfer learning for custom classification tasks, showcasing the architecture's versatility and efficiency. This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies, laying the groundwork for more personalized, equitable, and actionable real-world evidence generation in healthcare.
♻ ☆ Constrained Posterior Sampling: Time Series Generation with Hard Constraints
Generating realistic time series samples is crucial for stress-testing models and protecting user privacy by using synthetic data. In engineering and safety-critical applications, these samples must meet certain hard constraints that are domain-specific or naturally imposed by physics or nature. Consider, for example, generating electricity demand patterns with constraints on peak demand times. This can be used to stress-test the functioning of power grids during adverse weather conditions. Existing approaches for generating constrained time series are either not scalable or degrade sample quality. To address these challenges, we introduce Constrained Posterior Sampling (CPS), a diffusion-based sampling algorithm that aims to project the posterior mean estimate into the constraint set after each denoising update. Notably, CPS scales to a large number of constraints ($\sim100$) without requiring additional training. We provide theoretical justifications highlighting the impact of our projection step on sampling. Empirically, CPS outperforms state-of-the-art methods in sample quality and similarity to real time series by around 70\% and 22\%, respectively, on real-world stocks, traffic, and air quality datasets.
♻ ☆ Reward Collapse in Aligning Large Language Models
The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences, which are often represented as rankings of responses to prompts. In this paper, we document the phenomenon of \textit{reward collapse}, an empirical observation where the prevailing ranking-based approach results in an \textit{identical} reward distribution \textit{regardless} of the prompts during the terminal phase of training. This outcome is undesirable as open-ended prompts like ``write a short story about your best friend'' should yield a continuous range of rewards for their completions, while specific prompts like ``what is the capital of New Zealand'' should generate either high or low rewards. Our theoretical investigation reveals that reward collapse is primarily due to the insufficiency of the ranking-based objective function to incorporate prompt-related information during optimization. This insight allows us to derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic regime. To overcome reward collapse, we introduce a prompt-aware optimization scheme that provably admits a prompt-dependent reward distribution within the interpolating regime. Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
comment: Accepted for publication in the Journal of Data Science (JDS), reference JDS1201
♻ ☆ OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models
We consider the problem of text-to-video generation tasks with precise control for various applications such as camera movement control and video-to-video editing. Most methods tacking this problem rely on providing user-defined controls, such as binary masks or camera movement embeddings. In our approach we propose OnlyFlow, an approach leveraging the optical flow firstly extracted from an input video to condition the motion of generated videos. Using a text prompt and an input video, OnlyFlow allows the user to generate videos that respect the motion of the input video as well as the text prompt. This is implemented through an optical flow estimation model applied on the input video, which is then fed to a trainable optical flow encoder. The output feature maps are then injected into the text-to-video backbone model. We perform quantitative, qualitative and user preference studies to show that OnlyFlow positively compares to state-of-the-art methods on a wide range of tasks, even though OnlyFlow was not specifically trained for such tasks. OnlyFlow thus constitutes a versatile, lightweight yet efficient method for controlling motion in text-to-video generation. Models and code will be made available on GitHub and HuggingFace.
comment: 8 pages, 1 supplementary page, 9 figures
♻ ☆ Language Models can Self-Improve at State-Value Estimation for Better Search
Collecting ground-truth rewards or human demonstrations for multi-step reasoning tasks is often prohibitively expensive, particularly in interactive domains such as web tasks. We introduce Self-Taught Lookahead (STL), a reward-free framework that improves language model-based value functions by reasoning explicitly about state transitions. STL can be viewed as a chain-of-thought analogue of the value iteration algorithm: instead of regressing directly on numeric values, a value LLM is trained to simulate a step of lookahead in natural language - predicting the next action, resulting state, and rationale for its value, thereby refining value estimates without any labeled data. This self-supervised procedure yields more accurate state-value predictions, which in turn enable lightweight search algorithms to expand fewer states while maintaining strong performance. Empirically, STL-trained value models built on moderately sized (8B parameter) open-weight LLMs boost web agent success rates by 39%, achieving comparable performance with proprietary models. STL also generalizes to multi-hop QA and math puzzles. We find that STL enables small open-source models to guide efficient search, reducing inference costs by integrating explicit reasoning with value learning.
♻ ☆ Hierarchical Graph Networks for Accurate Weather Forecasting via Lightweight Training
Climate events arise from intricate, multivariate dynamics governed by global-scale drivers, profoundly impacting food, energy, and infrastructure. Yet, accurate weather prediction remains elusive due to physical processes unfolding across diverse spatio-temporal scales, which fixed-resolution methods cannot capture. Hierarchical Graph Neural Networks (HGNNs) offer a multiscale representation, but nonlinear downward mappings often erase global trends, weakening the integration of physics into forecasts. We introduce HiFlowCast and its ensemble variant HiAntFlow, HGNNs that embed physics within a multiscale prediction framework. Two innovations underpin their design: a Latent-Memory-Retention mechanism that preserves global trends during downward traversal, and a Latent-to-Physics branch that integrates PDE solution fields across diverse scales. Our Flow models cut errors by over 5% at 13-day lead times and by 5-8% under 1st and 99th quantile extremes, improving reliability for rare events. Leveraging pretrained model weights, they converge within a single epoch, reducing training cost and their carbon footprint. Such efficiency is vital as the growing scale of machine learning challenges sustainability and limits research accessibility. Code and model weights are in the supplementary materials.
♻ ☆ TabSTAR: A Tabular Foundation Model for Tabular Data with Text Fields NeurIPS 2025
While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees. However, recent advancements are paving the way for Tabular Foundation Models, which can leverage real-world knowledge and generalize across diverse datasets, particularly when the data contains free-text. Although incorporating language model capabilities into tabular tasks has been explored, most existing methods utilize static, target-agnostic textual representations, limiting their effectiveness. We introduce TabSTAR: a Tabular Foundation Model with Semantically Target-Aware Representations. TabSTAR is designed to enable transfer learning on tabular data with textual features, with an architecture free of dataset-specific parameters. It unfreezes a pretrained text encoder and takes as input target tokens, which provide the model with the context needed to learn task-specific embeddings. TabSTAR achieves state-of-the-art performance for both medium- and large-sized datasets across known benchmarks of classification tasks with text features, and its pretraining phase exhibits scaling laws in the number of datasets, offering a pathway for further performance improvements.
comment: Accepted to NeurIPS 2025
♻ ☆ HoGA: Higher-Order Graph Attention via Diversity-Aware k-Hop Sampling
Graphs model latent variable relationships in many real-world systems, and Message Passing Neural Networks (MPNNs) are widely used to learn such structures for downstream tasks. While edge-based MPNNs effectively capture local interactions, their expressive power is theoretically bounded, limiting the discovery of higher-order relationships. We introduce the Higher-Order Graph Attention (HoGA) module, which constructs a k-order attention matrix by sampling subgraphs to maximize diversity among feature vectors. Unlike existing higher-order attention methods that greedily resample similar k-order relationships, HoGA targets diverse modalities in higher-order topology, reducing redundancy and expanding the range of captured substructures. Applied to two single-hop attention models, HoGA achieves at least a 5% accuracy gain on all benchmark node classification datasets and outperforms recent baselines on six of eight datasets. Code is available at https://github.com/TB862/Higher_Order.
♻ ☆ FEDONet : Fourier-Embedded DeepONet for Spectrally Accurate Operator Learning
Deep Operator Networks (DeepONets) have recently emerged as powerful data-driven frameworks for learning nonlinear operators, particularly suited for approximating solutions to partial differential equations. Despite their promising capabilities, the standard implementation of DeepONets, which typically employs fully connected linear layers in the trunk network, can encounter limitations in capturing complex spatial structures inherent to various PDEs. To address this limitation, we introduce Fourier-embedded trunk networks within the DeepONet architecture, leveraging random Fourier feature mappings to enrich spatial representation capabilities. Our proposed Fourier-embedded DeepONet (FEDONet) demonstrates superior performance compared to the traditional DeepONet across a comprehensive suite of PDE-driven datasets, including the two-dimensional Poisson, Burgers', Lorenz-63, Eikonal, Allen-Cahn, and the Kuramoto-Sivashinsky equation. Empirical evaluations of FEDONet consistently show significant improvements in solution reconstruction accuracy, with average relative $L^2$ performance gains ranging between 2-3$\times$ compared to the DeepONet baseline. This study highlights the effectiveness of Fourier embeddings in enhancing neural operator learning, offering a robust and broadly applicable methodology for PDE surrogate modeling.
♻ ☆ On the Impossibility of Retrain Equivalence in Machine Unlearning
Machine unlearning seeks to selectively remove the "influence" of specific training data on a model's outputs. The ideal goal is Retrain Equivalence--behavior identical to a model trained from scratch on only the retained data. This goal was formulated for models trained on i.i.d. data batches, but modern pipelines often involve multi-stage training, with each stage having a distinct data distribution and objective. Examples include LLM fine-tuning for alignment, reasoning ability, etc. Our study shows via theory and experiments that this shift to multi-stage training introduces a fundamental barrier for machine unlearning. The theory indicates that the outcome of local unlearning--methods that only use gradients computed on the forget set--is path-dependent. That is, a model's behavior during unlearning is influenced by the order of its training stages during learning, making it impossible for path-oblivious algorithms to universally achieve Retrain Equivalence. We empirically demonstrate the same phenomenon in LLM post-training across Llama and Qwen models (1B to 14B) with gradient ascent, NPO, and SimNPO local unlearning algorithms. Models fine-tuned via different orderings of identical training stages diverge in behavior during unlearning, with the degradation in GSM8K accuracy after unlearning varying by over 20% across paths. We also observe that some learning paths consistently produce models that unlearn slowly. During unlearning, whether the probability mass gets squeezed into paraphrasing or alternative concepts is also path-dependent. These results consistently show that Retrain Equivalence is an ill-posed target for local unlearning algorithms, so long as the target models are trained in stages. In situations where access to models' training histories is hard, the current work calls for rethinking the definition and desiderata of machine unlearning.
comment: Code available at https://princeton-pli.github.io/impossibility-unlearning/
♻ ☆ Diffusion Map Autoencoder
Diffusion-Map-AutoEncoder (DMAE) pairs a diffusion-map encoder (using the Nystr\"om method) with linear or RBF Gaussian-Process latent mean decoders, yielding closed-form inductive mappings and strong reconstructions.
♻ ☆ CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data
True intelligence hinges on the ability to uncover and leverage hidden causal relations. Despite significant progress in AI and computer vision (CV), there remains a lack of benchmarks for assessing models' abilities to infer latent causality from complex visual data. In this paper, we introduce \textsc{\textbf{Causal3D}}, a novel and comprehensive benchmark that integrates structured data (tables) with corresponding visual representations (images) to evaluate causal reasoning. Designed within a systematic framework, Causal3D comprises 19 3D-scene datasets capturing diverse causal relations, views, and backgrounds, enabling evaluations across scenes of varying complexity. We assess multiple state-of-the-art methods, including classical causal discovery, causal representation learning, and large/vision-language models (LLMs/VLMs). Our experiments show that as causal structures grow more complex without prior knowledge, performance declines significantly, highlighting the challenges even advanced methods face in complex causal scenarios. Causal3D serves as a vital resource for advancing causal reasoning in CV and fostering trustworthy AI in critical domains.
comment: Datasets link: https://huggingface.co/datasets/LLDDSS/Causal3D_Dataset
♻ ☆ On the creation of narrow AI: hierarchy and nonlocality of neural network skills NeurIPS 2025
We study the problem of creating strong, yet narrow, AI systems. While recent AI progress has been driven by the training of large general-purpose foundation models, the creation of smaller models specialized for narrow domains could be valuable for both efficiency and safety. In this work, we explore two challenges involved in creating such systems, having to do with basic properties of how neural networks learn and structure their representations. The first challenge regards when it is possible to train narrow models from scratch. Through experiments on a synthetic task, we find that it is sometimes necessary to train networks on a wide distribution of data to learn certain narrow skills within that distribution. This effect arises when skills depend on each other hierarchically, and training on a broad distribution introduces a curriculum which substantially accelerates learning. The second challenge regards how to transfer particular skills from large general models into small specialized models. We find that model skills are often not perfectly localized to a particular set of prunable components. However, we find that methods based on pruning can still outperform distillation. We investigate the use of a regularization objective to align desired skills with prunable components while unlearning unnecessary skills.
comment: NeurIPS 2025; 20 pages, 13 figures
♻ ☆ Dolphin: A Programmable Framework for Scalable Neurosymbolic Learning
Neurosymbolic learning enables the integration of symbolic reasoning with deep learning but faces significant challenges in scaling to complex symbolic programs, large datasets, or both. We introduce DOLPHIN, a framework that tackles these challenges by supporting neurosymbolic programs in Python, executing complex symbolic reasoning on the CPU while vectorizing probabilistic computations and gradient propagation on the GPU. Across 13 benchmarks spanning tasks over text, image, and video data, with symbolic reasoning features like recursion and black-box functions, DOLPHIN converges to state-of-the-art accuracies on the more complex benchmarks while existing frameworks such as Scallop, ISED, and IndeCateR+ fail to converge within the time limit. On simpler benchmarks, DOLPHIN matches their performance, while achieving these results 1.71x to 62x faster than the baselines. Overall, DOLPHIN advances the scalability of neurosymbolic frameworks, achieving state-of-the-art efficiency and convergence on difficult benchmarks where existing frameworks struggle. The code is published at https://github.com/Dolphin-NeSy/Dolphin.
♻ ☆ Infinite-dimensional Mahalanobis Distance with Applications to Kernelized Novelty Detection
The Mahalanobis distance is a classical tool used to measure the covariance-adjusted distance between points in $\bbR^d$. In this work, we extend the concept of Mahalanobis distance to separable Banach spaces by reinterpreting it as a Cameron-Martin norm associated with a probability measure. This approach leads to a basis-free, data-driven notion of anomaly distance through the so-called variance norm, which can naturally be estimated using empirical measures of a sample. Our framework generalizes the classical $\bbR^d$, functional $(L^2[0,1])^d$, and kernelized settings; importantly, it incorporates non-injective covariance operators. We prove that the variance norm is invariant under invertible bounded linear transformations of the data, extending previous results which are limited to unitary operators. In the Hilbert space setting, we connect the variance norm to the RKHS of the covariance operator, and establish consistency and convergence results for estimation using empirical measures with Tikhonov regularization. Using the variance norm, we introduce the notion of a kernelized nearest-neighbour Mahalanobis distance, and study some of its finite-sample concentration properties. In an empirical study on 12 real-world data sets, we demonstrate that the kernelized nearest-neighbour Mahalanobis distance outperforms the traditional kernelized Mahalanobis distance for multivariate time series novelty detection, using state-of-the-art time series kernels such as the signature, global alignment, and Volterra reservoir kernels.
♻ ☆ Explainable post-training bias mitigation with distribution-based fairness metrics
We develop a novel bias mitigation framework with distribution-based fairness constraints suitable for producing demographically blind and explainable machine-learning models across a wide range of fairness levels. This is accomplished through post-processing, allowing fairer models to be generated efficiently without retraining the underlying model. Our framework, which is based on stochastic gradient descent, can be applied to a wide range of model types, with a particular emphasis on the post-processing of gradient-boosted decision trees. Additionally, we design a broad family of global fairness metrics, along with differentiable and consistent estimators compatible with our framework, building on previous work. We empirically test our methodology on a variety of datasets and compare it with alternative post-processing approaches, including Bayesian search, optimal transport projection, and direct neural network training.
comment: 45 pages, 4 figures
♻ ☆ Neural Atlas Graphs for Dynamic Scene Decomposition and Editing
Learning editable high-resolution scene representations for dynamic scenes is an open problem with applications across the domains from autonomous driving to creative editing - the most successful approaches today make a trade-off between editability and supporting scene complexity: neural atlases represent dynamic scenes as two deforming image layers, foreground and background, which are editable in 2D, but break down when multiple objects occlude and interact. In contrast, scene graph models make use of annotated data such as masks and bounding boxes from autonomous-driving datasets to capture complex 3D spatial relationships, but their implicit volumetric node representations are challenging to edit view-consistently. We propose Neural Atlas Graphs (NAGs), a hybrid high-resolution scene representation, where every graph node is a view-dependent neural atlas, facilitating both 2D appearance editing and 3D ordering and positioning of scene elements. Fit at test-time, NAGs achieve state-of-the-art quantitative results on the Waymo Open Dataset - by 5 dB PSNR increase compared to existing methods - and make environmental editing possible in high resolution and visual quality - creating counterfactual driving scenarios with new backgrounds and edited vehicle appearance. We find that the method also generalizes beyond driving scenes and compares favorably - by more than 7 dB in PSNR - to recent matting and video editing baselines on the DAVIS video dataset with a diverse set of human and animal-centric scenes. Project Page: https://princeton-computational-imaging.github.io/nag/
♻ ☆ Plasticity as the Mirror of Empowerment NeurIPS 2025
Agents are minimally entities that are influenced by their past observations and act to influence future observations. This latter capacity is captured by empowerment, which has served as a vital framing concept across artificial intelligence and cognitive science. This former capacity, however, is equally foundational: In what ways, and to what extent, can an agent be influenced by what it observes? In this paper, we ground this concept in a universal agent-centric measure that we refer to as plasticity, and reveal a fundamental connection to empowerment. Following a set of desiderata on a suitable definition, we define plasticity using a new information-theoretic quantity we call the generalized directed information. We show that this new quantity strictly generalizes the directed information introduced by Massey (1990) while preserving all of its desirable properties. Under this definition, we find that plasticity is well thought of as the mirror of empowerment: The two concepts are defined using the same measure, with only the direction of influence reversed. Our main result establishes a tension between the plasticity and empowerment of an agent, suggesting that agent design needs to be mindful of both characteristics. We explore the implications of these findings, and suggest that plasticity, empowerment, and their relationship are essential to understanding agency
comment: NeurIPS 2025
Graphics 9
☆ FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility.
☆ mitransient: Transient light transport in Mitsuba 3
mitransient is a light transport simulation tool that extends Mitsuba 3 with support for time-resolved simulations. In essence, mitransient extends conventional rendering by adding a temporal dimension which accounts for the time of flight of light. This allows rapid prototyping of novel transient imaging systems without the need of costly or difficult-to-operate hardware. Our code is trivially easy to install through pip, and consists of Python modules that can run both in CPU and GPU by leveraging the JIT capabilities of Mitsuba 3. It provides physically-based simulations of complex phenomena, including a wide variety of realistic materials and participating media such as fog or smoke. In addition, we extend Mitsuba 3's functionality to support time-resolved polarization tracking of light and transient differentiable rendering. Finally, we also include tools that simplify the use of our simulations for non-line-of-sight imaging, enabling realistic scene setups with capture noise to be simulated in just seconds of minutes. Altogether, we hope that mitransient will support the research community in developing novel algorithms for transient imaging.
comment: 3 pages, 3 figures. For further documentation for mitransient see https://mitransient.readthedocs.io
☆ 4-Doodle: Text to 3D Sketches that Move!
We present a novel task: text-to-3D sketch animation, which aims to bring freeform sketches to life in dynamic 3D space. Unlike prior works focused on photorealistic content generation, we target sparse, stylized, and view-consistent 3D vector sketches, a lightweight and interpretable medium well-suited for visual communication and prototyping. However, this task is very challenging: (i) no paired dataset exists for text and 3D (or 4D) sketches; (ii) sketches require structural abstraction that is difficult to model with conventional 3D representations like NeRFs or point clouds; and (iii) animating such sketches demands temporal coherence and multi-view consistency, which current pipelines do not address. Therefore, we propose 4-Doodle, the first training-free framework for generating dynamic 3D sketches from text. It leverages pretrained image and video diffusion models through a dual-space distillation scheme: one space captures multi-view-consistent geometry using differentiable B\'ezier curves, while the other encodes motion dynamics via temporally-aware priors. Unlike prior work (e.g., DreamFusion), which optimizes from a single view per step, our multi-view optimization ensures structural alignment and avoids view ambiguity, critical for sparse sketches. Furthermore, we introduce a structure-aware motion module that separates shape-preserving trajectories from deformation-aware changes, enabling expressive motion such as flipping, rotation, and articulated movement. Extensive experiments show that our method produces temporally realistic and structurally stable 3D sketch animations, outperforming existing baselines in both fidelity and controllability. We hope this work serves as a step toward more intuitive and accessible 4D content creation.
☆ Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation
Expressions are fundamental to conveying human emotions. With the rapid advancement of AI-generated content (AIGC), realistic and expressive 3D facial animation has become increasingly crucial. Despite recent progress in speech-driven lip-sync for talking-face animation, generating emotionally expressive talking faces remains underexplored. A major obstacle is the scarcity of real emotional 3D talking-face datasets due to the high cost of data capture. To address this, we model facial animation driven by both speech and emotion as a linear additive problem. Leveraging a 3D talking-face dataset with neutral expressions (VOCAset) and a dataset of 3D expression sequences (Florence4D), we jointly learn a set of blendshapes driven by speech and emotion. We introduce a sparsity constraint loss to encourage disentanglement between the two types of blendshapes while allowing the model to capture inherent secondary cross-domain deformations present in the training data. The learned blendshapes can be further mapped to the expression and jaw pose parameters of the FLAME model, enabling the animation of 3D Gaussian avatars. Qualitative and quantitative experiments demonstrate that our method naturally generates talking faces with specified expressions while maintaining accurate lip synchronization. Perceptual studies further show that our approach achieves superior emotional expressivity compared to existing methods, without compromising lip-sync quality.
comment: 18 pages, 6 figures, accepted to ICXR 2025 conference
☆ Fast and Robust Point Containment Queries on Trimmed Surface
Point containment queries on trimmed surfaces are fundamental to CAD modeling, solid geometry processing, and surface tessellation. Existing approaches such as ray casting and generalized winding numbers often face limitations in robustness and computational efficiency. We propose a fast and numerically stable method for performing containment queries on trimmed surfaces, including those with periodic parameterizations. Our approach introduces a recursive winding number computation scheme that replaces costly curve subdivision with an ellipse-based bound for Bezier segments, enabling linear-time evaluation. For periodic surfaces, we lift trimming curves to the universal covering space, allowing accurate and consistent winding number computation even for non-contractible or discontinuous loops in parameter domain. Experiments show that our method achieves substantial speedups over existing winding-number algorithms while maintaining high robustness in the presence of geometric noise, open boundaries, and periodic topologies. We further demonstrate its effectiveness in processing real B-Rep models and in robust tessellation of trimmed surfaces.
☆ Off-Centered WoS-Type Solvers with Statistical Weighting SIGGRAPH
Stochastic PDE solvers have emerged as a powerful alternative to traditional discretization-based methods for solving partial differential equations (PDEs), especially in geometry processing and graphics. While off-centered estimators enhance sample reuse in WoS-type Monte Carlo solvers, they introduce correlation artifacts and bias when Green's functions are approximated. In this paper, we propose a statistically weighted off-centered WoS-type estimator that leverages local similarity filtering to selectively combine samples across neighboring evaluation points. Our method balances bias and variance through a principled weighting strategy that suppresses unreliable estimators. We demonstrate our approach's effectiveness on various PDEs,including screened Poisson equations and boundary conditions, achieving consistent improvements over existing solvers such as vanilla Walk on Spheres, mean value caching, and boundary value caching. Our method also naturally extends to gradient field estimation and mixed boundary problems.
comment: SIGGRAPH Asia 2025 conference paper
☆ LGCC: Enhancing Flow Matching Based Text-Guided Image Editing with Local Gaussian Coupling and Context Consistency
Recent advancements have demonstrated the great potential of flow matching-based Multimodal Large Language Models (MLLMs) in image editing. However, state-of-the-art works like BAGEL face limitations, including detail degradation, content inconsistency, and inefficiency due to their reliance on random noise initialization. To address these issues, we propose LGCC, a novel framework with two key components: Local Gaussian Noise Coupling (LGNC) and Content Consistency Loss (CCL). LGNC preserves spatial details by modeling target image embeddings and their locally perturbed counterparts as coupled pairs, while CCL ensures semantic alignment between edit instructions and image modifications, preventing unintended content removal. By integrating LGCC with the BAGEL pre-trained model via curriculum learning, we significantly reduce inference steps, improving local detail scores on I2EBench by 1.60% and overall scores by 0.53%. LGCC achieves 3x -- 5x speedup for lightweight editing and 2x for universal editing, requiring only 40% -- 50% of the inference time of BAGEL or Flux. These results demonstrate LGCC's ability to preserve detail, maintain contextual integrity, and enhance inference speed, offering a cost-efficient solution without compromising editing quality.
♻ ☆ HRM^2Avatar: High-Fidelity Real-Time Mobile Avatars from Monocular Phone Scans SIGGRAPH
We present HRM$^2$Avatar, a framework for creating high-fidelity avatars from monocular phone scans, which can be rendered and animated in real time on mobile devices. Monocular capture with smartphones provides a low-cost alternative to studio-grade multi-camera rigs, making avatar digitization accessible to non-expert users. Reconstructing high-fidelity avatars from single-view video sequences poses challenges due to limited visual and geometric data. To address these limitations, at the data level, our method leverages two types of data captured with smartphones: static pose sequences for texture reconstruction and dynamic motion sequences for learning pose-dependent deformations and lighting changes. At the representation level, we employ a lightweight yet expressive representation to reconstruct high-fidelity digital humans from sparse monocular data. We extract garment meshes from monocular data to model clothing deformations effectively, and attach illumination-aware Gaussians to the mesh surface, enabling high-fidelity rendering and capturing pose-dependent lighting. This representation efficiently learns high-resolution and dynamic information from monocular data, enabling the creation of detailed avatars. At the rendering level, real-time performance is critical for animating high-fidelity avatars in AR/VR, social gaming, and on-device creation. Our GPU-driven rendering pipeline delivers 120 FPS on mobile devices and 90 FPS on standalone VR devices at 2K resolution, over $2.7\times$ faster than representative mobile-engine baselines. Experiments show that HRM$^2$Avatar delivers superior visual realism and real-time interactivity, outperforming state-of-the-art monocular methods.
comment: SIGGRAPH Asia 2025, Project Page: https://acennr-engine.github.io/HRM2Avatar
♻ ☆ Neural Atlas Graphs for Dynamic Scene Decomposition and Editing
Learning editable high-resolution scene representations for dynamic scenes is an open problem with applications across the domains from autonomous driving to creative editing - the most successful approaches today make a trade-off between editability and supporting scene complexity: neural atlases represent dynamic scenes as two deforming image layers, foreground and background, which are editable in 2D, but break down when multiple objects occlude and interact. In contrast, scene graph models make use of annotated data such as masks and bounding boxes from autonomous-driving datasets to capture complex 3D spatial relationships, but their implicit volumetric node representations are challenging to edit view-consistently. We propose Neural Atlas Graphs (NAGs), a hybrid high-resolution scene representation, where every graph node is a view-dependent neural atlas, facilitating both 2D appearance editing and 3D ordering and positioning of scene elements. Fit at test-time, NAGs achieve state-of-the-art quantitative results on the Waymo Open Dataset - by 5 dB PSNR increase compared to existing methods - and make environmental editing possible in high resolution and visual quality - creating counterfactual driving scenarios with new backgrounds and edited vehicle appearance. We find that the method also generalizes beyond driving scenes and compares favorably - by more than 7 dB in PSNR - to recent matting and video editing baselines on the DAVIS video dataset with a diverse set of human and animal-centric scenes. Project Page: https://princeton-computational-imaging.github.io/nag/
Robotics 69
☆ Embodying Physical Computing into Soft Robots
Softening and onboarding computers and controllers is one of the final frontiers in soft robotics towards their robustness and intelligence for everyday use. In this regard, embodying soft and physical computing presents exciting potential. Physical computing seeks to encode inputs into a mechanical computing kernel and leverage the internal interactions among this kernel's constituent elements to compute the output. Moreover, such input-to-output evolution can be re-programmable. This perspective paper proposes a framework for embodying physical computing into soft robots and discusses three unique strategies in the literature: analog oscillators, physical reservoir computing, and physical algorithmic computing. These embodied computers enable the soft robot to perform complex behaviors that would otherwise require CMOS-based electronics -- including coordinated locomotion with obstacle avoidance, payload weight and orientation classification, and programmable operation based on logical rules. This paper will detail the working principles of these embodied physical computing methods, survey the current state-of-the-art, and present a perspective for future development.
☆ A Framework for the Systematic Evaluation of Obstacle Avoidance and Object-Aware Controllers
Real-time control is an essential aspect of safe robot operation in the real world with dynamic objects. We present a framework for the analysis of object-aware controllers, methods for altering a robot's motion to anticipate and avoid possible collisions. This framework is focused on three design considerations: kinematics, motion profiles, and virtual constraints. Additionally, the analysis in this work relies on verification of robot behaviors using fundamental robot-obstacle experimental scenarios. To showcase the effectiveness of our method we compare three representative object-aware controllers. The comparison uses metrics originating from the design considerations. From the analysis, we find that the design of object-aware controllers often lacks kinematic considerations, continuity of control points, and stability in movement profiles. We conclude that this framework can be used in the future to design, compare, and benchmark obstacle avoidance methods.
☆ Fare: Failure Resilience in Learned Visual Navigation Control
While imitation learning (IL) enables effective visual navigation, IL policies are prone to unpredictable failures in out-of-distribution (OOD) scenarios. We advance the notion of failure-resilient policies, which not only detect failures but also recover from them automatically. Failure recognition that identifies the factors causing failure is key to informing recovery: e.g. pinpointing image regions triggering failure detections can provide cues to guide recovery. We present Fare, a framework to construct failure-resilient IL policies, embedding OOD-detection and recognition in them without using explicit failure data, and pairing them with recovery heuristics. Real-world experiments show that Fare enables failure recovery across two different policy architectures, enabling robust long-range navigation in complex environments.
☆ Feature Matching-Based Gait Phase Prediction for Obstacle Crossing Control of Powered Transfemoral Prosthesis
For amputees with powered transfemoral prosthetics, navigating obstacles or complex terrain remains challenging. This study addresses this issue by using an inertial sensor on the sound ankle to guide obstacle-crossing movements. A genetic algorithm computes the optimal neural network structure to predict the required angles of the thigh and knee joints. A gait progression prediction algorithm determines the actuation angle index for the prosthetic knee motor, ultimately defining the necessary thigh and knee angles and gait progression. Results show that when the standard deviation of Gaussian noise added to the thigh angle data is less than 1, the method can effectively eliminate noise interference, achieving 100\% accuracy in gait phase estimation under 150 Hz, with thigh angle prediction error being 8.71\% and knee angle prediction error being 6.78\%. These findings demonstrate the method's ability to accurately predict gait progression and joint angles, offering significant practical value for obstacle negotiation in powered transfemoral prosthetics.
comment: 6 pages, conference
☆ Multi-Agent Scenario Generation in Roundabouts with a Transformer-enhanced Conditional Variational Autoencoder
With the increasing integration of intelligent driving functions into serial-produced vehicles, ensuring their functionality and robustness poses greater challenges. Compared to traditional road testing, scenario-based virtual testing offers significant advantages in terms of time and cost efficiency, reproducibility, and exploration of edge cases. We propose a Transformer-enhanced Conditional Variational Autoencoder (CVAE-T) model for generating multi-agent traffic scenarios in roundabouts, which are characterized by high vehicle dynamics and complex layouts, yet remain relatively underexplored in current research. The results show that the proposed model can accurately reconstruct original scenarios and generate realistic, diverse synthetic scenarios. Besides, two Key-Performance-Indicators (KPIs) are employed to evaluate the interactive behavior in the generated scenarios. Analysis of the latent space reveals partial disentanglement, with several latent dimensions exhibiting distinct and interpretable effects on scenario attributes such as vehicle entry timing, exit timing, and velocity profiles. The results demonstrate the model's capability to generate scenarios for the validation of intelligent driving functions involving multi-agent interactions, as well as to augment data for their development and iterative improvement.
☆ GroundLoc: Efficient Large-Scale Outdoor LiDAR-Only Localization
In this letter, we introduce GroundLoc, a LiDAR-only localization pipeline designed to localize a mobile robot in large-scale outdoor environments using prior maps. GroundLoc employs a Bird's-Eye View (BEV) image projection focusing on the perceived ground area and utilizes the place recognition network R2D2, or alternatively, the non-learning approach Scale-Invariant Feature Transform (SIFT), to identify and select keypoints for BEV image map registration. Our results demonstrate that GroundLoc outperforms state-of-the-art methods on the SemanticKITTI and HeLiPR datasets across various sensors. In the multi-session localization evaluation, GroundLoc reaches an Average Trajectory Error (ATE) well below 50 cm on all Ouster OS2 128 sequences while meeting online runtime requirements. The system supports various sensor models, as evidenced by evaluations conducted with Velodyne HDL-64E, Ouster OS2 128, Aeva Aeries II, and Livox Avia sensors. The prior maps are stored as 2D raster image maps, which can be created from a single drive and require only 4 MB of storage per square kilometer. The source code is available at https://github.com/dcmlr/groundloc.
☆ Towards Quadrupedal Jumping and Walking for Dynamic Locomotion using Reinforcement Learning
This paper presents a curriculum-based reinforcement learning framework for training precise and high-performance jumping policies for the robot `Olympus'. Separate policies are developed for vertical and horizontal jumps, leveraging a simple yet effective strategy. First, we densify the inherently sparse jumping reward using the laws of projectile motion. Next, a reference state initialization scheme is employed to accelerate the exploration of dynamic jumping behaviors without reliance on reference trajectories. We also present a walking policy that, when combined with the jumping policies, unlocks versatile and dynamic locomotion capabilities. Comprehensive testing validates walking on varied terrain surfaces and jumping performance that exceeds previous works, effectively crossing the Sim2Real gap. Experimental validation demonstrates horizontal jumps up to 1.25 m with centimeter accuracy and vertical jumps up to 1.0 m. Additionally, we show that with only minor modifications, the proposed method can be used to learn omnidirectional jumping.
comment: 8 pages
☆ Spatiotemporal Calibration of Doppler Velocity Logs for Underwater Robots
The calibration of extrinsic parameters and clock offsets between sensors for high-accuracy performance in underwater SLAM systems remains insufficiently explored. Existing methods for Doppler Velocity Log (DVL) calibration are either constrained to specific sensor configurations or rely on oversimplified assumptions, and none jointly estimate translational extrinsics and time offsets. We propose a Unified Iterative Calibration (UIC) framework for general DVL sensor setups, formulated as a Maximum A Posteriori (MAP) estimation with a Gaussian Process (GP) motion prior for high-fidelity motion interpolation. UIC alternates between efficient GP-based motion state updates and gradient-based calibration variable updates, supported by a provably statistically consistent sequential initialization scheme. The proposed UIC can be applied to IMU, cameras and other modalities as co-sensors. We release an open-source DVL-camera calibration toolbox. Beyond underwater applications, several aspects of UIC-such as the integration of GP priors for MAP-based calibration and the design of provably reliable initialization procedures-are broadly applicable to other multi-sensor calibration problems. Finally, simulations and real-world tests validate our approach.
☆ An Adaptive Inspection Planning Approach Towards Routine Monitoring in Uncertain Environments ICRA 2026
In this work, we present a hierarchical framework designed to support robotic inspection under environment uncertainty. By leveraging a known environment model, existing methods plan and safely track inspection routes to visit points of interest. However, discrepancies between the model and actual site conditions, caused by either natural or human activities, can alter the surface morphology or introduce path obstructions. To address this challenge, the proposed framework divides the inspection task into: (a) generating the initial global view-plan for region of interests based on a historical map and (b) local view replanning to adapt to the current morphology of the inspection scene. The proposed hierarchy preserves global coverage objectives while enabling reactive adaptation to the local surface morphology. This enables the local autonomy to remain robust against environment uncertainty and complete the inspection tasks. We validate the approach through deployments in real-world subterranean mines using quadrupedal robot.
comment: Submitted for ICRA 2026
☆ GeVI-SLAM: Gravity-Enhanced Stereo Visua Inertial SLAM for Underwater Robots
Accurate visual inertial simultaneous localization and mapping (VI SLAM) for underwater robots remains a significant challenge due to frequent visual degeneracy and insufficient inertial measurement unit (IMU) motion excitation. In this paper, we present GeVI-SLAM, a gravity-enhanced stereo VI SLAM system designed to address these issues. By leveraging the stereo camera's direct depth estimation ability, we eliminate the need to estimate scale during IMU initialization, enabling stable operation even under low acceleration dynamics. With precise gravity initialization, we decouple the pitch and roll from the pose estimation and solve a 4 degrees of freedom (DOF) Perspective-n-Point (PnP) problem for pose tracking. This allows the use of a minimal 3-point solver, which significantly reduces computational time to reject outliers within a Random Sample Consensus framework. We further propose a bias-eliminated 4-DOF PnP estimator with provable consistency, ensuring the relative pose converges to the true value as the feature number increases. To handle dynamic motion, we refine the full 6-DOF pose while jointly estimating the IMU covariance, enabling adaptive weighting of the gravity prior. Extensive experiments on simulated and real-world data demonstrate that GeVI-SLAM achieves higher accuracy and greater stability compared to state-of-the-art methods.
☆ Stochastic Prize-Collecting Games: Strategic Planning in Multi-Robot Systems
The Team Orienteering Problem (TOP) generalizes many real-world multi-robot scheduling and routing tasks that occur in autonomous mobility, aerial logistics, and surveillance applications. While many flavors of the TOP exist for planning in multi-robot systems, they assume that all the robots cooperate toward a single objective; thus, they do not extend to settings where the robots compete in reward-scarce environments. We propose Stochastic Prize-Collecting Games (SPCG) as an extension of the TOP to plan in the presence of self-interested robots operating on a graph, under energy constraints and stochastic transitions. A theoretical study on complete and star graphs establishes that there is a unique pure Nash equilibrium in SPCGs that coincides with the optimal routing solution of an equivalent TOP given a rank-based conflict resolution rule. This work proposes two algorithms: Ordinal Rank Search (ORS) to obtain the ''ordinal rank'' --one's effective rank in temporarily-formed local neighborhoods during the games' stages, and Fictitious Ordinal Response Learning (FORL) to obtain best-response policies against one's senior-rank opponents. Empirical evaluations conducted on road networks and synthetic graphs under both dynamic and stationary prize distributions show that 1) the state-aliasing induced by OR-conditioning enables learning policies that scale more efficiently to large team sizes than those trained with the global index, and 2) Policies trained with FORL generalize better to imbalanced prize distributions than those with other multi-agent training methods. Finally, the learned policies in the SPCG achieved between 87% and 95% optimality compared to an equivalent TOP solution obtained by mixed-integer linear programming.
comment: Submitted to IEEE Robotics and Automation Letters
☆ Supervisory Measurement-Guided Noise Covariance Estimation
Reliable state estimation hinges on accurate specification of sensor noise covariances, which weigh heterogeneous measurements. In practice, these covariances are difficult to identify due to environmental variability, front-end preprocessing, and other reasons. We address this by formulating noise covariance estimation as a bilevel optimization that, from a Bayesian perspective, factorizes the joint likelihood of so-called odometry and supervisory measurements, thereby balancing information utilization with computational efficiency. The factorization converts the nested Bayesian dependency into a chain structure, enabling efficient parallel computation: at the lower level, an invariant extended Kalman filter with state augmentation estimates trajectories, while a derivative filter computes analytical gradients in parallel for upper-level gradient updates. The upper level refines the covariance to guide the lower-level estimation. Experiments on synthetic and real-world datasets show that our method achieves higher efficiency over existing baselines.
☆ Sample-efficient and Scalable Exploration in Continuous-Time RL
Reinforcement learning algorithms are typically designed for discrete-time dynamics, even though the underlying real-world control systems are often continuous in time. In this paper, we study the problem of continuous-time reinforcement learning, where the unknown system dynamics are represented using nonlinear ordinary differential equations (ODEs). We leverage probabilistic models, such as Gaussian processes and Bayesian neural networks, to learn an uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily maximizes a weighted sum of the extrinsic reward and model epistemic uncertainty. This yields a scalable and sample-efficient approach to continuous-time model-based RL. We show that COMBRL achieves sublinear regret in the reward-driven setting, and in the unsupervised RL setting (i.e., without extrinsic rewards), we provide a sample complexity bound. In our experiments, we evaluate COMBRL in both standard and unsupervised RL settings and demonstrate that it scales better, is more sample-efficient than prior methods, and outperforms baselines across several deep RL tasks.
comment: 26 pages, 6 figures, 6 tables
☆ Adaptive Surrogate Gradients for Sequential Reinforcement Learning in Spiking Neural Networks
Neuromorphic computing systems are set to revolutionize energy-constrained robotics by achieving orders-of-magnitude efficiency gains, while enabling native temporal processing. Spiking Neural Networks (SNNs) represent a promising algorithmic approach for these systems, yet their application to complex control tasks faces two critical challenges: (1) the non-differentiable nature of spiking neurons necessitates surrogate gradients with unclear optimization properties, and (2) the stateful dynamics of SNNs require training on sequences, which in reinforcement learning (RL) is hindered by limited sequence lengths during early training, preventing the network from bridging its warm-up period. We address these challenges by systematically analyzing surrogate gradient slope settings, showing that shallower slopes increase gradient magnitude in deeper layers but reduce alignment with true gradients. In supervised learning, we find no clear preference for fixed or scheduled slopes. The effect is much more pronounced in RL settings, where shallower slopes or scheduled slopes lead to a 2.1x improvement in both training and final deployed performance. Next, we propose a novel training approach that leverages a privileged guiding policy to bootstrap the learning process, while still exploiting online environment interactions with the spiking policy. Combining our method with an adaptive slope schedule for a real-world drone position control task, we achieve an average return of 400 points, substantially outperforming prior techniques, including Behavioral Cloning and TD3BC, which achieve at most --200 points under the same conditions. This work advances both the theoretical understanding of surrogate gradient learning in SNNs and practical training methodologies for neuromorphic controllers demonstrated in real-world robotic systems.
☆ Flatness-based trajectory planning for 3D overhead cranes with friction compensation and collision avoidance
This paper presents an optimal trajectory generation method for 3D overhead cranes by leveraging differential flatness. This framework enables the direct inclusion of complex physical and dynamic constraints, such as nonlinear friction and collision avoidance for both payload and rope. Our approach allows for aggressive movements by constraining payload swing only at the final point. A comparative simulation study validates our approach, demonstrating that neglecting dry friction leads to actuator saturation and collisions. The results show that friction modeling is a fundamental requirement for fast and safe crane trajectories.
comment: 8 pages, 11 figures
☆ A Hybrid Approach for Visual Multi-Object Tracking
This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2
comment: This work has been submitted to the IEEE for possible publication
☆ GenTrack: A New Generation of Multi-Object Tracking
This paper introduces a novel multi-object tracking (MOT) method, dubbed GenTrack, whose main contributions include: a hybrid tracking approach employing both stochastic and deterministic manners to robustly handle unknown and time-varying numbers of targets, particularly in maintaining target identity (ID) consistency and managing nonlinear dynamics, leveraging particle swarm optimization (PSO) with some proposed fitness measures to guide stochastic particles toward their target distribution modes, enabling effective tracking even with weak and noisy object detectors, integration of social interactions among targets to enhance PSO-guided particles as well as improve continuous updates of both strong (matched) and weak (unmatched) tracks, thereby reducing ID switches and track loss, especially during occlusions, a GenTrack-based redefined visual MOT baseline incorporating a comprehensive state and observation model based on space consistency, appearance, detection confidence, track penalties, and social scores for systematic and efficient target updates, and the first-ever publicly available source-code reference implementation with minimal dependencies, featuring three variants, including GenTrack Basic, PSO, and PSO-Social, facilitating flexible reimplementation. Experimental results have shown that GenTrack provides superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with integrated implementations of baselines for fair comparison. Potential directions for future work are also discussed. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack
comment: This work has been submitted to the IEEE for possible publication
☆ NVSim: Novel View Synthesis Simulator for Large Scale Indoor Navigation
We present NVSim, a framework that automatically constructs large-scale, navigable indoor simulators from only common image sequences, overcoming the cost and scalability limitations of traditional 3D scanning. Our approach adapts 3D Gaussian Splatting to address visual artifacts on sparsely observed floors a common issue in robotic traversal data. We introduce Floor-Aware Gaussian Splatting to ensure a clean, navigable ground plane, and a novel mesh-free traversability checking algorithm that constructs a topological graph by directly analyzing rendered views. We demonstrate our system's ability to generate valid, large-scale navigation graphs from real-world data. A video demonstration is avilable at https://youtu.be/tTiIQt6nXC8
comment: 9 pages, 10 figures
☆ Global-State-Free Obstacle Avoidance for Quadrotor Control in Air-Ground Cooperation
CoNi-MPC provides an efficient framework for UAV control in air-ground cooperative tasks by relying exclusively on relative states, eliminating the need for global state estimation. However, its lack of environmental information poses significant challenges for obstacle avoidance. To address this issue, we propose a novel obstacle avoidance algorithm, Cooperative Non-inertial frame-based Obstacle Avoidance (CoNi-OA), designed explicitly for UAV-UGV cooperative scenarios without reliance on global state estimation or obstacle prediction. CoNi-OA uniquely utilizes a single frame of raw LiDAR data from the UAV to generate a modulation matrix, which directly adjusts the quadrotor's velocity to achieve obstacle avoidance. This modulation-based method enables real-time generation of collision-free trajectories within the UGV's non-inertial frame, significantly reducing computational demands (less than 5 ms per iteration) while maintaining safety in dynamic and unpredictable environments. The key contributions of this work include: (1) a modulation-based obstacle avoidance algorithm specifically tailored for UAV-UGV cooperation in non-inertial frames without global states; (2) rapid, real-time trajectory generation based solely on single-frame LiDAR data, removing the need for obstacle modeling or prediction; and (3) adaptability to both static and dynamic environments, thus extending applicability to featureless or unknown scenarios.
☆ DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation NeurIPS 2025
Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
comment: Accepted to NeurIPS 2025
☆ Can LLMs Translate Human Instructions into a Reinforcement Learning Agent's Internal Emergent Symbolic Representation?
Emergent symbolic representations are critical for enabling developmental learning agents to plan and generalize across tasks. In this work, we investigate whether large language models (LLMs) can translate human natural language instructions into the internal symbolic representations that emerge during hierarchical reinforcement learning. We apply a structured evaluation framework to measure the translation performance of commonly seen LLMs -- GPT, Claude, Deepseek and Grok -- across different internal symbolic partitions generated by a hierarchical reinforcement learning algorithm in the Ant Maze and Ant Fall environments. Our findings reveal that although LLMs demonstrate some ability to translate natural language into a symbolic representation of the environment dynamics, their performance is highly sensitive to partition granularity and task complexity. The results expose limitations in current LLMs capacity for representation alignment, highlighting the need for further research on robust alignment between language and internal agent representations.
☆ Manipulate as Human: Learning Task-oriented Manipulation Skills by Adversarial Motion Priors
In recent years, there has been growing interest in developing robots and autonomous systems that can interact with human in a more natural and intuitive way. One of the key challenges in achieving this goal is to enable these systems to manipulate objects and tools in a manner that is similar to that of humans. In this paper, we propose a novel approach for learning human-style manipulation skills by using adversarial motion priors, which we name HMAMP. The approach leverages adversarial networks to model the complex dynamics of tool and object manipulation, as well as the aim of the manipulation task. The discriminator is trained using a combination of real-world data and simulation data executed by the agent, which is designed to train a policy that generates realistic motion trajectories that match the statistical properties of human motion. We evaluated HMAMP on one challenging manipulation task: hammering, and the results indicate that HMAMP is capable of learning human-style manipulation skills that outperform current baseline methods. Additionally, we demonstrate that HMAMP has potential for real-world applications by performing real robot arm hammering tasks. In general, HMAMP represents a significant step towards developing robots and autonomous systems that can interact with humans in a more natural and intuitive way, by learning to manipulate tools and objects in a manner similar to how humans do.
☆ Blindfolded Experts Generalize Better: Insights from Robotic Manipulation and Videogames
Behavioral cloning is a simple yet effective technique for learning sequential decision-making from demonstrations. Recently, it has gained prominence as the core of foundation models for the physical world, where achieving generalization requires countless demonstrations of a multitude of tasks. Typically, a human expert with full information on the task demonstrates a (nearly) optimal behavior. In this paper, we propose to hide some of the task's information from the demonstrator. This ``blindfolded'' expert is compelled to employ non-trivial exploration to solve the task. We show that cloning the blindfolded expert generalizes better to unseen tasks than its fully-informed counterpart. We conduct experiments of real-world robot peg insertion tasks with (limited) human demonstrations, alongside videogames from the Procgen benchmark. Additionally, we support our findings with theoretical analysis, which confirms that the generalization error scales with $\sqrt{I/m}$, where $I$ measures the amount of task information available to the demonstrator, and $m$ is the number of demonstrated tasks. Both theory and practice indicate that cloning blindfolded experts generalizes better with fewer demonstrated tasks. Project page with videos and code: https://sites.google.com/view/blindfoldedexperts/home
☆ BLM$_1$: A Boundless Large Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning
Multimodal large language models (MLLMs) have advanced vision-language reasoning and are increasingly deployed in embodied agents. However, significant limitations remain: MLLMs generalize poorly across digital-physical spaces and embodiments; vision-language-action models (VLAs) produce low-level actions yet lack robust high-level embodied reasoning; and most embodied large language models (ELLMs) are constrained to digital-space with poor generalization to the physical world. Thus, unified models that operate seamlessly across digital and physical spaces while generalizing across embodiments and tasks remain absent. We introduce the \textbf{Boundless Large Model (BLM$_1$)}, a multimodal spatial foundation model that preserves instruction following and reasoning, incorporates embodied knowledge, and supports robust cross-embodiment control. BLM$_1$ integrates three key capabilities -- \textit{cross-space transfer, cross-task learning, and cross-embodiment generalization} -- via a two-stage training paradigm. Stage I injects embodied knowledge into the MLLM through curated digital corpora while maintaining language competence. Stage II trains a policy module through an intent-bridging interface that extracts high-level semantics from the MLLM to guide control, without fine-tuning the MLLM backbone. This process is supported by a self-collected cross-embodiment demonstration suite spanning four robot embodiments and six progressively challenging tasks. Evaluations across digital and physical benchmarks show that a single BLM$_1$ instance outperforms four model families -- MLLMs, ELLMs, VLAs, and GMLMs -- achieving $\sim\!\textbf{6%}$ gains in digital tasks and $\sim\!\textbf{3%}$ in physical tasks.
☆ LagMemo: Language 3D Gaussian Splatting Memory for Multi-modal Open-vocabulary Multi-goal Visual Navigation
Navigating to a designated goal using visual information is a fundamental capability for intelligent robots. Most classical visual navigation methods are restricted to single-goal, single-modality, and closed set goal settings. To address the practical demands of multi-modal, open-vocabulary goal queries and multi-goal visual navigation, we propose LagMemo, a navigation system that leverages a language 3D Gaussian Splatting memory. During exploration, LagMemo constructs a unified 3D language memory. With incoming task goals, the system queries the memory, predicts candidate goal locations, and integrates a local perception-based verification mechanism to dynamically match and validate goals during navigation. For fair and rigorous evaluation, we curate GOAT-Core, a high-quality core split distilled from GOAT-Bench tailored to multi-modal open-vocabulary multi-goal visual navigation. Experimental results show that LagMemo's memory module enables effective multi-modal open-vocabulary goal localization, and that LagMemo outperforms state-of-the-art methods in multi-goal visual navigation. Project page: https://weekgoodday.github.io/lagmemo
☆ PFEA: An LLM-based High-Level Natural Language Planning and Feedback Embodied Agent for Human-Centered AI
The rapid advancement of Large Language Models (LLMs) has marked a significant breakthrough in Artificial Intelligence (AI), ushering in a new era of Human-centered Artificial Intelligence (HAI). HAI aims to better serve human welfare and needs, thereby placing higher demands on the intelligence level of robots, particularly in aspects such as natural language interaction, complex task planning, and execution. Intelligent agents powered by LLMs have opened up new pathways for realizing HAI. However, existing LLM-based embodied agents often lack the ability to plan and execute complex natural language control tasks online. This paper explores the implementation of intelligent robotic manipulating agents based on Vision-Language Models (VLMs) in the physical world. We propose a novel embodied agent framework for robots, which comprises a human-robot voice interaction module, a vision-language agent module and an action execution module. The vision-language agent itself includes a vision-based task planner, a natural language instruction converter, and a task performance feedback evaluator. Experimental results demonstrate that our agent achieves a 28\% higher average task success rate in both simulated and real environments compared to approaches relying solely on LLM+CLIP, significantly improving the execution success rate of high-level natural language instruction tasks.
☆ ZTRS: Zero-Imitation End-to-end Autonomous Driving with Trajectory Scoring
End-to-end autonomous driving maps raw sensor inputs directly into ego-vehicle trajectories to avoid cascading errors from perception modules and to leverage rich semantic cues. Existing frameworks largely rely on Imitation Learning (IL), which can be limited by sub-optimal expert demonstrations and covariate shift during deployment. On the other hand, Reinforcement Learning (RL) has recently shown potential in scaling up with simulations, but is typically confined to low-dimensional symbolic inputs (e.g. 3D objects and maps), falling short of full end-to-end learning from raw sensor data. We introduce ZTRS (Zero-Imitation End-to-End Autonomous Driving with Trajectory Scoring), a framework that combines the strengths of both worlds: sensor inputs without losing information and RL training for robust planning. To the best of our knowledge, ZTRS is the first framework that eliminates IL entirely by only learning from rewards while operating directly on high-dimensional sensor data. ZTRS utilizes offline reinforcement learning with our proposed Exhaustive Policy Optimization (EPO), a variant of policy gradient tailored for enumerable actions and rewards. ZTRS demonstrates strong performance across three benchmarks: Navtest (generic real-world open-loop planning), Navhard (open-loop planning in challenging real-world and synthetic scenarios), and HUGSIM (simulated closed-loop driving). Specifically, ZTRS achieves the state-of-the-art result on Navhard and outperforms IL-based baselines on HUGSIM. Code will be available at https://github.com/woxihuanjiangguo/ZTRS.
☆ Learning Parameterized Skills from Demonstrations
We present DEPS, an end-to-end algorithm for discovering parameterized skills from expert demonstrations. Our method learns parameterized skill policies jointly with a meta-policy that selects the appropriate discrete skill and continuous parameters at each timestep. Using a combination of temporal variational inference and information-theoretic regularization methods, we address the challenge of degeneracy common in latent variable models, ensuring that the learned skills are temporally extended, semantically meaningful, and adaptable. We empirically show that learning parameterized skills from multitask expert demonstrations significantly improves generalization to unseen tasks. Our method outperforms multitask as well as skill learning baselines on both LIBERO and MetaWorld benchmarks. We also demonstrate that DEPS discovers interpretable parameterized skills, such as an object grasping skill whose continuous arguments define the grasp location.
comment: Neurips 2025
☆ Dynamically-Consistent Trajectory Optimization for Legged Robots via Contact Point Decomposition
To generate reliable motion for legged robots through trajectory optimization, it is crucial to simultaneously compute the robot's path and contact sequence, as well as accurately consider the dynamics in the problem formulation. In this paper, we present a phase-based trajectory optimization that ensures the feasibility of translational dynamics and friction cone constraints throughout the entire trajectory. Specifically, our approach leverages the superposition properties of linear differential equations to decouple the translational dynamics for each contact point, which operates under different phase sequences. Furthermore, we utilize the differentiation matrix of B{\'e}zier polynomials to derive an analytical relationship between the robot's position and force, thereby ensuring the consistent satisfaction of translational dynamics. Additionally, by exploiting the convex closure property of B{\'e}zier polynomials, our method ensures compliance with friction cone constraints. Using the aforementioned approach, the proposed trajectory optimization framework can generate dynamically reliable motions with various gait sequences for legged robots. We validate our framework using a quadruped robot model, focusing on the feasibility of dynamics and motion generation.
comment: 8 pages, 4 figures, IEEE ROBOTICS AND AUTOMATION LETTERS. PREPRINT VERSION. ACCEPTED OCTOBER, 2025
☆ Balanced Collaborative Exploration via Distributed Topological Graph Voronoi Partition
This work addresses the collaborative multi-robot autonomous online exploration problem, particularly focusing on distributed exploration planning for dynamically balanced exploration area partition and task allocation among a team of mobile robots operating in obstacle-dense non-convex environments. We present a novel topological map structure that simultaneously characterizes both spatial connectivity and global exploration completeness of the environment. The topological map is updated incrementally to utilize known spatial information for updating reachable spaces, while exploration targets are planned in a receding horizon fashion under global coverage guidance. A distributed weighted topological graph Voronoi algorithm is introduced implementing balanced graph space partitions of the fused topological maps. Theoretical guarantees are provided for distributed consensus convergence and equitable graph space partitions with constant bounds. A local planner optimizes the visitation sequence of exploration targets within the balanced partitioned graph space to minimize travel distance, while generating safe, smooth, and dynamically feasible motion trajectories. Comprehensive benchmarking against state-of-the-art methods demonstrates significant improvements in exploration efficiency, completeness, and workload balance across the robot team.
☆ Language-Conditioned Representations and Mixture-of-Experts Policy for Robust Multi-Task Robotic Manipulation
Perceptual ambiguity and task conflict limit multitask robotic manipulation via imitation learning. We propose a framework combining a Language-Conditioned Visual Representation (LCVR) module and a Language-conditioned Mixture-ofExperts Density Policy (LMoE-DP). LCVR resolves perceptual ambiguities by grounding visual features with language instructions, enabling differentiation between visually similar tasks. To mitigate task conflict, LMoE-DP uses a sparse expert architecture to specialize in distinct, multimodal action distributions, stabilized by gradient modulation. On real-robot benchmarks, LCVR boosts Action Chunking with Transformers (ACT) and Diffusion Policy (DP) success rates by 33.75% and 25%, respectively. The full framework achieves a 79% average success, outperforming the advanced baseline by 21%. Our work shows that combining semantic grounding and expert specialization enables robust, efficient multi-task manipulation
comment: 8 pages
☆ SynAD: Enhancing Real-World End-to-End Autonomous Driving Models through Synthetic Data Integration
Recent advancements in deep learning and the availability of high-quality real-world driving datasets have propelled end-to-end autonomous driving. Despite this progress, relying solely on real-world data limits the variety of driving scenarios for training. Synthetic scenario generation has emerged as a promising solution to enrich the diversity of training data; however, its application within E2E AD models remains largely unexplored. This is primarily due to the absence of a designated ego vehicle and the associated sensor inputs, such as camera or LiDAR, typically provided in real-world scenarios. To address this gap, we introduce SynAD, the first framework designed to enhance real-world E2E AD models using synthetic data. Our method designates the agent with the most comprehensive driving information as the ego vehicle in a multi-agent synthetic scenario. We further project path-level scenarios onto maps and employ a newly developed Map-to-BEV Network to derive bird's-eye-view features without relying on sensor inputs. Finally, we devise a training strategy that effectively integrates these map-based synthetic data with real driving data. Experimental results demonstrate that SynAD effectively integrates all components and notably enhances safety performance. By bridging synthetic scenario generation and E2E AD, SynAD paves the way for more comprehensive and robust autonomous driving models.
☆ Improved Accuracy of Robot Localization Using 3-D LiDAR in a Hippocampus-Inspired Model
Boundary Vector Cells (BVCs) are a class of neurons in the brains of vertebrates that encode environmental boundaries at specific distances and allocentric directions, playing a central role in forming place fields in the hippocampus. Most computational BVC models are restricted to two-dimensional (2D) environments, making them prone to spatial ambiguities in the presence of horizontal symmetries in the environment. To address this limitation, we incorporate vertical angular sensitivity into the BVC framework, thereby enabling robust boundary detection in three dimensions, and leading to significantly more accurate spatial localization in a biologically-inspired robot model. The proposed model processes LiDAR data to capture vertical contours, thereby disambiguating locations that would be indistinguishable under a purely 2D representation. Experimental results show that in environments with minimal vertical variation, the proposed 3D model matches the performance of a 2D baseline; yet, as 3D complexity increases, it yields substantially more distinct place fields and markedly reduces spatial aliasing. These findings show that adding a vertical dimension to BVC-based localization can significantly enhance navigation and mapping in real-world 3D spaces while retaining performance parity in simpler, near-planar scenarios.
comment: 8 pages, 9 figures, Presented at the 2025 International Joint Conference on Neural Networks, Rome, July 2025
☆ VOCALoco: Viability-Optimized Cost-aware Adaptive Locomotion
Recent advancements in legged robot locomotion have facilitated traversal over increasingly complex terrains. Despite this progress, many existing approaches rely on end-to-end deep reinforcement learning (DRL), which poses limitations in terms of safety and interpretability, especially when generalizing to novel terrains. To overcome these challenges, we introduce VOCALoco, a modular skill-selection framework that dynamically adapts locomotion strategies based on perceptual input. Given a set of pre-trained locomotion policies, VOCALoco evaluates their viability and energy-consumption by predicting both the safety of execution and the anticipated cost of transport over a fixed planning horizon. This joint assessment enables the selection of policies that are both safe and energy-efficient, given the observed local terrain. We evaluate our approach on staircase locomotion tasks, demonstrating its performance in both simulated and real-world scenarios using a quadrupedal robot. Empirical results show that VOCALoco achieves improved robustness and safety during stair ascent and descent compared to a conventional end-to-end DRL policy
comment: Accepted in IEEE Robotics and Automation Letters (RAL), 2025. 8 pages, 9 figures
☆ A Survey on Collaborative SLAM with 3D Gaussian Splatting
This survey comprehensively reviews the evolving field of multi-robot collaborative Simultaneous Localization and Mapping (SLAM) using 3D Gaussian Splatting (3DGS). As an explicit scene representation, 3DGS has enabled unprecedented real-time, high-fidelity render- ing, ideal for robotics. However, its use in multi-robot systems introduces significant challenges in maintaining global consistency, managing communication, and fusing data from heterogeneous sources. We systematically categorize approaches by their architecture-centralized, distributed- and analyze core components like multi-agent consistency and alignment, communication- efficient, Gaussian representation, semantic distillation, fusion and pose optimization, and real- time scalability. In addition, a summary of critical datasets and evaluation metrics is provided to contextualize performance. Finally, we identify key open challenges and chart future research directions, including lifelong mapping, semantic association and mapping, multi-model for robustness, and bridging the Sim2Real gap.
☆ Adaptive-twist Soft Finger Mechanism for Grasping by Wrapping
This paper presents a soft robot finger capable of adaptive-twist deformation to grasp objects by wrapping them. For a soft hand to grasp and pick-up one object from densely contained multiple objects, a soft finger requires the adaptive-twist deformation function in both in-plane and out-of-plane directions. The function allows the finger to be inserted deeply into a limited gap among objects. Once inserted, the soft finger requires appropriate control of grasping force normal to contact surface, thereby maintaining the twisted deformation. In this paper, we refer to this type of grasping as grasping by wrapping. To achieve these two functions by a single actuation source, we propose a variable stiffness mechanism that can adaptively change the stiffness as the pressure is higher. We conduct a finite element analysis (FEA) on the proposed mechanism and determine its design parameter based on the FEA result. Using the developed soft finger, we report basic experimental results and demonstrations on grasping various objects.
☆ A Comprehensive General Model of Tendon-Actuated Concentric Tube Robots with Multiple Tubes and Tendons
Tendon-actuated concentric tube mechanisms combine the advantages of tendon-driven continuum robots and concentric tube robots while addressing their respective limitations. They overcome the restricted degrees of freedom often seen in tendon-driven designs, and mitigate issues such as snapping instability associated with concentric tube robots. However, a complete and general mechanical model for these systems remains an open problem. In this work, we propose a Cosserat rod-based framework for modeling the general case of $n$ concentric tubes, each actuated by $m_i$ tendons, where $i = \{1, \ldots, n\}$. The model allows each tube to twist and elongate while enforcing a shared centerline for bending. We validate the proposed framework through experiments with two-tube and three tube assemblies under various tendon routing configurations, achieving tip prediction errors $<4\%$ of the robot's total length. We further demonstrate the model's generality by applying it to existing robots in the field, where maximum tip deviations remain around $5\%$ of the total length. This model provides a foundation for accurate shape estimation and control of advanced tendon-actuated concentric tube robots.
☆ Defect Mitigation for Robot Arm-based Additive Manufacturing Utilizing Intelligent Control and IOT
This paper presents an integrated robotic fused deposition modeling additive manufacturing system featuring closed-loop thermal control and intelligent in-situ defect correction using a 6-degree of freedom robotic arm and an Oak-D camera. The robot arm end effector was modified to mount an E3D hotend thermally regulated by an IoT microcontroller, enabling precise temperature control through real-time feedback. Filament extrusion system was synchronized with robotic motion, coordinated via ROS2, ensuring consistent deposition along complex trajectories. A vision system based on OpenCV detects layer-wise defects position, commanding autonomous re-extrusion at identified sites. Experimental validation demonstrated successful defect mitigation in printing operations. The integrated system effectively addresses challenges real-time quality assurance. Inverse kinematics were used for motion planning, while homography transformations corrected camera perspectives for accurate defect localization. The intelligent system successfully mitigated surface anomalies without interrupting the print process. By combining real-time thermal regulation, motion control, and intelligent defect detection & correction, this architecture establishes a scalable and adaptive robotic additive manufacturing framework suitable for aerospace, biomedical, and industrial applications.
comment: This Paper Has Accepted at ASME 2025 International Mechanical Engineering Congress and Exposition (IMECE 2025)
☆ Smooth path planning with safety margins using Piece-Wise Bezier curves
In this paper, we propose a computationally efficient quadratic programming (QP) approach for generating smooth, $C^1$ continuous paths for mobile robots using piece-wise quadratic Bezier (PWB) curves. Our method explicitly incorporates safety margins within a structured optimization framework, balancing trajectory smoothness and robustness with manageable numerical complexity suitable for real-time and embedded applications. Comparative simulations demonstrate clear advantages over traditional piece-wise linear (PWL) path planning methods, showing reduced trajectory deviations, enhanced robustness, and improved overall path quality. These benefits are validated through simulations using a Pure-Pursuit controller in representative scenarios, highlighting the practical effectiveness and scalability of our approach for safe navigation.
☆ SCOUT: A Lightweight Framework for Scenario Coverage Assessment in Autonomous Driving
Assessing scenario coverage is crucial for evaluating the robustness of autonomous agents, yet existing methods rely on expensive human annotations or computationally intensive Large Vision-Language Models (LVLMs). These approaches are impractical for large-scale deployment due to cost and efficiency constraints. To address these shortcomings, we propose SCOUT (Scenario Coverage Oversight and Understanding Tool), a lightweight surrogate model designed to predict scenario coverage labels directly from an agent's latent sensor representations. SCOUT is trained through a distillation process, learning to approximate LVLM-generated coverage labels while eliminating the need for continuous LVLM inference or human annotation. By leveraging precomputed perception features, SCOUT avoids redundant computations and enables fast, scalable scenario coverage estimation. We evaluate our method across a large dataset of real-life autonomous navigation scenarios, demonstrating that it maintains high accuracy while significantly reducing computational cost. Our results show that SCOUT provides an effective and practical alternative for large-scale coverage analysis. While its performance depends on the quality of LVLM-generated training labels, SCOUT represents a major step toward efficient scenario coverage oversight in autonomous systems.
☆ A Humanoid Visual-Tactile-Action Dataset for Contact-Rich Manipulation
Contact-rich manipulation has become increasingly important in robot learning. However, previous studies on robot learning datasets have focused on rigid objects and underrepresented the diversity of pressure conditions for real-world manipulation. To address this gap, we present a humanoid visual-tactile-action dataset designed for manipulating deformable soft objects. The dataset was collected via teleoperation using a humanoid robot equipped with dexterous hands, capturing multi-modal interactions under varying pressure conditions. This work also motivates future research on models with advanced optimization strategies capable of effectively leveraging the complexity and diversity of tactile signals.
☆ A Survey on Collaborative SLAM with 3D Gaussian Splatting
This survey comprehensively reviews the evolving field of multi-robot collaborative Simultaneous Localization and Mapping (SLAM) using 3D Gaussian Splatting (3DGS). As an explicit scene representation, 3DGS has enabled unprecedented real-time, high-fidelity rendering, ideal for robotics. However, its use in multi-robot systems introduces significant challenges in maintaining global consistency, managing communication, and fusing data from heterogeneous sources. We systematically categorize approaches by their architecture -- centralized, distributed -- and analyze core components like multi-agent consistency and alignment, communication-efficient, Gaussian representation, semantic distillation, fusion and pose optimization, and real-time scalability. In addition, a summary of critical datasets and evaluation metrics is provided to contextualize performance. Finally, we identify key open challenges and chart future research directions, including lifelong mapping, semantic association and mapping, multi-model for robustness, and bridging the Sim2Real gap.
♻ ☆ FieldGen: From Teleoperated Pre-Manipulation Trajectories to Field-Guided Data Generation
Large-scale and diverse datasets are vital for training robust robotic manipulation policies, yet existing data collection methods struggle to balance scale, diversity, and quality. Simulation offers scalability but suffers from sim-to-real gaps, while teleoperation yields high-quality demonstrations with limited diversity and high labor cost. We introduce FieldGen, a field-guided data generation framework that enables scalable, diverse, and high-quality real-world data collection with minimal human supervision. FieldGen decomposes manipulation into two stages: a pre-manipulation phase, allowing trajectory diversity, and a fine manipulation phase requiring expert precision. Human demonstrations capture key contact and pose information, after which an attraction field automatically generates diverse trajectories converging to successful configurations. This decoupled design combines scalable trajectory diversity with precise supervision. Moreover, FieldGen-Reward augments generated data with reward annotations to further enhance policy learning. Experiments demonstrate that policies trained with FieldGen achieve higher success rates and improved stability compared to teleoperation-based baselines, while significantly reducing human effort in long-term real-world data collection. Webpage is available at https://fieldgen.github.io/.
comment: Webpage: https://fieldgen.github.io/
♻ ☆ Performance evaluation of a ROS2 based Automated Driving System
Automated driving is currently a prominent area of scientific work. In the future, highly automated driving and new Advanced Driver Assistance Systems will become reality. While Advanced Driver Assistance Systems and automated driving functions for certain domains are already commercially available, ubiquitous automated driving in complex scenarios remains a subject of ongoing research. Contrarily to single-purpose Electronic Control Units, the software for automated driving is often executed on high performance PCs. The Robot Operating System 2 (ROS2) is commonly used to connect components in an automated driving system. Due to the time critical nature of automated driving systems, the performance of the framework is especially important. In this paper, a thorough performance evaluation of ROS2 is conducted, both in terms of timeliness and error rate. The results show that ROS2 is a suitable framework for automated driving systems.
comment: Published and presented at VEHITS 2024, Proceedings of the 10th International Conference on Vehicle Technology and Intelligent Transport Systems - VEHITS; 2024
♻ ☆ Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions auto-regressively in a fixed left-to-right order or attach separate MLP or diffusion heads outside the backbone, leading to fragmented information pathways and specialized training requirements that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a unified-transformer policy that models discretized action chunks with discrete diffusion. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary re-masking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pre-trained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves 96.3% avg. success rates on LIBERO, 71.2% visual matching on SimplerEnv-Fractal and 54.2% overall on SimplerEnv-Bridge, improving over autoregressive, MLP decoder and continuous diffusion baselines. These findings indicate that discrete-diffusion VLA supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets. Our project page is https://github.com/Liang-ZX/DiscreteDiffusionVLA
comment: 16 pages
♻ ☆ Robust Point Cloud Reinforcement Learning via PCA-Based Canonicalization
Reinforcement Learning (RL) from raw visual input has achieved impressive successes in recent years, yet it remains fragile to out-of-distribution variations such as changes in lighting, color, and viewpoint. Point Cloud Reinforcement Learning (PC-RL) offers a promising alternative by mitigating appearance-based brittleness, but its sensitivity to camera pose mismatches continues to undermine reliability in realistic settings. To address this challenge, we propose PCA Point Cloud (PPC), a canonicalization framework specifically tailored for downstream robotic control. PPC maps point clouds under arbitrary rigid-body transformations to a unique canonical pose, aligning observations to a consistent frame, thereby substantially decreasing viewpoint-induced inconsistencies. In our experiments, we show that PPC improves robustness to unseen camera poses across challenging robotic tasks, providing a principled alternative to domain randomization.
♻ ☆ Procedural Generation of Articulated Simulation-Ready Assets
We introduce Infinigen-Articulated, a toolkit for generating realistic, procedurally generated articulated assets for robotics simulation. We include procedural generators for 18 common articulated object categories along with high-level utilities for use creating custom articulated assets in Blender. We also provide an export pipeline to integrate the resulting assets along with their physical properties into common robotics simulators. Experiments demonstrate that assets sampled from these generators are effective for movable object segmentation, training generalizable reinforcement learning policies, and sim-to-real transfer of imitation learning policies.
comment: Updated to include information on newly implemented assets, new experimental results (both simulation and real world), and additional features including material and dynamics parameters
♻ ☆ Concurrent-Allocation Task Execution for Multi-Robot Path-Crossing-Minimal Navigation in Obstacle Environments
Reducing undesirable path crossings among trajectories of different robots is vital in multi-robot navigation missions, which not only reduces detours and conflict scenarios, but also enhances navigation efficiency and boosts productivity. Despite recent progress in multi-robot path-crossing-minimal (MPCM) navigation, the majority of approaches depend on the minimal squared-distance reassignment of suitable desired points to robots directly. However, if obstacles occupy the passing space, calculating the actual robot-point distances becomes complex or intractable, which may render the MPCM navigation in obstacle environments inefficient or even infeasible. In this paper, the concurrent-allocation task execution (CATE) algorithm is presented to address this problem (i.e., MPCM navigation in obstacle environments). First, the path-crossing-related elements in terms of (i) robot allocation, (ii) desired-point convergence, and (iii) collision and obstacle avoidance are encoded into integer and control barrier function (CBF) constraints. Then, the proposed constraints are used in an online constrained optimization framework, which implicitly yet effectively minimizes the possible path crossings and trajectory length in obstacle environments by minimizing the desired point allocation cost and slack variables in CBF constraints simultaneously. In this way, the MPCM navigation in obstacle environments can be achieved with flexible spatial orderings. Note that the feasibility of solutions and the asymptotic convergence property of the proposed CATE algorithm in obstacle environments are both guaranteed, and the calculation burden is also reduced by concurrently calculating the optimal allocation and the control input directly without the path planning process.
comment: Accepted in IEEE Transactions on Robotics
♻ ☆ Robot Cell Modeling via Exploratory Robot Motions: A Novel and Accessible Data-Driven Approach
Generating a collision-free robot motion is crucial for safe applications in real-world settings. This requires an accurate model of all obstacle shapes within the constrained robot cell, which is particularly challenging and time-consuming. The difficulty is heightened in flexible production lines, where the environment model must be updated each time the robot cell is modified. Furthermore, sensor-based methods often necessitate costly hardware and calibration procedures and can be influenced by environmental factors (e.g., light conditions or reflections). To address these challenges, we present a novel data-driven approach to modeling a cluttered workspace, leveraging solely the robot internal joint encoders to capture exploratory motions. By computing the corresponding swept volume (SV), we generate a (conservative) mesh of the environment that is subsequently used for collision checking within established path planning and control methods. Our method significantly reduces the complexity and cost of classical environment modeling by removing the need for computer-aided design (CAD) files and external sensors. We validate the approach with the KUKA LBR iisy collaborative robot in a pick-and-place scenario. In less than three minutes of exploratory robot motions and less than four additional minutes of computation time, we obtain an accurate model that enables collision-free motions. Our approach is intuitive and easy to use, making it accessible to users without specialized technical knowledge. It is applicable to all types of industrial robots or cobots.
comment: 8 pages, 9 figures
♻ ☆ Taxonomy and Trends in Reinforcement Learning for Robotics and Control Systems: A Structured Review
Reinforcement learning (RL) has become a foundational approach for enabling intelligent robotic behavior in dynamic and uncertain environments. This work presents an in-depth review of RL principles, advanced deep reinforcement learning (DRL) algorithms, and their integration into robotic and control systems. Beginning with the formalism of Markov Decision Processes (MDPs), the study outlines essential elements of the agent-environment interaction and explores core algorithmic strategies including actor-critic methods, value-based learning, and policy gradients. Emphasis is placed on modern DRL techniques such as DDPG, TD3, PPO, and SAC, which have shown promise in solving high-dimensional, continuous control tasks. A structured taxonomy is introduced to categorize RL applications across domains such as locomotion, manipulation, multi-agent coordination, and human-robot interaction, along with training methodologies and deployment readiness levels. The review synthesizes recent research efforts, highlighting technical trends, design patterns, and the growing maturity of RL in real-world robotics. Overall, this work aims to bridge theoretical advances with practical implementations, providing a consolidated perspective on the evolving role of RL in autonomous robotic systems.
♻ ☆ Two-Stage Learning of Stabilizing Neural Controllers via Zubov Sampling and Iterative Domain Expansion NeurIPS 2025
Learning-based neural network (NN) control policies have shown impressive empirical performance. However, obtaining stability guarantees and estimates of the region of attraction of these learned neural controllers is challenging due to the lack of stable and scalable training and verification algorithms. Although previous works in this area have achieved great success, much conservatism remains in their frameworks. In this work, we propose a novel two-stage training framework to jointly synthesize a controller and a Lyapunov function for continuous-time systems. By leveraging a Zubov-inspired region of attraction characterization to directly estimate stability boundaries, we propose a novel training-data sampling strategy and a domain-updating mechanism that significantly reduces the conservatism in training. Moreover, unlike existing works on continuous-time systems that rely on an SMT solver to formally verify the Lyapunov condition, we extend state-of-the-art neural network verifier $\alpha,\!\beta$-CROWN with the capability of performing automatic bound propagation through the Jacobian of dynamical systems and a novel verification scheme that avoids expensive bisection. To demonstrate the effectiveness of our approach, we conduct numerical experiments by synthesizing and verifying controllers on several challenging nonlinear systems across multiple dimensions. We show that our training can yield region of attractions with volume $5 - 1.5\cdot 10^{5}$ times larger compared to the baselines, and our verification on continuous systems can be up to $40-10{,}000$ times faster compared to the traditional SMT solver dReal. Our code is available at https://github.com/Verified-Intelligence/Two-Stage_Neural_Controller_Training.
comment: NeurIPS 2025
♻ ☆ Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views NeurIPS 2025
We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
comment: 10 pages, 6 figures, 2 tables. Accepted to the NeurIPS 2025 Workshop on SPACE in Vision, Language, and Embodied AI (SpaVLE). Dataset: https://huggingface.co/datasets/annadeichler/KTH-ARIA-referential
♻ ☆ Acoustic Neural 3D Reconstruction Under Pose Drift IROS
We consider the problem of optimizing neural implicit surfaces for 3D reconstruction using acoustic images collected with drifting sensor poses. The accuracy of current state-of-the-art 3D acoustic modeling algorithms is highly dependent on accurate pose estimation; small errors in sensor pose can lead to severe reconstruction artifacts. In this paper, we propose an algorithm that jointly optimizes the neural scene representation and sonar poses. Our algorithm does so by parameterizing the 6DoF poses as learnable parameters and backpropagating gradients through the neural renderer and implicit representation. We validated our algorithm on both real and simulated datasets. It produces high-fidelity 3D reconstructions even under significant pose drift.
comment: 8 pages, 8 figures. This paper is accepted by 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
♻ ☆ Autonomous Horizon-based Asteroid Navigation With Observability-constrained Maneuvers
Small body exploration is a pertinent challenge due to low gravity environments and strong sensitivity to perturbations like Solar Radiation Pressure (SRP). Thus, autonomous methods are being developed to enable safe navigation and control around small bodies. These methods often involve using Optical Navigation (OpNav) to determine the spacecraft's location. Ensuring OpNav reliability would allow the spacecraft to maintain an accurate state estimate throughout its mission. This research presents an observability-constrained Lyapunov controller that steers a spacecraft to a desired target orbit while guaranteeing continuous OpNav observability. We design observability path constraints to avoid regions where horizon-based OpNav methods exhibit poor performance, ensuring control input that maintains good observability. This controller is implemented with a framework that simulates small body dynamics, synthetic image generation, edge detection, horizon-based OpNav, and filtering. We evaluate the approach in two representative scenarios, orbit maintenance and approach with circularization, around spherical and ellipsoidal target bodies. In Monte Carlo simulations, the proposed approach improves the rate of attaining target orbits without observability violations by up to 94% compared to an unconstrained Lyapunov baseline, demonstrating improved robustness over conventional methods.
comment: 52 pages, 18 figures, published in the Journal of the Astronautical Sciences
♻ ☆ Learning to See and Act: Task-Aware View Planning for Robotic Manipulation
Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-Aware View Planning (TAVP), a framework designed to overcome these challenges by integrating active view planning with task-specific representation learning. TAVP employs an efficient exploration policy, accelerated by a novel pseudo-environment, to actively acquire informative views. Furthermore, we introduce a Mixture-of-Experts (MoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization. By learning to see the world in a task-aware way, TAVP generates more complete and discriminative visual representations, demonstrating significantly enhanced action prediction across a wide array of manipulation challenges. Extensive experiments on RLBench tasks show that our proposed TAVP model achieves superior performance over state-of-the-art fixed-view approaches. Visual results and code are provided at: https://hcplab-sysu.github.io/TAVP.
comment: 14 pages, 8 figures, project page: https://hcplab-sysu.github.io/TAVP
♻ ☆ DynaFlow: Dynamics-embedded Flow Matching for Physically Consistent Motion Generation from State-only Demonstrations
This paper introduces DynaFlow, a novel framework that embeds a differentiable simulator directly into a flow matching model. By generating trajectories in the action space and mapping them to dynamically feasible state trajectories via the simulator, DynaFlow ensures all outputs are physically consistent by construction. This end-to-end differentiable architecture enables training on state-only demonstrations, allowing the model to simultaneously generate physically consistent state trajectories while inferring the underlying action sequences required to produce them. We demonstrate the effectiveness of our approach through quantitative evaluations and showcase its real-world applicability by deploying the generated actions onto a physical Go1 quadruped robot. The robot successfully reproduces diverse gait present in the dataset, executes long-horizon motions in open-loop control and translates infeasible kinematic demonstrations into dynamically executable, stylistic behaviors. These hardware experiments validate that DynaFlow produces deployable, highly effective motions on real-world hardware from state-only demonstrations, effectively bridging the gap between kinematic data and real-world execution.
comment: 8 pages
♻ ☆ HyPerNav: Hybrid Perception for Object-Oriented Navigation in Unknown Environment
Objective-oriented navigation(ObjNav) enables robot to navigate to target object directly and autonomously in an unknown environment. Effective perception in navigation in unknown environment is critical for autonomous robots. While egocentric observations from RGB-D sensors provide abundant local information, real-time top-down maps offer valuable global context for ObjNav. Nevertheless, the majority of existing studies focus on a single source, seldom integrating these two complementary perceptual modalities, despite the fact that humans naturally attend to both. With the rapid advancement of Vision-Language Models(VLMs), we propose Hybrid Perception Navigation (HyPerNav), leveraging VLMs' strong reasoning and vision-language understanding capabilities to jointly perceive both local and global information to enhance the effectiveness and intelligence of navigation in unknown environments. In both massive simulation evaluation and real-world validation, our methods achieved state-of-the-art performance against popular baselines. Benefiting from hybrid perception approach, our method captures richer cues and finds the objects more effectively, by simultaneously leveraging information understanding from egocentric observations and the top-down map. Our ablation study further proved that either of the hybrid perception contributes to the navigation performance.
comment: under review
♻ ☆ GaussianFusion: Gaussian-Based Multi-Sensor Fusion for End-to-End Autonomous Driving NeurIPS2025
Multi-sensor fusion is crucial for improving the performance and robustness of end-to-end autonomous driving systems. Existing methods predominantly adopt either attention-based flatten fusion or bird's eye view fusion through geometric transformations. However, these approaches often suffer from limited interpretability or dense computational overhead. In this paper, we introduce GaussianFusion, a Gaussian-based multi-sensor fusion framework for end-to-end autonomous driving. Our method employs intuitive and compact Gaussian representations as intermediate carriers to aggregate information from diverse sensors. Specifically, we initialize a set of 2D Gaussians uniformly across the driving scene, where each Gaussian is parameterized by physical attributes and equipped with explicit and implicit features. These Gaussians are progressively refined by integrating multi-modal features. The explicit features capture rich semantic and spatial information about the traffic scene, while the implicit features provide complementary cues beneficial for trajectory planning. To fully exploit rich spatial and semantic information in Gaussians, we design a cascade planning head that iteratively refines trajectory predictions through interactions with Gaussians. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate the effectiveness and robustness of the proposed GaussianFusion framework. The source code will be released at https://github.com/Say2L/GaussianFusion.
comment: Accepted at NeurIPS2025 (Spotlight)
♻ ☆ Radar and Event Camera Fusion for Agile Robot Ego-Motion Estimation
Achieving reliable ego motion estimation for agile robots, e.g., aerobatic aircraft, remains challenging because most robot sensors fail to respond timely and clearly to highly dynamic robot motions, often resulting in measurement blurring, distortion, and delays. In this paper, we propose an IMU-free and feature-association-free framework to achieve aggressive ego-motion velocity estimation of a robot platform in highly dynamic scenarios by combining two types of exteroceptive sensors, an event camera and a millimeter wave radar, First, we used instantaneous raw events and Doppler measurements to derive rotational and translational velocities directly. Without a sophisticated association process between measurement frames, the proposed method is more robust in texture-less and structureless environments and is more computationally efficient for edge computing devices. Then, in the back-end, we propose a continuous-time state-space model to fuse the hybrid time-based and event-based measurements to estimate the ego-motion velocity in a fixed-lagged smoother fashion. In the end, we validate our velometer framework extensively in self-collected experiment datasets. The results indicate that our IMU-free and association-free ego motion estimation framework can achieve reliable and efficient velocity output in challenging environments. The source code, illustrative video and dataset are available at https://github.com/ZzhYgwh/TwistEstimator.
comment: 2025.10.28 version v2 for TwistEstimator
♻ ☆ Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model IROS 2025
Omnidirectional depth perception is essential for mobile robotics applications that require scene understanding across a full 360{\deg} field of view. Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps without relying on expensive active sensing. However, existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, due to the scarcity of real-world data. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture. We introduce a dedicated two-stage training strategy to utilize the relative monocular depth features for our omnidirectional stereo matching before scale-invariant fine-tuning. DFI-OmniStereo achieves state-of-the-art results on the real-world Helvipad dataset, reducing disparity MAE by approximately 16% compared to the previous best omnidirectional stereo method.
comment: Accepted at IROS 2025. Project page: https://vita-epfl.github.io/DFI-OmniStereo-website/
♻ ☆ GRS: Generating Robotic Simulation Tasks from Real-World Images
We introduce GRS (Generating Robotic Simulation tasks), a system addressing real-to-sim for robotic simulations. GRS creates digital twin simulations from single RGB-D observations with solvable tasks for virtual agent training. Using vision-language models (VLMs), our pipeline operates in three stages: 1) scene comprehension with SAM2 for segmentation and object description, 2) matching objects with simulation-ready assets, and 3) generating appropriate tasks. We ensure simulation-task alignment through generated test suites and introduce a router that iteratively refines both simulation and test code. Experiments demonstrate our system's effectiveness in object correspondence and task environment generation through our novel router mechanism.
♻ ☆ SimpleVSF: VLM-Scoring Fusion for Trajectory Prediction of End-to-End Autonomous Driving
End-to-end autonomous driving has emerged as a promising paradigm for achieving robust and intelligent driving policies. However, existing end-to-end methods still face significant challenges, such as suboptimal decision-making in complex scenarios. In this paper,we propose SimpleVSF (Simple VLM-Scoring Fusion), a novel framework that enhances end-to-end planning by leveraging the cognitive capabilities of Vision-Language Models (VLMs) and advanced trajectory fusion techniques. We utilize the conventional scorers and the novel VLM-enhanced scorers. And we leverage a robust weight fusioner for quantitative aggregation and a powerful VLM-based fusioner for qualitative, context-aware decision-making. As the leading approach in the ICCV 2025 NAVSIM v2 End-to-End Driving Challenge, our SimpleVSF framework demonstrates state-of-the-art performance, achieving a superior balance between safety, comfort, and efficiency.
♻ ☆ Online Adaptation for Flying Quadrotors in Tight Formations
The task of flying in tight formations is challenging for teams of quadrotors because the complex aerodynamic wake interactions can destabilize individual team members as well as the team. Furthermore, these aerodynamic effects are highly nonlinear and fast-paced, making them difficult to model and predict. To overcome these challenges, we present L1 KNODE-DW MPC, an adaptive, mixed expert learning based control framework that allows individual quadrotors to accurately track trajectories while adapting to time-varying aerodynamic interactions during formation flights. We evaluate L1 KNODE-DW MPC in two different three-quadrotor formations and show that it outperforms several MPC baselines. Our results show that the proposed framework is capable of enabling the three-quadrotor team to remain vertically aligned in close proximity throughout the flight. These findings show that the L1 adaptive module compensates for unmodeled disturbances most effectively when paired with an accurate dynamics model. A video showcasing our framework and the physical experiments is available here: https://youtu.be/9QX1Q5Ut9Rs
comment: 10 pages, 4 figures
♻ ☆ EquiContact: A Hierarchical SE(3) Vision-to-Force Equivariant Policy for Spatially Generalizable Contact-rich Tasks
This paper presents a framework for learning vision-based robotic policies for contact-rich manipulation tasks that generalize spatially across task configurations. We focus on achieving robust spatial generalization of the policy for the peg-in-hole (PiH) task trained from a small number of demonstrations. We propose EquiContact, a hierarchical policy composed of a high-level vision planner (Diffusion Equivariant Descriptor Field, Diff-EDF) and a novel low-level compliant visuomotor policy (Geometric Compliant ACT, G-CompACT). G-CompACT operates using only localized observations (geometrically consistent error vectors (GCEV), force-torque readings, and wrist-mounted RGB images) and produces actions defined in the end-effector frame. Through these design choices, we show that the entire EquiContact pipeline is SE(3)-equivariant, from perception to force control. We also outline three key components for spatially generalizable contact-rich policies: compliance, localized policies, and induced equivariance. Real-world experiments on PiH, screwing, and surface wiping tasks demonstrate a near-perfect success rate and robust generalization to unseen spatial configurations, validating the proposed framework and principles. The experimental videos can be found on the project website: https://sites.google.com/berkeley.edu/equicontact
comment: Submitted to RA-L
♻ ☆ Quantum Machine Learning and Grover's Algorithm for Quantum Optimization of Robotic Manipulators
Optimizing high-degree of freedom robotic manipulators requires searching complex, high-dimensional configuration spaces, a task that is computationally challenging for classical methods. This paper introduces a quantum native framework that integrates quantum machine learning with Grover's algorithm to solve kinematic optimization problems efficiently. A parameterized quantum circuit is trained to approximate the forward kinematics model, which then constructs an oracle to identify optimal configurations. Grover's algorithm leverages this oracle to provide a quadratic reduction in search complexity. Demonstrated on simulated 1-DoF, 2-DoF, and dual-arm manipulator tasks, the method achieves significant speedups-up to 93x over classical optimizers like Nelder Mead as problem dimensionality increases. This work establishes a foundational, quantum-native framework for robot kinematic optimization, effectively bridging quantum computing and robotics problems.
♻ ☆ CAT-RRT: Motion Planning that Admits Contact One Link at a Time
Current motion planning approaches rely on binary collision checking to evaluate the validity of a state and thereby dictate where the robot is allowed to move. This approach leaves little room for robots to engage in contact with an object, as is often necessary when operating in densely cluttered spaces. In this work, we propose an alternative method that considers contact states as high-cost states that the robot should avoid but can traverse if necessary to complete a task. More specifically, we introduce Contact Admissible Transition-based Rapidly exploring Random Trees (CAT-RRT), a planner that uses a novel per-link cost heuristic to find a path by traversing high-cost obstacle regions. Through extensive testing, we find that state-of-the-art optimization planners tend to over-explore low-cost states, which leads to slow and inefficient convergence to contact regions. Conversely, CAT-RRT searches both low and high-cost regions simultaneously with an adaptive thresholding mechanism carried out at each robot link. This leads to paths with a balance between efficiency, path length, and contact cost.
♻ ☆ Efficient Path Planning and Task Allocation Algorithm for Boolean Specifications
This paper presents a novel path-planning and task assignment algorithm for multi-robot systems that should fulfill a global Boolean specification. The proposed method is based on Integer Linear Programming (ILP) formulations, which are combined with structural insights from Petri nets to improve scalability and computational efficiency. By proving that the \emph{constraint matrix} is totally unimodular (TU) for certain classes of problems, the ILP formulation can be relaxed into a Linear Programming (LP) problem without losing the integrality of the solution. This relaxation eliminates complex combinatorial techniques, significantly reducing computational overhead and thus ensuring scalability for large-scale systems. Using the approach proposed in this paper, we can solve path-planning problems for teams made up to 500 robots. The method guarantees computational tractability, handles collision avoidance and reduces computational demands through iterative LP optimization techniques. Case studies demonstrate the efficiency of the algorithm in generating scalable, collision-free paths for large robot teams navigating in complex environments. While the conservative nature of collision avoidance introduces additional constraints, and thus, computational requirements, the solution remains practical and impactful for diverse applications. The algorithm is particularly applicable to real-world scenarios, including warehouse logistics where autonomous robots must efficiently coordinate tasks or search-and-rescue operations in various environments. This work contributes both theoretically and practically to scalable multi-robot path planning and task allocation, offering an efficient framework for coordinating autonomous agents in shared environments.
♻ ☆ Federated Deep Reinforcement Learning for Privacy-Preserving Robotic-Assisted Surgery
The integration of Reinforcement Learning (RL) into robotic-assisted surgery (RAS) holds significant promise for advancing surgical precision, adaptability, and autonomous decision-making. However, the development of robust RL models in clinical settings is hindered by key challenges, including stringent patient data privacy regulations, limited access to diverse surgical datasets, and high procedural variability. To address these limitations, this paper presents a Federated Deep Reinforcement Learning (FDRL) framework that enables decentralized training of RL models across multiple healthcare institutions without exposing sensitive patient information. A central innovation of the proposed framework is its dynamic policy adaptation mechanism, which allows surgical robots to select and tailor patient-specific policies in real-time, thereby ensuring personalized and Optimised interventions. To uphold rigorous privacy standards while facilitating collaborative learning, the FDRL framework incorporates secure aggregation, differential privacy, and homomorphic encryption techniques. Experimental results demonstrate a 60\% reduction in privacy leakage compared to conventional methods, with surgical precision maintained within a 1.5\% margin of a centralized baseline. This work establishes a foundational approach for adaptive, secure, and patient-centric AI-driven surgical robotics, offering a pathway toward clinical translation and scalable deployment across diverse healthcare environments.
comment: 11 pages, 7 figures, conference
♻ ☆ On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
In Vision-Language-Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find (1) actions as the most fragile modality, (2) Existing visual-robust VLA do not gain robustness in other modality, and (3) pi0 demonstrates superior robustness with a diffusion-based action head. To build multi-modal robust VLAs, we propose RobustVLA against perturbations in VLA inputs and outputs. For output robustness, we perform offline robust optimization against worst-case action noise that maximizes mismatch in flow matching objective. This can be seen as adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper confidence bound algorithm to automatically identify the most harmful noise. Experiments on LIBERO demonstrate our RobustVLA delivers absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone across all 17 perturbations, achieving 50.6x faster inference than existing visual-robust VLAs, and a 10.4% gain under mixed perturbations. Our RobustVLA is particularly effective on real-world FR5 robot with limited demonstrations, showing absolute gains by 65.6% under perturbations of four modalities.
Computer Vision and Pattern Recognition 172
☆ Generative View Stitching
Autoregressive video diffusion models are capable of long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce Omni Guidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersv\"ard's Impossible Staircase. Results are best viewed as videos at https://andrewsonga.github.io/gvs.
comment: Project website: https://andrewsonga.github.io/gvs
☆ Uniform Discrete Diffusion with Metric Path for Video Generation
Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for the scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA
comment: 19 pages, 10 figures
☆ Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance
Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. To this end, we present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. Code and models will be made publicly available.
☆ Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers? NeurIPS 2025
Object binding, the brain's ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term IsSameObject. We decode IsSameObject from patch embeddings across ViT layers using a similarity probe, which reaches over 90% accuracy. Crucially, this object-binding capability emerges reliably in self-supervised ViTs (DINO, MAE, CLIP), but markedly weaker in ImageNet-supervised models, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. Our findings challenge the view that ViTs lack object binding and highlight how symbolic knowledge of "which parts belong together" emerges naturally in a connectionist system.
comment: Accepted as a Spotlight at NeurIPS 2025
☆ MIC-BEV: Multi-Infrastructure Camera Bird's-Eye-View Transformer with Relation-Aware Fusion for 3D Object Detection
Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios due to challenges such as multi-view infrastructure setup, diverse camera configurations, degraded visual inputs, and various road layouts. We introduce MIC-BEV, a Transformer-based bird's-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection, featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment. The dataset and source code are available at: https://github.com/HandsomeYun/MIC-BEV.
☆ SAGE: Structure-Aware Generative Video Transitions between Diverse Clips
Video transitions aim to synthesize intermediate frames between two clips, but naive approaches such as linear blending introduce artifacts that limit professional use or break temporal coherence. Traditional techniques (cross-fades, morphing, frame interpolation) and recent generative inbetweening methods can produce high-quality plausible intermediates, but they struggle with bridging diverse clips involving large temporal gaps or significant semantic differences, leaving a gap for content-aware and visually coherent transitions. We address this challenge by drawing on artistic workflows, distilling strategies such as aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity. Building on this, we propose SAGE (Structure-Aware Generative vidEo transitions) as a zeroshot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis, enabling smooth, semantically consistent transitions without fine-tuning. Extensive experiments and comparison with current alternatives, namely [FILM, TVG, DiffMorpher, VACE, GI], demonstrate that SAGE outperforms both classical and generative baselines on quantitative metrics and user studies for producing transitions between diverse clips. Code to be released on acceptance.
comment: Website: https://kan32501.github.io/sage.github.io/
☆ Group Relative Attention Guidance for Image Editing
Recently, image editing based on Diffusion-in-Transformer models has undergone rapid development. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes the content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance, a simple yet effective method that reweights the delta values of different tokens to modulate the focus of the model on the input image relative to the editing instruction, enabling continuous and fine-grained control over editing intensity without any tuning. Extensive experiments conducted on existing image editing frameworks demonstrate that GRAG can be integrated with as few as four lines of code, consistently enhancing editing quality. Moreover, compared to the commonly used Classifier-Free Guidance, GRAG achieves smoother and more precise control over the degree of editing. Our code will be released at https://github.com/little-misfit/GRAG-Image-Editing.
☆ Eye-Tracking, Mouse Tracking, Stimulus Tracking,and Decision-Making Datasets in Digital Pathology
Interpretation of giga-pixel whole-slide images (WSIs) is an important but difficult task for pathologists. Their diagnostic accuracy is estimated to average around 70%. Adding a second pathologist does not substantially improve decision consistency. The field lacks adequate behavioral data to explain diagnostic errors and inconsistencies. To fill in this gap, we present PathoGaze1.0, a comprehensive behavioral dataset capturing the dynamic visual search and decision-making processes of the full diagnostic workflow during cancer diagnosis. The dataset comprises 18.69 hours of eye-tracking, mouse interaction, stimulus tracking, viewport navigation, and diagnostic decision data (EMSVD) collected from 19 pathologists interpreting 397 WSIs. The data collection process emphasizes ecological validity through an application-grounded testbed, called PTAH. In total, we recorded 171,909 fixations, 263,320 saccades, and 1,867,362 mouse interaction events. In addition, such data could also be used to improve the training of both pathologists and AI systems that might support human experts. All experiments were preregistered at https://osf.io/hj9a7, and the complete dataset along with analysis code is available at https://go.osu.edu/pathogaze.
comment: 16 pages, 9 figures, submitted to Nature Scientific Data
☆ A Dual-Branch CNN for Robust Detection of AI-Generated Facial Forgeries
The rapid advancement of generative AI has enabled the creation of highly realistic forged facial images, posing significant threats to AI security, digital media integrity, and public trust. Face forgery techniques, ranging from face swapping and attribute editing to powerful diffusion-based image synthesis, are increasingly being used for malicious purposes such as misinformation, identity fraud, and defamation. This growing challenge underscores the urgent need for robust and generalizable face forgery detection methods as a critical component of AI security infrastructure. In this work, we propose a novel dual-branch convolutional neural network for face forgery detection that leverages complementary cues from both spatial and frequency domains. The RGB branch captures semantic information, while the frequency branch focuses on high-frequency artifacts that are difficult for generative models to suppress. A channel attention module is introduced to adaptively fuse these heterogeneous features, highlighting the most informative channels for forgery discrimination. To guide the network's learning process, we design a unified loss function, FSC Loss, that combines focal loss, supervised contrastive loss, and a frequency center margin loss to enhance class separability and robustness. We evaluate our model on the DiFF benchmark, which includes forged images generated from four representative methods: text-to-image, image-to-image, face swap, and face edit. Our method achieves strong performance across all categories and outperforms average human accuracy. These results demonstrate the model's effectiveness and its potential contribution to safeguarding AI ecosystems against visual forgery attacks.
☆ GroundLoc: Efficient Large-Scale Outdoor LiDAR-Only Localization
In this letter, we introduce GroundLoc, a LiDAR-only localization pipeline designed to localize a mobile robot in large-scale outdoor environments using prior maps. GroundLoc employs a Bird's-Eye View (BEV) image projection focusing on the perceived ground area and utilizes the place recognition network R2D2, or alternatively, the non-learning approach Scale-Invariant Feature Transform (SIFT), to identify and select keypoints for BEV image map registration. Our results demonstrate that GroundLoc outperforms state-of-the-art methods on the SemanticKITTI and HeLiPR datasets across various sensors. In the multi-session localization evaluation, GroundLoc reaches an Average Trajectory Error (ATE) well below 50 cm on all Ouster OS2 128 sequences while meeting online runtime requirements. The system supports various sensor models, as evidenced by evaluations conducted with Velodyne HDL-64E, Ouster OS2 128, Aeva Aeries II, and Livox Avia sensors. The prior maps are stored as 2D raster image maps, which can be created from a single drive and require only 4 MB of storage per square kilometer. The source code is available at https://github.com/dcmlr/groundloc.
☆ Physics-Inspired Gaussian Kolmogorov-Arnold Networks for X-ray Scatter Correction in Cone-Beam CT
Cone-beam CT (CBCT) employs a flat-panel detector to achieve three-dimensional imaging with high spatial resolution. However, CBCT is susceptible to scatter during data acquisition, which introduces CT value bias and reduced tissue contrast in the reconstructed images, ultimately degrading diagnostic accuracy. To address this issue, we propose a deep learning-based scatter artifact correction method inspired by physical prior knowledge. Leveraging the fact that the observed point scatter probability density distribution exhibits rotational symmetry in the projection domain. The method uses Gaussian Radial Basis Functions (RBF) to model the point scatter function and embeds it into the Kolmogorov-Arnold Networks (KAN) layer, which provides efficient nonlinear mapping capabilities for learning high-dimensional scatter features. By incorporating the physical characteristics of the scattered photon distribution together with the complex function mapping capacity of KAN, the model improves its ability to accurately represent scatter. The effectiveness of the method is validated through both synthetic and real-scan experiments. Experimental results show that the model can effectively correct the scatter artifacts in the reconstructed images and is superior to the current methods in terms of quantitative metrics.
comment: 8 pages, 6 figures
☆ OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
With advances in decision-making and reasoning capabilities, multimodal agents show strong potential in computer application scenarios. Past evaluations have mainly assessed GUI interaction skills, while tool invocation abilities, such as those enabled by the Model Context Protocol (MCP), have been largely overlooked. Comparing agents with integrated tool invocation to those evaluated only on GUI interaction is inherently unfair. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents' tool invocation, GUI operation, and decision-making abilities in a real-world environment. We design a novel automated code-generation pipeline to create tools and combine them with a curated selection from existing tools. Rigorous manual validation yields 158 high-quality tools (covering 7 common applications), each verified for correct functionality, practical applicability, and versatility. Extensive evaluations of state-of-the-art multimodal agents on OSWorld-MCP show that MCP tools generally improve task success rates (e.g., from 8.3% to 20.4% for OpenAI o3 at 15 steps, from 40.1% to 43.3% for Claude 4 Sonnet at 50 steps), underscoring the importance of assessing tool invocation capabilities. However, even the strongest models have relatively low tool invocation rates, Only 36.3%, indicating room for improvement and highlighting the benchmark's challenge. By explicitly measuring MCP tool usage skills, OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments. Our code, environment, and data are publicly available at https://osworld-mcp.github.io.
☆ Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
While Multimodal Large Language Models (MLLMs) excel at visual understanding, they often struggle in complex scenarios that require visual planning and imagination. Inspired by how humans use sketching as a form of visual thinking to develop and communicate ideas, we introduce Latent Sketchpad, a framework that equips MLLMs with an internal visual scratchpad. The internal visual representations of MLLMs have traditionally been confined to perceptual understanding. We repurpose them to support generative visual thought without compromising reasoning ability. Building on frontier MLLMs, our approach integrates visual generation directly into their native autoregressive reasoning process. It allows the model to interleave textual reasoning with the generation of visual latents. These latents guide the internal thought process and can be translated into sketch images for interpretability. To realize this, we introduce two components: a Context-Aware Vision Head autoregressively produces visual representations, and a pretrained Sketch Decoder renders these into human-interpretable images. We evaluate the framework on our new dataset MazePlanning. Experiments across various MLLMs show that Latent Sketchpad delivers comparable or even superior reasoning performance to their backbone. It further generalizes across distinct frontier MLLMs, including Gemma3 and Qwen2.5-VL. By extending model's textual reasoning to visual thinking, our framework opens new opportunities for richer human-computer interaction and broader applications. More details and resources are available on our project page: https://latent-sketchpad.github.io/.
☆ Local Performance vs. Out-of-Distribution Generalization: An Empirical Analysis of Personalized Federated Learning in Heterogeneous Data Environments
In the context of Federated Learning with heterogeneous data environments, local models tend to converge to their own local model optima during local training steps, deviating from the overall data distributions. Aggregation of these local updates, e.g., with FedAvg, often does not align with the global model optimum (client drift), resulting in an update that is suboptimal for most clients. Personalized Federated Learning approaches address this challenge by exclusively focusing on the average local performances of clients' models on their own data distribution. Generalization to out-of-distribution samples, which is a substantial benefit of FedAvg and represents a significant component of robustness, appears to be inadequately incorporated into the assessment and evaluation processes. This study involves a thorough evaluation of Federated Learning approaches, encompassing both their local performance and their generalization capabilities. Therefore, we examine different stages within a single communication round to enable a more nuanced understanding of the considered metrics. Furthermore, we propose and incorporate a modified approach of FedAvg, designated as Federated Learning with Individualized Updates (FLIU), extending the algorithm by a straightforward individualization step with an adaptive personalization factor. We evaluate and compare the approaches empirically using MNIST and CIFAR-10 under various distributional conditions, including benchmark IID and pathological non-IID, as well as additional novel test environments with Dirichlet distribution specifically developed to stress the algorithms on complex data heterogeneity.
☆ Fast and accurate neural reflectance transformation imaging through knowledge distillation
Reflectance Transformation Imaging (RTI) is very popular for its ability to visually analyze surfaces by enhancing surface details through interactive relighting, starting from only a few tens of photographs taken with a fixed camera and variable illumination. Traditional methods like Polynomial Texture Maps (PTM) and Hemispherical Harmonics (HSH) are compact and fast, but struggle to accurately capture complex reflectance fields using few per-pixel coefficients and fixed bases, leading to artifacts, especially in highly reflective or shadowed areas. The NeuralRTI approach, which exploits a neural autoencoder to learn a compact function that better approximates the local reflectance as a function of light directions, has been shown to produce superior quality at comparable storage cost. However, as it performs interactive relighting with custom decoder networks with many parameters, the rendering step is computationally expensive and not feasible at full resolution for large images on limited hardware. Earlier attempts to reduce costs by directly training smaller networks have failed to produce valid results. For this reason, we propose to reduce its computational cost through a novel solution based on Knowledge Distillation (DisK-NeuralRTI). ...
comment: 18 pages
☆ Decoupled MeanFlow: Turning Flow Models into Flow Maps for Accelerated Sampling
Denoising generative models, such as diffusion and flow-based models, produce high-quality samples but require many denoising steps due to discretization error. Flow maps, which estimate the average velocity between timesteps, mitigate this error and enable faster sampling. However, their training typically demands architectural changes that limit compatibility with pretrained flow models. We introduce Decoupled MeanFlow, a simple decoding strategy that converts flow models into flow map models without architectural modifications. Our method conditions the final blocks of diffusion transformers on the subsequent timestep, allowing pretrained flow models to be directly repurposed as flow maps. Combined with enhanced training techniques, this design enables high-quality generation in as few as 1 to 4 steps. Notably, we find that training flow models and subsequently converting them is more efficient and effective than training flow maps from scratch. On ImageNet 256x256 and 512x512, our models attain 1-step FID of 2.16 and 2.12, respectively, surpassing prior art by a large margin. Furthermore, we achieve FID of 1.51 and 1.68 when increasing the steps to 4, which nearly matches the performance of flow models while delivering over 100x faster inference.
☆ Kineo: Calibration-Free Metric Motion Capture From Sparse RGB Cameras
Markerless multiview motion capture is often constrained by the need for precise camera calibration, limiting accessibility for non-experts and in-the-wild captures. Existing calibration-free approaches mitigate this requirement but suffer from high computational cost and reduced reconstruction accuracy. We present Kineo, a fully automatic, calibration-free pipeline for markerless motion capture from videos captured by unsynchronized, uncalibrated, consumer-grade RGB cameras. Kineo leverages 2D keypoints from off-the-shelf detectors to simultaneously calibrate cameras, including Brown-Conrady distortion coefficients, and reconstruct 3D keypoints and dense scene point maps at metric scale. A confidence-driven spatio-temporal keypoint sampling strategy, combined with graph-based global optimization, ensures robust calibration at a fixed computational cost independent of sequence length. We further introduce a pairwise reprojection consensus score to quantify 3D reconstruction reliability for downstream tasks. Evaluations on EgoHumans and Human3.6M demonstrate substantial improvements over prior calibration-free methods. Compared to previous state-of-the-art approaches, Kineo reduces camera translation error by approximately 83-85%, camera angular error by 86-92%, and world mean-per-joint error (W-MPJPE) by 83-91%. Kineo is also efficient in real-world scenarios, processing multi-view sequences faster than their duration in specific configuration (e.g., 36min to process 1h20min of footage). The full pipeline and evaluation code are openly released to promote reproducibility and practical adoption at https://liris-xr.github.io/kineo/.
☆ A Critical Study towards the Detection of Parkinsons Disease using ML Technologies
The proposed solution is Deep Learning Technique that will be able classify three types of tea leaves diseases from which two diseases are caused by the pests and one due to pathogens (infectious organisms) and environmental conditions and also show the area damaged by a disease in leaves. Namely Red Rust, Helopeltis and Red spider mite respectively. In this paper we have evaluated two models namely SSD MobileNet V2 and Faster R-CNN ResNet50 V1 for the object detection. The SSD MobileNet V2 gave precision of 0.209 for IOU range of 0.50:0.95 with recall of 0.02 on IOU 0.50:0.95 and final mAP of 20.9%. While Faster R-CNN ResNet50 V1 has precision of 0.252 on IOU range of 0.50:0.95 and recall of 0.044 on IOU of 0.50:0.95 with a mAP of 25%, which is better than SSD. Also used Mask R-CNN for Object Instance Segmentation where we have implemented our custom method to calculate the damaged diseased portion of leaves. Keywords: Tea Leaf Disease, Deep Learning, Red Rust, Helopeltis and Red Spider Mite, SSD MobileNet V2, Faster R-CNN ResNet50 V1 and Mask RCNN.
☆ Rethinking Visual Intelligence: Insights from Video Pretraining
Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
comment: Updated version from preprint arXiv:2506.07280 (Gen2Gen) focused on visual intelligence. This work can be considered as v2
☆ SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space
Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases-crucial in real-world applications where users express the same intent in varied ways-remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA-a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing-even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.
☆ Deeply-Conditioned Image Compression via Self-Generated Priors
Learned image compression (LIC) has shown great promise for achieving high rate-distortion performance. However, current LIC methods are often limited in their capability to model the complex correlation structures inherent in natural images, particularly the entanglement of invariant global structures with transient local textures within a single monolithic representation. This limitation precipitates severe geometric deformation at low bitrates. To address this, we introduce a framework predicated on functional decomposition, which we term Deeply-Conditioned Image Compression via self-generated priors (DCIC-sgp). Our central idea is to first encode a potent, self-generated prior to encapsulate the image's structural backbone. This prior is subsequently utilized not as mere side-information, but to holistically modulate the entire compression pipeline. This deep conditioning, most critically of the analysis transform, liberates it to dedicate its representational capacity to the residual, high-entropy details. This hierarchical, dependency-driven approach achieves an effective disentanglement of information streams. Our extensive experiments validate this assertion; visual analysis demonstrates that our method substantially mitigates the geometric deformation artifacts that plague conventional codecs at low bitrates. Quantitatively, our framework establishes highly competitive performance, achieving significant BD-rate reductions of 14.4%, 15.7%, and 15.1% against the VVC test model VTM-12.1 on the Kodak, CLIC, and Tecnick datasets.
☆ XAI Evaluation Framework for Semantic Segmentation
Ensuring transparency and trust in artificial intelligence (AI) models is essential, particularly as they are increasingly applied in safety-critical and high-stakes domains. Explainable AI (XAI) has emerged as a promising approach to address this challenge, yet the rigorous evaluation of XAI methods remains crucial for optimizing the trade-offs between model complexity, predictive performance, and interpretability. While extensive progress has been achieved in evaluating XAI techniques for classification tasks, evaluation strategies tailored to semantic segmentation remain relatively underexplored. This work introduces a comprehensive and systematic evaluation framework specifically designed for assessing XAI in semantic segmentation, explicitly accounting for both spatial and contextual task complexities. The framework employs pixel-level evaluation strategies and carefully designed metrics to provide fine-grained interpretability insights. Simulation results using recently adapted class activation mapping (CAM)-based XAI schemes demonstrate the efficiency, robustness, and reliability of the proposed methodology. These findings contribute to advancing transparent, trustworthy, and accountable semantic segmentation models.
☆ 50 Years of Water Body Monitoring: The Case of Qaraaoun Reservoir, Lebanon
The sustainable management of the Qaraaoun Reservoir, the largest surface water body in Lebanon located in the Bekaa Plain, depends on reliable monitoring of its storage volume despite frequent sensor malfunctions and limited maintenance capacity. This study introduces a sensor-free approach that integrates open-source satellite imagery, advanced water-extent segmentation, and machine learning to estimate the reservoir surface area and volume in near real time. Sentinel-2 and Landsat images are processed, where surface water is delineated using a newly proposed water segmentation index. A machine learning model based on Support Vector Regression (SVR) is trained on a curated dataset that includes water surface area, water level, and water volume calculations using a reservoir bathymetry survey. The model is then able to estimate reservoir volume relying solely on surface area extracted from satellite imagery, without the need for ground measurements. Water segmentation using the proposed index aligns with ground truth for more than 95 percent of the shoreline. Hyperparameter tuning with GridSearchCV yields an optimized SVR performance with error under 1.5 percent of full reservoir capacity and coefficients of determination exceeding 0.98. These results demonstrate the robustness and cost-effectiveness of the method, offering a practical solution for continuous, sensor-independent monitoring of reservoir storage. The proposed methodology can be replicated for other water bodies, and the resulting 50 years of time-series data is valuable for research on climate change and environmental patterns.
☆ OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents.
comment: work in progress
☆ A Hybrid Approach for Visual Multi-Object Tracking
This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2
comment: This work has been submitted to the IEEE for possible publication
☆ GenTrack: A New Generation of Multi-Object Tracking
This paper introduces a novel multi-object tracking (MOT) method, dubbed GenTrack, whose main contributions include: a hybrid tracking approach employing both stochastic and deterministic manners to robustly handle unknown and time-varying numbers of targets, particularly in maintaining target identity (ID) consistency and managing nonlinear dynamics, leveraging particle swarm optimization (PSO) with some proposed fitness measures to guide stochastic particles toward their target distribution modes, enabling effective tracking even with weak and noisy object detectors, integration of social interactions among targets to enhance PSO-guided particles as well as improve continuous updates of both strong (matched) and weak (unmatched) tracks, thereby reducing ID switches and track loss, especially during occlusions, a GenTrack-based redefined visual MOT baseline incorporating a comprehensive state and observation model based on space consistency, appearance, detection confidence, track penalties, and social scores for systematic and efficient target updates, and the first-ever publicly available source-code reference implementation with minimal dependencies, featuring three variants, including GenTrack Basic, PSO, and PSO-Social, facilitating flexible reimplementation. Experimental results have shown that GenTrack provides superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with integrated implementations of baselines for fair comparison. Potential directions for future work are also discussed. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack
comment: This work has been submitted to the IEEE for possible publication
☆ Unsupervised Detection of Post-Stroke Brain Abnormalities
Post-stroke MRI not only delineates focal lesions but also reveals secondary structural changes, such as atrophy and ventricular enlargement. These abnormalities, increasingly recognised as imaging biomarkers of recovery and outcome, remain poorly captured by supervised segmentation methods. We evaluate REFLECT, a flow-based generative model, for unsupervised detection of both focal and non-lesional abnormalities in post-stroke patients. Using dual-expert central-slice annotations on ATLAS data, performance was assessed at the object level with Free-Response ROC analysis for anomaly maps. Two models were trained on lesion-free slices from stroke patients (ATLAS) and on healthy controls (IXI) to test the effect of training data. On ATLAS test subjects, the IXI-trained model achieved higher lesion segmentation (Dice = 0.37 vs 0.27) and improved sensitivity to non-lesional abnormalities (FROC = 0.62 vs 0.43). Training on fully healthy anatomy improves the modelling of normal variability, enabling broader and more reliable detection of structural abnormalities.
☆ When are radiology reports useful for training medical image classifiers?
Medical images used to train machine learning models are often accompanied by radiology reports containing rich expert annotations. However, relying on these reports as inputs for clinical prediction requires the timely manual work of a trained radiologist. This raises a natural question: when can radiology reports be leveraged during training to improve image-only classification? Prior works are limited to evaluating pre-trained image representations by fine-tuning them to predict diagnostic labels, often extracted from reports, ignoring tasks with labels that are weakly associated with the text. To address this gap, we conduct a systematic study of how radiology reports can be used during both pre-training and fine-tuning, across diagnostic and prognostic tasks (e.g., 12-month readmission), and under varying training set sizes. Our findings reveal that: (1) Leveraging reports during pre-training is beneficial for downstream classification tasks where the label is well-represented in the text; however, pre-training through explicit image-text alignment can be detrimental in settings where it's not; (2) Fine-tuning with reports can lead to significant improvements and even have a larger impact than the pre-training method in certain settings. These results provide actionable insights into when and how to leverage privileged text data to train medical image classifiers while highlighting gaps in current research.
☆ A Luminance-Aware Multi-Scale Network for Polarization Image Fusion with a Multi-Scene Dataset
Polarization image fusion combines S0 and DOLP images to reveal surface roughness and material properties through complementary texture features, which has important applications in camouflage recognition, tissue pathology analysis, surface defect detection and other fields. To intergrate coL-Splementary information from different polarized images in complex luminance environment, we propose a luminance-aware multi-scale network (MLSN). In the encoder stage, we propose a multi-scale spatial weight matrix through a brightness-branch , which dynamically weighted inject the luminance into the feature maps, solving the problem of inherent contrast difference in polarized images. The global-local feature fusion mechanism is designed at the bottleneck layer to perform windowed self-attention computation, to balance the global context and local details through residual linking in the feature dimension restructuring stage. In the decoder stage, to further improve the adaptability to complex lighting, we propose a Brightness-Enhancement module, establishing the mapping relationship between luminance distribution and texture features, realizing the nonlinear luminance correction of the fusion result. We also present MSP, an 1000 pairs of polarized images that covers 17 types of indoor and outdoor complex lighting scenes. MSP provides four-direction polarization raw maps, solving the scarcity of high-quality datasets in polarization image fusion. Extensive experiment on MSP, PIF and GAND datasets verify that the proposed MLSN outperms the state-of-the-art methods in subjective and objective evaluations, and the MS-SSIM and SD metircs are higher than the average values of other methods by 8.57%, 60.64%, 10.26%, 63.53%, 22.21%, and 54.31%, respectively. The source code and dataset is avalable at https://github.com/1hzf/MLS-UNet.
☆ Stroke Lesion Segmentation in Clinical Workflows: A Modular, Lightweight, and Deployment-Ready Tool
Deep learning frameworks such as nnU-Net achieve state-of-the-art performance in brain lesion segmentation but remain difficult to deploy clinically due to heavy dependencies and monolithic design. We introduce \textit{StrokeSeg}, a modular and lightweight framework that translates research-grade stroke lesion segmentation models into deployable applications. Preprocessing, inference, and postprocessing are decoupled: preprocessing relies on the Anima toolbox with BIDS-compliant outputs, and inference uses ONNX Runtime with \texttt{Float16} quantisation, reducing model size by about 50\%. \textit{StrokeSeg} provides both graphical and command-line interfaces and is distributed as Python scripts and as a standalone Windows executable. On a held-out set of 300 sub-acute and chronic stroke subjects, segmentation performance was equivalent to the original PyTorch pipeline (Dice difference $<10^{-3}$), demonstrating that high-performing research pipelines can be transformed into portable, clinically usable tools.
☆ Decoupling What to Count and Where to See for Referring Expression Counting
Referring Expression Counting (REC) extends class-level object counting to the fine-grained subclass-level, aiming to enumerate objects matching a textual expression that specifies both the class and distinguishing attribute. A fundamental challenge, however, has been overlooked: annotation points are typically placed on class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information from other visual regions (e.g., legs for "walking"). To address this, we propose W2-Net, a novel framework that explicitly decouples the problem into "what to count" and "where to see" via a dual-query mechanism. Specifically, alongside the standard what-to-count (w2c) queries that localize the object, we introduce dedicated where-to-see (w2s) queries. The w2s queries are guided to seek and extract features from attribute-specific visual regions, enabling precise subclass discrimination. Furthermore, we introduce Subclass Separable Matching (SSM), a novel matching strategy that incorporates a repulsive force to enhance inter-subclass separability during label assignment. W2-Net significantly outperforms the state-of-the-art on the REC-8K dataset, reducing counting error by 22.5% (validation) and 18.0% (test), and improving localization F1 by 7% and 8%, respectively. Code will be available.
☆ Adaptive Knowledge Transferring with Switching Dual-Student Framework for Semi-Supervised Medical Image Segmentation
Teacher-student frameworks have emerged as a leading approach in semi-supervised medical image segmentation, demonstrating strong performance across various tasks. However, the learning effects are still limited by the strong correlation and unreliable knowledge transfer process between teacher and student networks. To overcome this limitation, we introduce a novel switching Dual-Student architecture that strategically selects the most reliable student at each iteration to enhance dual-student collaboration and prevent error reinforcement. We also introduce a strategy of Loss-Aware Exponential Moving Average to dynamically ensure that the teacher absorbs meaningful information from students, improving the quality of pseudo-labels. Our plug-and-play framework is extensively evaluated on 3D medical image segmentation datasets, where it outperforms state-of-the-art semi-supervised methods, demonstrating its effectiveness in improving segmentation accuracy under limited supervision.
comment: The paper is under review at Pattern Recognition Journal
☆ NVSim: Novel View Synthesis Simulator for Large Scale Indoor Navigation
We present NVSim, a framework that automatically constructs large-scale, navigable indoor simulators from only common image sequences, overcoming the cost and scalability limitations of traditional 3D scanning. Our approach adapts 3D Gaussian Splatting to address visual artifacts on sparsely observed floors a common issue in robotic traversal data. We introduce Floor-Aware Gaussian Splatting to ensure a clean, navigable ground plane, and a novel mesh-free traversability checking algorithm that constructs a topological graph by directly analyzing rendered views. We demonstrate our system's ability to generate valid, large-scale navigation graphs from real-world data. A video demonstration is avilable at https://youtu.be/tTiIQt6nXC8
comment: 9 pages, 10 figures
☆ Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes
Purpose: Surgical scene understanding is key to advancing computer-aided and intelligent surgical systems. Current approaches predominantly rely on visual data or end-to-end learning, which limits fine-grained contextual modeling. This work aims to enhance surgical scene representations by integrating 3D acoustic information, enabling temporally and spatially aware multimodal understanding of surgical environments. Methods: We propose a novel framework for generating 4D audio-visual representations of surgical scenes by projecting acoustic localization information from a phased microphone array onto dynamic point clouds from an RGB-D camera. A transformer-based acoustic event detection module identifies relevant temporal segments containing tool-tissue interactions which are spatially localized in the audio-visual scene representation. The system was experimentally evaluated in a realistic operating room setup during simulated surgical procedures performed by experts. Results: The proposed method successfully localizes surgical acoustic events in 3D space and associates them with visual scene elements. Experimental evaluation demonstrates accurate spatial sound localization and robust fusion of multimodal data, providing a comprehensive, dynamic representation of surgical activity. Conclusion: This work introduces the first approach for spatial sound localization in dynamic surgical scenes, marking a significant advancement toward multimodal surgical scene representations. By integrating acoustic and visual data, the proposed framework enables richer contextual understanding and provides a foundation for future intelligent and autonomous surgical systems.
☆ What do vision-language models see in the context? Investigating multimodal in-context learning
In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. Although it has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic study of ICL in VLMs, evaluating seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL. To our knowledge, we are the first to analyze how attention patterns in VLMs vary with an increasing number of in-context demonstrations. Our results reveal that training on imag-text interleaved data enhances ICL performance but does not imply effective integration of visual and textual information from demonstration examples. In contrast, instruction tuning improves instruction-following but can reduce reliance on in-context demonstrations, suggesting a trade-off between instruction alignment and in-context adaptation. Attention analyses further show that current VLMs primarily focus on textual cues and fail to leverage visual information, suggesting a limited capacity for multimodal integration. These findings highlight key limitations in the ICL abilities of current VLMs and provide insights for enhancing their ability to learn from multimodal in-context examples.
☆ Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning
Remote sensing applications increasingly rely on deep learning for scene classification. However, their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor domains. While recent vision-language models like CLIP have shown promise by learning transferable representations at scale by aligning visual and textual modalities, their direct application to remote sensing remains suboptimal due to significant domain gaps and the need for task-specific semantic adaptation. To address this critical challenge, we systematically explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We evaluate several representative methods, including Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints. These approaches reflect complementary design philosophies: from static context optimization to conditional prompts for enhanced generalization, multi-modal prompts for joint vision-language adaptation, and semantically regularized prompts for stable learning without forgetting. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Through extensive experiments on multiple benchmark remote sensing datasets, including cross-dataset generalization tests, we demonstrate that prompt learning consistently outperforms both baselines in few-shot scenarios. Notably, Prompting with Self-Regulating Constraints achieves the most robust cross-domain performance. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery, providing a strong foundation for future research in this field.
☆ ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7% on seven comprehensive benchmarks spanning various tasks and up to 6.0% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough to developing more autonomous and capable VLMs.
☆ Training-free Source Attribution of AI-generated Images via Resynthesis
Synthetic image source attribution is a challenging task, especially in data scarcity conditions requiring few-shot or zero-shot classification capabilities. We present a new training-free one-shot attribution method based on image resynthesis. A prompt describing the image under analysis is generated, then it is used to resynthesize the image with all the candidate sources. The image is attributed to the model which produced the resynthesis closest to the original image in a proper feature space. We also introduce a new dataset for synthetic image attribution consisting of face images from commercial and open-source text-to-image generators. The dataset provides a challenging attribution framework, useful for developing new attribution models and testing their capabilities on different generative architectures. The dataset structure allows to test approaches based on resynthesis and to compare them to few-shot methods. Results from state-of-the-art few-shot approaches and other baselines show that the proposed resynthesis method outperforms existing techniques when only a few samples are available for training or fine-tuning. The experiments also demonstrate that the new dataset is a challenging one and represents a valuable benchmark for developing and evaluating future few-shot and zero-shot methods.
comment: 14 pages, 4 figures, 1 table, accepted at "The 17th IEEE INTERNATIONAL WORKSHOP ON INFORMATION FORENSICS AND SECURITY (WIFS2025)", Perth, Australia
☆ DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation NeurIPS 2025
Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
comment: Accepted to NeurIPS 2025
☆ UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation NeurIPS 2025
Data augmentation using generative models has emerged as a powerful paradigm for enhancing performance in computer vision tasks. However, most existing augmentation approaches primarily focus on optimizing intrinsic data attributes -- such as fidelity and diversity -- to generate visually high-quality synthetic data, while often neglecting task-specific requirements. Yet, it is essential for data generators to account for the needs of downstream tasks, as training data requirements can vary significantly across different tasks and network architectures. To address these limitations, we propose UtilGen, a novel utility-centric data augmentation framework that adaptively optimizes the data generation process to produce task-specific, high-utility training data via downstream task feedback. Specifically, we first introduce a weight allocation network to evaluate the task-specific utility of each synthetic sample. Guided by these evaluations, UtilGen iteratively refines the data generation process using a dual-level optimization strategy to maximize the synthetic data utility: (1) model-level optimization tailors the generative model to the downstream task, and (2) instance-level optimization adjusts generation policies -- such as prompt embeddings and initial noise -- at each generation round. Extensive experiments on eight benchmark datasets of varying complexity and granularity demonstrate that UtilGen consistently achieves superior performance, with an average accuracy improvement of 3.87% over previous SOTA. Further analysis of data influence and distribution reveals that UtilGen produces more impactful and task-relevant synthetic data, validating the effectiveness of the paradigm shift from visual characteristics-centric to task utility-centric data augmentation.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ DeshadowMamba: Deshadowing as 1D Sequential Similarity
Recent deep models for image shadow removal often rely on attention-based architectures to capture long-range dependencies. However, their fixed attention patterns tend to mix illumination cues from irrelevant regions, leading to distorted structures and inconsistent colors. In this work, we revisit shadow removal from a sequence modeling perspective and explore the use of Mamba, a selective state space model that propagates global context through directional state transitions. These transitions yield an efficient global receptive field while preserving positional continuity. Despite its potential, directly applying Mamba to image data is suboptimal, since it lacks awareness of shadow-non-shadow semantics and remains susceptible to color interference from nearby regions. To address these limitations, we propose CrossGate, a directional modulation mechanism that injects shadow-aware similarity into Mamba's input gate, allowing selective integration of relevant context along transition axes. To further ensure appearance fidelity, we introduce ColorShift regularization, a contrastive learning objective driven by global color statistics. By synthesizing structured informative negatives, it guides the model to suppress color contamination and achieve robust color restoration. Together, these components adapt sequence modeling to the structural integrity and chromatic consistency required for shadow removal. Extensive experiments on public benchmarks demonstrate that DeshadowMamba achieves state-of-the-art visual quality and strong quantitative performance.
☆ Delving into Cascaded Instability: A Lipschitz Continuity View on Image Restoration and Object Detection Synergy NeurIPS 2025
To improve detection robustness in adverse conditions (e.g., haze and low light), image restoration is commonly applied as a pre-processing step to enhance image quality for the detector. However, the functional mismatch between restoration and detection networks can introduce instability and hinder effective integration -- an issue that remains underexplored. We revisit this limitation through the lens of Lipschitz continuity, analyzing the functional differences between restoration and detection networks in both the input space and the parameter space. Our analysis shows that restoration networks perform smooth, continuous transformations, while object detectors operate with discontinuous decision boundaries, making them highly sensitive to minor perturbations. This mismatch introduces instability in traditional cascade frameworks, where even imperceptible noise from restoration is amplified during detection, disrupting gradient flow and hindering optimization. To address this, we propose Lipschitz-regularized object detection (LROD), a simple yet effective framework that integrates image restoration directly into the detector's feature learning, harmonizing the Lipschitz continuity of both tasks during training. We implement this framework as Lipschitz-regularized YOLO (LR-YOLO), extending seamlessly to existing YOLO detectors. Extensive experiments on haze and low-light benchmarks demonstrate that LR-YOLO consistently improves detection stability, optimization smoothness, and overall accuracy.
comment: NeurIPS 2025
☆ Benchmarking Microsaccade Recognition with Event Cameras: A Novel Dataset and Evaluation BMVC
Microsaccades are small, involuntary eye movements vital for visual perception and neural processing. Traditional microsaccade studies typically use eye trackers or frame-based analysis, which, while precise, are costly and limited in scalability and temporal resolution. Event-based sensing offers a high-speed, low-latency alternative by capturing fine-grained spatiotemporal changes efficiently. This work introduces a pioneering event-based microsaccade dataset to support research on small eye movement dynamics in cognitive computing. Using Blender, we render high-fidelity eye movement scenarios and simulate microsaccades with angular displacements from 0.5 to 2.0 degrees, divided into seven distinct classes. These are converted to event streams using v2e, preserving the natural temporal dynamics of microsaccades, with durations ranging from 0.25 ms to 2.25 ms. We evaluate the dataset using Spiking-VGG11, Spiking-VGG13, and Spiking-VGG16, and propose Spiking-VGG16Flow, an optical-flow-enhanced variant implemented in SpikingJelly. The models achieve around 90 percent average accuracy, successfully classifying microsaccades by angular displacement, independent of event count or duration. These results demonstrate the potential of spiking neural networks for fine motion recognition and establish a benchmark for event-based vision research. The dataset, code, and trained models will be publicly available at https://waseemshariff126.github.io/microsaccades/ .
comment: Accepted in British Machine Vision Conference (BMVC) 2025, Main Conference
☆ SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs NeurIPS 2025
Multimodal Large Language Models (MLLMs) typically process a large number of visual tokens, leading to considerable computational overhead, even though many of these tokens are redundant. Existing visual token pruning methods primarily focus on selecting the most salient tokens based on attention scores, resulting in the semantic incompleteness of the selected tokens. In this paper, we propose a novel visual token pruning strategy, called \textbf{S}aliency-\textbf{C}overage \textbf{O}riented token \textbf{P}runing for \textbf{E}fficient MLLMs (SCOPE), to jointly model both the saliency and coverage of the selected visual tokens to better preserve semantic completeness. Specifically, we introduce a set-coverage for a given set of selected tokens, computed based on the token relationships. We then define a token-coverage gain for each unselected token, quantifying how much additional coverage would be obtained by including it. By integrating the saliency score into the token-coverage gain, we propose our SCOPE score and iteratively select the token with the highest SCOPE score. We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-Next models. Experimental results demonstrate that our method consistently outperforms prior approaches. Our code is available at \href{https://github.com/kinredon/SCOPE}{https://github.com/kinredon/SCOPE}.
comment: NeurIPS 2025
☆ Beyond Inference Intervention: Identity-Decoupled Diffusion for Face Anonymization
Face anonymization aims to conceal identity information while preserving non-identity attributes. Mainstream diffusion models rely on inference-time interventions such as negative guidance or energy-based optimization, which are applied post-training to suppress identity features. These interventions often introduce distribution shifts and entangle identity with non-identity attributes, degrading visual fidelity and data utility. To address this, we propose \textbf{ID\textsuperscript{2}Face}, a training-centric anonymization framework that removes the need for inference-time optimization. The rationale of our method is to learn a structured latent space where identity and non-identity information are explicitly disentangled, enabling direct and controllable anonymization at inference. To this end, we design a conditional diffusion model with an identity-masked learning scheme. An Identity-Decoupled Latent Recomposer uses an Identity Variational Autoencoder to model identity features, while non-identity attributes are extracted from same-identity pairs and aligned through bidirectional latent alignment. An Identity-Guided Latent Harmonizer then fuses these representations via soft-gating conditioned on noisy feature prediction. The model is trained with a recomposition-based reconstruction loss to enforce disentanglement. At inference, anonymization is achieved by sampling a random identity vector from the learned identity space. To further suppress identity leakage, we introduce an Orthogonal Identity Mapping strategy that enforces orthogonality between sampled and source identity vectors. Experiments demonstrate that ID\textsuperscript{2}Face outperforms existing methods in visual quality, identity suppression, and utility preservation.
☆ MC-SJD : Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration
While autoregressive (AR) modeling has recently emerged as a new paradigm in visual generation, its practical adoption is severely constrained by the slow inference speed of per-token generation, which often requires thousands of steps to produce a single sample. To address this challenge, we propose MC-SJD, a training-free, lossless parallel decoding framework designed to accelerate AR visual generation by extending the recently introduced Speculative Jacobi Decoding (SJD). Although SJD shows strong potential for accelerating AR generation, we demonstrate that token instability across iterations significantly reduces the acceptance rate, a limitation that primarily arises from the independent sampling process used during draft token generation. To overcome this, we introduce MC-SJD, an information-theoretic approach based on coupling, which substantially accelerates standard SJD by maximizing the probability of sampling identical draft tokens across consecutive iterations, all while preserving its lossless property. Remarkably, this method requires only a single-line modification to the existing algorithm, yet achieves substantial performance gains, delivering up to a ~4.2x acceleration in image generation and ~13.3x acceleration in video generation compared to standard AR decoding, without any degradation in output quality.
☆ CLFSeg: A Fuzzy-Logic based Solution for Boundary Clarity and Uncertainty Reduction in Medical Image Segmentation BMVC
Accurate polyp and cardiac segmentation for early detection and treatment is essential for the diagnosis and treatment planning of cancer-like diseases. Traditional convolutional neural network (CNN) based models have represented limited generalizability, robustness, and inability to handle uncertainty, which affects the segmentation performance. To solve these problems, this paper introduces CLFSeg, an encoder-decoder based framework that aggregates the Fuzzy-Convolutional (FC) module leveraging convolutional layers and fuzzy logic. This module enhances the segmentation performance by identifying local and global features while minimizing the uncertainty, noise, and ambiguity in boundary regions, ensuring computing efficiency. In order to handle class imbalance problem while focusing on the areas of interest with tiny and boundary regions, binary cross-entropy (BCE) with dice loss is incorporated. Our proposed model exhibits exceptional performance on four publicly available datasets, including CVC-ColonDB, CVC-ClinicDB, EtisLaribPolypDB, and ACDC. Extensive experiments and visual studies show CLFSeg surpasses the existing SOTA performance and focuses on relevant regions of interest in anatomical structures. The proposed CLFSeg improves performance while ensuring computing efficiency, which makes it a potential solution for real-world medical diagnostic scenarios. Project page is available at https://visdomlab.github.io/CLFSeg/
comment: The 36th British Machine Vision Conference (BMVC) 2025
☆ Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2 NeurIPS 2025
Recent studies reveal the vulnerability of the image segmentation foundation model SAM to adversarial examples. Its successor, SAM2, has attracted significant attention due to its strong generalization capability in video segmentation. However, its robustness remains unexplored, and it is unclear whether existing attacks on SAM can be directly transferred to SAM2. In this paper, we first analyze the performance gap of existing attacks between SAM and SAM2 and highlight two key challenges arising from their architectural differences: directional guidance from the prompt and semantic entanglement across consecutive frames. To address these issues, we propose UAP-SAM2, the first cross-prompt universal adversarial attack against SAM2 driven by dual semantic deviation. For cross-prompt transferability, we begin by designing a target-scanning strategy that divides each frame into k regions, each randomly assigned a prompt, to reduce prompt dependency during optimization. For effectiveness, we design a dual semantic deviation framework that optimizes a UAP by distorting the semantics within the current frame and disrupting the semantic consistency across consecutive frames. Extensive experiments on six datasets across two segmentation tasks demonstrate the effectiveness of the proposed method for SAM2. The comparative results show that UAP-SAM2 significantly outperforms state-of-the-art (SOTA) attacks by a large margin.
comment: Accepted by NeurIPS 2025
☆ Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning IROS 2025
This technical report presents our solution for the RoboSense Challenge at IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components. First, a Mixture-of-Prompts router classifies questions and dispatches them to task-specific expert prompts, eliminating interference across diverse question types. Second, task-specific prompts embed explicit coordinate systems, spatial reasoning rules, role-playing, Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples tailored to each task. Third, a visual assembly module composes multi-view images with object crops, magenta markers, and adaptive historical frames based on question requirements. Fourth, we configure model inference parameters (temperature, top-p, message roles) per task to optimize output quality. Implemented on Qwen2.5-VL-72B, our approach achieves 70.87% average accuracy on Phase-1 (clean data) and 72.85% on Phase-2 (corrupted data), demonstrating that structured prompting and spatial grounding substantially enhance VLM performance on safety-critical autonomous driving tasks. Code and prompt are available at https://github.com/wuaodi/UCAS-CSU-phase2.
comment: RoboSense Challenge with IROS 2025
☆ MSRANetV2: An Explainable Deep Learning Architecture for Multi-class Classification of Colorectal Histopathological Images
Colorectal cancer (CRC) is a leading worldwide cause of cancer-related mortality, and the role of prompt precise detection is of paramount interest in improving patient outcomes. Conventional diagnostic methods such as colonoscopy and histological examination routinely exhibit subjectivity, are extremely time-consuming, and are susceptible to variation. Through the development of digital pathology, deep learning algorithms have become a powerful approach in enhancing diagnostic precision and efficiency. In our work, we proposed a convolutional neural network architecture named MSRANetV2, specially optimized for the classification of colorectal tissue images. The model employs a ResNet50V2 backbone, extended with residual attention mechanisms and squeeze-and-excitation (SE) blocks, to extract deep semantic and fine-grained spatial features. With channel alignment and upsampling operations, MSRANetV2 effectively fuses multi-scale representations, thereby enhancing the robustness of the classification. We evaluated our model on a five-fold stratified cross-validation strategy on two publicly available datasets: CRC-VAL-HE-7K and NCT-CRC-HE-100K. The proposed model achieved remarkable average Precision, recall, F1-score, AUC, and test accuracy were 0.9884 plus-minus 0.0151, 0.9900 plus-minus 0.0151, 0.9900 plus-minus 0.0145, 0.9999 plus-minus 0.00006, and 0.9905 plus-minus 0.0025 on the 7K dataset. On the 100K dataset, they were 0.9904 plus-minus 0.0091, 0.9900 plus-minus 0.0071, 0.9900 plus-minus 0.0071, 0.9997 plus-minus 0.00016, and 0.9902 plus-minus 0.0006. Additionally, Grad-CAM visualizations were incorporated to enhance model interpretability by highlighting tissue areas that are medically relevant. These findings validate that MSRANetV2 is a reliable, interpretable, and high-performing architectural model for classifying CRC tissues.
☆ VC4VG: Optimizing Video Captions for Text-to-Video Generation EMNLP 2025
Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models.We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements.Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/qyr0403/VC4VG to support further research.
comment: Accepted by EMNLP 2025
☆ Compositional Image Synthesis with Inference-Time Scaling
Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where a object-centric vision-language model (VLM) judge reranks multiple candidates to select the most prompt-aligned outcome iteratively. By unifying explicit layout-grounding with self-refine-based inference-time scaling, our framework achieves stronger scene alignment with prompts compared to recent text-to-image models. The code are available at https://github.com/gcl-inha/ReFocus.
comment: projcet page: https://github.com/gcl-inha/ReFocus
☆ ETC: training-free diffusion models acceleration with Error-aware Trend Consistency
Diffusion models have achieved remarkable generative quality but remain bottlenecked by costly iterative sampling. Recent training-free methods accelerate diffusion process by reusing model outputs. However, these methods ignore denoising trends and lack error control for model-specific tolerance, leading to trajectory deviations under multi-step reuse and exacerbating inconsistencies in the generated results. To address these issues, we introduce Error-aware Trend Consistency (ETC), a framework that (1) introduces a consistent trend predictor that leverages the smooth continuity of diffusion trajectories, projecting historical denoising patterns into stable future directions and progressively distributing them across multiple approximation steps to achieve acceleration without deviating; (2) proposes a model-specific error tolerance search mechanism that derives corrective thresholds by identifying transition points from volatile semantic planning to stable quality refinement. Experiments show that ETC achieves a 2.65x acceleration over FLUX with negligible (-0.074 SSIM score) degradation of consistency.
comment: 17 pages, 10 figures
☆ DogMo: A Large-Scale Multi-View RGB-D Dataset for 4D Canine Motion Recovery
We present DogMo, a large-scale multi-view RGB-D video dataset capturing diverse canine movements for the task of motion recovery from images. DogMo comprises 1.2k motion sequences collected from 10 unique dogs, offering rich variation in both motion and breed. It addresses key limitations of existing dog motion datasets, including the lack of multi-view and real 3D data, as well as limited scale and diversity. Leveraging DogMo, we establish four motion recovery benchmark settings that support systematic evaluation across monocular and multi-view, RGB and RGB-D inputs. To facilitate accurate motion recovery, we further introduce a three-stage, instance-specific optimization pipeline that fits the SMAL model to the motion sequences. Our method progressively refines body shape and pose through coarse alignment, dense correspondence supervision, and temporal regularization. Our dataset and method provide a principled foundation for advancing research in dog motion recovery and open up new directions at the intersection of computer vision, computer graphics, and animal behavior modeling.
comment: 19 pages
☆ UHKD: A Unified Framework for Heterogeneous Knowledge Distillation via Frequency-Domain Representations
Knowledge distillation (KD) is an effective model compression technique that transfers knowledge from a high-performance teacher to a lightweight student, reducing cost while maintaining accuracy. In visual applications, where large-scale image models are widely used, KD enables efficient deployment. However, architectural diversity introduces semantic discrepancies that hinder the use of intermediate representations. Most existing KD methods are designed for homogeneous models and degrade in heterogeneous scenarios, especially when intermediate features are involved. Prior studies mainly focus on the logits space, making limited use of the semantic information in intermediate layers. To address this limitation, Unified Heterogeneous Knowledge Distillation (UHKD) is proposed as a framework that leverages intermediate features in the frequency domain for cross-architecture transfer. Fourier transform is applied to capture global feature information, alleviating representational discrepancies between heterogeneous teacher-student pairs. A Feature Transformation Module (FTM) produces compact frequency-domain representations of teacher features, while a learnable Feature Alignment Module (FAM) projects student features and aligns them via multi-level matching. Training is guided by a joint objective combining mean squared error on intermediate features with Kullback-Leibler divergence on logits. Experiments on CIFAR-100 and ImageNet-1K demonstrate gains of 5.59% and 0.83% over the latest method, highlighting UHKD as an effective approach for unifying heterogeneous representations and enabling efficient utilization of visual knowledge
comment: 14 pages, 4 figures
☆ ZTRS: Zero-Imitation End-to-end Autonomous Driving with Trajectory Scoring
End-to-end autonomous driving maps raw sensor inputs directly into ego-vehicle trajectories to avoid cascading errors from perception modules and to leverage rich semantic cues. Existing frameworks largely rely on Imitation Learning (IL), which can be limited by sub-optimal expert demonstrations and covariate shift during deployment. On the other hand, Reinforcement Learning (RL) has recently shown potential in scaling up with simulations, but is typically confined to low-dimensional symbolic inputs (e.g. 3D objects and maps), falling short of full end-to-end learning from raw sensor data. We introduce ZTRS (Zero-Imitation End-to-End Autonomous Driving with Trajectory Scoring), a framework that combines the strengths of both worlds: sensor inputs without losing information and RL training for robust planning. To the best of our knowledge, ZTRS is the first framework that eliminates IL entirely by only learning from rewards while operating directly on high-dimensional sensor data. ZTRS utilizes offline reinforcement learning with our proposed Exhaustive Policy Optimization (EPO), a variant of policy gradient tailored for enumerable actions and rewards. ZTRS demonstrates strong performance across three benchmarks: Navtest (generic real-world open-loop planning), Navhard (open-loop planning in challenging real-world and synthetic scenarios), and HUGSIM (simulated closed-loop driving). Specifically, ZTRS achieves the state-of-the-art result on Navhard and outperforms IL-based baselines on HUGSIM. Code will be available at https://github.com/woxihuanjiangguo/ZTRS.
☆ Enhancing Pre-trained Representation Classifiability can Boost its Interpretability ICLR 2025
The visual representation of a pre-trained model prioritizes the classifiability on downstream tasks, while the widespread applications for pre-trained visual models have posed new requirements for representation interpretability. However, it remains unclear whether the pre-trained representations can achieve high interpretability and classifiability simultaneously. To answer this question, we quantify the representation interpretability by leveraging its correlation with the ratio of interpretable semantics within the representations. Given the pre-trained representations, only the interpretable semantics can be captured by interpretations, whereas the uninterpretable part leads to information loss. Based on this fact, we propose the Inherent Interpretability Score (IIS) that evaluates the information loss, measures the ratio of interpretable semantics, and quantifies the representation interpretability. In the evaluation of the representation interpretability with different classifiability, we surprisingly discover that the interpretability and classifiability are positively correlated, i.e., representations with higher classifiability provide more interpretable semantics that can be captured in the interpretations. This observation further supports two benefits to the pre-trained representations. First, the classifiability of representations can be further improved by fine-tuning with interpretability maximization. Second, with the classifiability improvement for the representations, we obtain predictions based on their interpretations with less accuracy degradation. The discovered positive correlation and corresponding applications show that practitioners can unify the improvements in interpretability and classifiability for pre-trained vision models. Codes are available at https://github.com/ssfgunner/IIS.
comment: ICLR 2025 (Spotlight)
☆ OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation
Recent advancements in diffusion-based text synthesis have demonstrated significant performance in inserting and editing text within images via inpainting. However, despite the potential of text inpainting methods, three key limitations hinder their applicability to broader Text Image Manipulation (TIM) tasks: (i) the inability to remove text, (ii) the lack of control over the style of rendered text, and (iii) a tendency to generate duplicated letters. To address these challenges, we propose OmniText, a training-free generalist capable of performing a wide range of TIM tasks. Specifically, we investigate two key properties of cross- and self-attention mechanisms to enable text removal and to provide control over both text styles and content. Our findings reveal that text removal can be achieved by applying self-attention inversion, which mitigates the model's tendency to focus on surrounding text, thus reducing text hallucinations. Additionally, we redistribute cross-attention, as increasing the probability of certain text tokens reduces text hallucination. For controllable inpainting, we introduce novel loss functions in a latent optimization framework: a cross-attention content loss to improve text rendering accuracy and a self-attention style loss to facilitate style customization. Furthermore, we present OmniText-Bench, a benchmark dataset for evaluating diverse TIM tasks. It includes input images, target text with masks, and style references, covering diverse applications such as text removal, rescaling, repositioning, and insertion and editing with various styles. Our OmniText framework is the first generalist method capable of performing diverse TIM tasks. It achieves state-of-the-art performance across multiple tasks and metrics compared to other text inpainting methods and is comparable with specialist methods.
comment: The first two authors contributed equally to this work. The last two authors are co-corresponding authors
☆ Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating effective synthetic training data for classification remains challenging. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data; however, it may also cause overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification. Given a small set of real examples, we first extract class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, preserves the T2I model's generative prior, reduces estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets show that our method achieves state-of-the-art performance in low-shot fine-grained classification when augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with five real images augmented with 100 synthetic images). In three of the four benchmarks, fine-tuning downstream models with 5 real images augmented with BOB achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with 2+% accuracy improvements in 14 of these settings.
☆ Enhancing CLIP Robustness via Cross-Modality Alignment NeurIPS 2025
Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization; they often overlook the gaps in CLIP's encoded features, which is shown as the text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose Cross-modality Alignment, dubbed COLA, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space. (1) COLA first projects adversarial image embeddings onto a subspace spanned by class text features, effectively filtering out non-semantic distortions while preserving discriminative information. (2) It then models images and texts as discrete distributions over multiple augmented views and refines their alignment via OT, with the subspace projection seamlessly integrated into the cost computation. This design ensures stable cross-modal alignment even under adversarial conditions. COLA is training-free and compatible with existing fine-tuned models. Extensive evaluations across 14 zero-shot classification benchmarks demonstrate the effectiveness of COLA, especially with an average improvement of 6.7% on ImageNet and its variants under PGD adversarial attacks, while maintaining high accuracy on clean samples.
comment: NeurIPS 2025 Spotlight
☆ Kernelized Sparse Fine-Tuning with Bi-level Parameter Competition for Vision Models
Parameter-efficient fine-tuning (PEFT) aims to adapt pre-trained vision models to downstream tasks. Among PEFT paradigms, sparse tuning achieves remarkable performance by adjusting only the weights most relevant to downstream tasks, rather than densely tuning the entire weight matrix. Current methods follow a two-stage paradigm. First, it locates task-relevant weights by gradient information, which overlooks the parameter adjustments during fine-tuning and limits the performance. Second, it updates only the located weights by applying a sparse mask to the gradient of the weight matrix, which results in high memory usage due to the storage of all weight matrices in the optimizer. In this paper, we propose a one-stage method named SNELLA to overcome the above limitations. For memory usage, SNELLA selectively updates the weight matrix by adding it to another sparse matrix that is merged by two low-rank learnable matrices. We extend the low-rank decomposition by introducing nonlinear kernel functions, thereby increasing the rank of the resulting merged matrix to prevent the interdependency among weight updates, enabling better adaptation to downstream tasks. For locating task-relevant weights, we propose an adaptive bi-level sparsity allocation mechanism that encourages weights to compete across and inside layers based on their importance scores in an end-to-end manner. Extensive experiments are conducted on classification, segmentation, and generation tasks using different pre-trained vision models. The results show that SNELLA achieves SOTA performance with low memory usage. Notably, SNELLA obtains 1.8% (91.9% v.s. 90.1%) higher Top-1 accuracy on the FGVC benchmark compared to SPT-LoRA. Compared to previous methods, SNELLA achieves a memory reduction of 31.1%-39.9% across models with parameter scales from 86M to 632M. Our source codes are available at https://github.com/ssfgunner/SNELL.
☆ ResNet: Enabling Deep Convolutional Neural Networks through Residual Learning
Convolutional Neural Networks (CNNs) has revolutionized computer vision, but training very deep networks has been challenging due to the vanishing gradient problem. This paper explores Residual Networks (ResNet), introduced by He et al. (2015), which overcomes this limitation by using skip connections. ResNet enables the training of networks with hundreds of layers by allowing gradients to flow directly through shortcut connections that bypass intermediate layers. In our implementation on the CIFAR-10 dataset, ResNet-18 achieves 89.9% accuracy compared to 84.1% for a traditional deep CNN of similar depth, while also converging faster and training more stably.
comment: 3 pages, 5 figures, 1 table
☆ AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts ICCV 2025
Despite rapid advancements in text-to-image (T2I) models, their safety mechanisms are vulnerable to adversarial prompts, which maliciously generate unsafe images. Current red-teaming methods for proactively assessing such vulnerabilities usually require white-box access to T2I models, and rely on inefficient per-prompt optimization, as well as inevitably generate semantically meaningless prompts easily blocked by filters. In this paper, we propose APT (AutoPrompT), a black-box framework that leverages large language models (LLMs) to automatically generate human-readable adversarial suffixes for benign prompts. We first introduce an alternating optimization-finetuning pipeline between adversarial suffix optimization and fine-tuning the LLM utilizing the optimized suffix. Furthermore, we integrates a dual-evasion strategy in optimization phase, enabling the bypass of both perplexity-based filter and blacklist word filter: (1) we constrain the LLM generating human-readable prompts through an auxiliary LLM perplexity scoring, which starkly contrasts with prior token-level gibberish, and (2) we also introduce banned-token penalties to suppress the explicit generation of banned-tokens in blacklist. Extensive experiments demonstrate the excellent red-teaming performance of our human-readable, filter-resistant adversarial prompts, as well as superior zero-shot transferability which enables instant adaptation to unseen prompts and exposes critical vulnerabilities even in commercial APIs (e.g., Leonardo.Ai.).
comment: Accepted by ICCV 2025
☆ Listening without Looking: Modality Bias in Audio-Visual Captioning
Audio-visual captioning aims to generate holistic scene descriptions by jointly modeling sound and vision. While recent methods have improved performance through sophisticated modality fusion, it remains unclear to what extent the two modalities are complementary in current audio-visual captioning models and how robust these models are when one modality is degraded. We address these questions by conducting systematic modality robustness tests on LAVCap, a state-of-the-art audio-visual captioning model, in which we selectively suppress or corrupt the audio or visual streams to quantify sensitivity and complementarity. The analysis reveals a pronounced bias toward the audio stream in LAVCap. To evaluate how balanced audio-visual captioning models are in their use of both modalities, we augment AudioCaps with textual annotations that jointly describe the audio and visual streams, yielding the AudioVisualCaps dataset. In our experiments, we report LAVCap baseline results on AudioVisualCaps. We also evaluate the model under modality robustness tests on AudioVisualCaps and the results indicate that LAVCap trained on AudioVisualCaps exhibits less modality bias than when trained on AudioCaps.
comment: under review
☆ Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks NeurIPS 2025
Foundation models have enabled rapid progress across many specialized domains by leveraging large-scale pre-training on unlabeled data, demonstrating strong generalization to a variety of downstream tasks. While such models have gained significant attention in fields like Earth Observation, their application to Mars science remains limited. A key enabler of progress in other domains has been the availability of standardized benchmarks that support systematic evaluation. In contrast, Mars science lacks such benchmarks and standardized evaluation frameworks, which have limited progress toward developing foundation models for Martian tasks. To address this gap, we introduce Mars-Bench, the first benchmark designed to systematically evaluate models across a broad range of Mars-related tasks using both orbital and surface imagery. Mars-Bench comprises 20 datasets spanning classification, segmentation, and object detection, focused on key geologic features such as craters, cones, boulders, and frost. We provide standardized, ready-to-use datasets and baseline evaluations using models pre-trained on natural images, Earth satellite data, and state-of-the-art vision-language models. Results from all analyses suggest that Mars-specific foundation models may offer advantages over general-domain counterparts, motivating further exploration of domain-adapted pre-training. Mars-Bench aims to establish a standardized foundation for developing and comparing machine learning models for Mars science. Our data, models, and code are available at: https://mars-bench.github.io/.
comment: Accepted at NeurIPS 2025
☆ Towards the Automatic Segmentation, Modeling and Meshing of the Aortic Vessel Tree from Multicenter Acquisitions: An Overview of the SEG.A. 2023 Segmentation of the Aorta Challenge
The automated analysis of the aortic vessel tree (AVT) from computed tomography angiography (CTA) holds immense clinical potential, but its development has been impeded by a lack of shared, high-quality data. We launched the SEG.A. challenge to catalyze progress in this field by introducing a large, publicly available, multi-institutional dataset for AVT segmentation. The challenge benchmarked automated algorithms on a hidden test set, with subsequent optional tasks in surface meshing for computational simulations. Our findings reveal a clear convergence on deep learning methodologies, with 3D U-Net architectures dominating the top submissions. A key result was that an ensemble of the highest-ranking algorithms significantly outperformed individual models, highlighting the benefits of model fusion. Performance was strongly linked to algorithmic design, particularly the use of customized post-processing steps, and the characteristics of the training data. This initiative not only establishes a new performance benchmark but also provides a lasting resource to drive future innovation toward robust, clinically translatable tools.
☆ AdvBlur: Adversarial Blur for Robust Diabetic Retinopathy Classification and Cross-Domain Generalization
Diabetic retinopathy (DR) is a leading cause of vision loss worldwide, yet early and accurate detection can significantly improve treatment outcomes. While numerous Deep learning (DL) models have been developed to predict DR from fundus images, many face challenges in maintaining robustness due to distributional variations caused by differences in acquisition devices, demographic disparities, and imaging conditions. This paper addresses this critical limitation by proposing a novel DR classification approach, a method called AdvBlur. Our method integrates adversarial blurred images into the dataset and employs a dual-loss function framework to address domain generalization. This approach effectively mitigates the impact of unseen distributional variations, as evidenced by comprehensive evaluations across multiple datasets. Additionally, we conduct extensive experiments to explore the effects of factors such as camera type, low-quality images, and dataset size. Furthermore, we perform ablation studies on blurred images and the loss function to ensure the validity of our choices. The experimental results demonstrate the effectiveness of our proposed method, achieving competitive performance compared to state-of-the-art domain generalization DR models on unseen external datasets.
☆ TeleEgo: Benchmarking Egocentric AI Assistants in the Wild
Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce \textbf{TeleEgo}, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work \& study, lifestyle \& routines, social activities, and outings \& culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement.TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose two key metrics -- Real-Time Accuracy and Memory Persistence Time -- to jointly assess correctness, temporal responsiveness, and long-term retention. TeleEgo provides a realistic and comprehensive evaluation to advance the development of practical AI assistants.
☆ Efficient Cost-and-Quality Controllable Arbitrary-scale Super-resolution with Fourier Constraints
Cost-and-Quality (CQ) controllability in arbitrary-scale super-resolution is crucial. Existing methods predict Fourier components one by one using a recurrent neural network. However, this approach leads to performance degradation and inefficiency due to independent prediction. This paper proposes predicting multiple components jointly to improve both quality and efficiency.
comment: 9 pages
☆ Synergistic Neural Forecasting of Air Pollution with Stochastic Sampling
Air pollution remains a leading global health and environmental risk, particularly in regions vulnerable to episodic air pollution spikes due to wildfires, urban haze and dust storms. Accurate forecasting of particulate matter (PM) concentrations is essential to enable timely public health warnings and interventions, yet existing models often underestimate rare but hazardous pollution events. Here, we present SynCast, a high-resolution neural forecasting model that integrates meteorological and air composition data to improve predictions of both average and extreme pollution levels. Built on a regionally adapted transformer backbone and enhanced with a diffusion-based stochastic refinement module, SynCast captures the nonlinear dynamics driving PM spikes more accurately than existing approaches. Leveraging on harmonized ERA5 and CAMS datasets, our model shows substantial gains in forecasting fidelity across multiple PM variables (PM$_1$, PM$_{2.5}$, PM$_{10}$), especially under extreme conditions. We demonstrate that conventional loss functions underrepresent distributional tails (rare pollution events) and show that SynCast, guided by domain-aware objectives and extreme value theory, significantly enhances performance in highly impacted regions without compromising global accuracy. This approach provides a scalable foundation for next-generation air quality early warning systems and supports climate-health risk mitigation in vulnerable regions.
☆ Reasoning Visual Language Model for Chest X-Ray Analysis
Vision-language models (VLMs) have shown strong promise for medical image analysis, but most remain opaque, offering predictions without the transparent, stepwise reasoning clinicians rely on. We present a framework that brings chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by reasoning-first training paradigms, our approach is designed to learn how experts reason, not just what they conclude, by aligning intermediate steps with observable image evidence and radiology workflow. Beyond accuracy, the explicit reasoning traces support clinical auditability: they reveal why a conclusion was reached, which alternatives were considered, and where uncertainty remains, enabling quality assurance, error analysis, and safer human-AI collaboration. Our model couples high-fidelity visual encoding with a two-stage training recipe: a reasoning-style supervised fine-tuning (SFT) followed by reinforcement learning (RL) that uses verifiable rewards over a list of X-ray abnormalities. The model outputs reasoning that mirrors radiologists systematic thought process, uncertainty, and differential diagnosis. In out-of-distribution evaluation, the approach achieves competitive multi-label classification while improving interpretability. In a reader study with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced time to finalize reports. We release code and the model NV-Reason-CXR-3B to support community progress toward trustworthy, explainable AI in chest radiography and other medical imaging tasks where reasoning quality is as critical as prediction quality.
comment: NV-Reason-CXR-3B
☆ SafeVision: Efficient Image Guardrail with Robust Policy Adherence and Explainability
With the rapid proliferation of digital media, the need for efficient and transparent safeguards against unsafe content is more critical than ever. Traditional image guardrail models, constrained by predefined categories, often misclassify content due to their pure feature-based learning without semantic reasoning. Moreover, these models struggle to adapt to emerging threats, requiring costly retraining for new threats. To address these limitations, we introduce SafeVision, a novel image guardrail that integrates human-like reasoning to enhance adaptability and transparency. Our approach incorporates an effective data collection and generation framework, a policy-following training pipeline, and a customized loss function. We also propose a diverse QA generation and training strategy to enhance learning effectiveness. SafeVision dynamically aligns with evolving safety policies at inference time, eliminating the need for retraining while ensuring precise risk assessments and explanations. Recognizing the limitations of existing unsafe image benchmarks, which either lack granularity or cover limited risks, we introduce VisionHarm, a high-quality dataset comprising two subsets: VisionHarm Third-party (VisionHarm-T) and VisionHarm Comprehensive(VisionHarm-C), spanning diverse harmful categories. Through extensive experiments, we show that SafeVision achieves state-of-the-art performance on different benchmarks. SafeVision outperforms GPT-4o by 8.6% on VisionHarm-T and by 15.5% on VisionHarm-C, while being over 16x faster. SafeVision sets a comprehensive, policy-following, and explainable image guardrail with dynamic adaptation to emerging threats.
comment: 42 pages, 9 figures
☆ Neural USD: An object-centric framework for iterative editing and control
Amazing progress has been made in controllable generative modeling, especially over the last few years. However, some challenges remain. One of them is precise and iterative object editing. In many of the current methods, trying to edit the generated image (for example, changing the color of a particular object in the scene or changing the background while keeping other elements unchanged) by changing the conditioning signals often leads to unintended global changes in the scene. In this work, we take the first steps to address the above challenges. Taking inspiration from the Universal Scene Descriptor (USD) standard developed in the computer graphics community, we introduce the "Neural Universal Scene Descriptor" or Neural USD. In this framework, we represent scenes and objects in a structured, hierarchical manner. This accommodates diverse signals, minimizes model-specific constraints, and enables per-object control over appearance, geometry, and pose. We further apply a fine-tuning approach which ensures that the above control signals are disentangled from one another. We evaluate several design considerations for our framework, demonstrating how Neural USD enables iterative and incremental workflows. More information at: https://escontrela.me/neural_usd .
comment: 22 pages, 16 figures, 1 table
☆ Efficient License Plate Recognition via Pseudo-Labeled Supervision with Grounding DINO and YOLOv8 SP
Developing a highly accurate automatic license plate recognition system (ALPR) is challenging due to environmental factors such as lighting, rain, and dust. Additional difficulties include high vehicle speeds, varying camera angles, and low-quality or low-resolution images. ALPR is vital in traffic control, parking, vehicle tracking, toll collection, and law enforcement applications. This paper proposes a deep learning strategy using YOLOv8 for license plate detection and recognition tasks. This method seeks to enhance the performance of the model using datasets from Ontario, Quebec, California, and New York State. It achieved an impressive recall rate of 94% on the dataset from the Center for Pattern Recognition and Machine Intelligence (CENPARMI) and 91% on the UFPR-ALPR dataset. In addition, our method follows a semi-supervised learning framework, combining a small set of manually labeled data with pseudo-labels generated by Grounding DINO to train our detection model. Grounding DINO, a powerful vision-language model, automatically annotates many images with bounding boxes for license plates, thereby minimizing the reliance on labor-intensive manual labeling. By integrating human-verified and model-generated annotations, we can scale our dataset efficiently while maintaining label quality, which significantly enhances the training process and overall model performance. Furthermore, it reports character error rates for both datasets, providing additional insight into system performance.
comment: 6 pages, 8 figures. Presented at 2025 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), August 31 - September 3, 2025, Istanbul, Turkey
☆ Resi-VidTok: An Efficient and Decomposed Progressive Tokenization Framework for Ultra-Low-Rate and Lightweight Video Transmission
Real-time transmission of video over wireless networks remains highly challenging, even with advanced deep models, particularly under severe channel conditions such as limited bandwidth and weak connectivity. In this paper, we propose Resi-VidTok, a Resilient Tokenization-Enabled framework designed for ultra-low-rate and lightweight video transmission that delivers strong robustness while preserving perceptual and semantic fidelity on commodity digital hardware. By reorganizing spatio--temporal content into a discrete, importance-ordered token stream composed of key tokens and refinement tokens, Resi-VidTok enables progressive encoding, prefix-decodable reconstruction, and graceful quality degradation under constrained channels. A key contribution is a resilient 1D tokenization pipeline for video that integrates differential temporal token coding, explicitly supporting reliable recovery from incomplete token sets using a single shared framewise decoder--without auxiliary temporal extractors or heavy generative models. Furthermore, stride-controlled frame sparsification combined with a lightweight decoder-side interpolator reduces transmission load while maintaining motion continuity. Finally, a channel-adaptive source--channel coding and modulation scheme dynamically allocates rate and protection according to token importance and channel condition, yielding stable quality across adverse SNRs. Evaluation results indicate robust visual and semantic consistency at channel bandwidth ratios (CBR) as low as 0.0004 and real-time reconstruction at over 30 fps, demonstrating the practicality of Resi-VidTok for energy-efficient, latency-sensitive, and reliability-critical wireless applications.
☆ FT-ARM: Fine-Tuned Agentic Reflection Multimodal Language Model for Pressure Ulcer Severity Classification with Reasoning
Pressure ulcers (PUs) are a serious and prevalent healthcare concern. Accurate classification of PU severity (Stages I-IV) is essential for proper treatment but remains challenging due to subtle visual distinctions and subjective interpretation, leading to variability among clinicians. Prior AI-based approaches using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) achieved promising accuracy but offered limited interpretability. We present FT-ARM (Fine-Tuned Agentic Reflection Multimodal model), a fine-tuned multimodal large language model (MLLM) with an agentic self-reflection mechanism for pressure ulcer severity classification. Inspired by clinician-style diagnostic reassessment, FT-ARM iteratively refines its predictions by reasoning over visual features and encoded clinical knowledge from text, enhancing both accuracy and consistency. On the publicly available Pressure Injury Image Dataset (PIID), FT-ARM, fine-tuned from LLaMA 3.2 90B, achieved 85% accuracy in classifying PU stages I-IV, surpassing prior CNN-based models by +4%. Unlike earlier CNN/ViT studies that relied solely on offline evaluations, FT-ARM is designed and tested for live inference, reflecting real-time deployment conditions. Furthermore, it produces clinically grounded natural-language explanations, improving interpretability and trust. By integrating fine-tuning and reflective reasoning across multimodal inputs, FT-ARM advances the reliability, transparency, and clinical applicability of automated wound assessment systems, addressing the critical need for consistent and explainable PU staging to support improved patient care.
☆ SCOUT: A Lightweight Framework for Scenario Coverage Assessment in Autonomous Driving
Assessing scenario coverage is crucial for evaluating the robustness of autonomous agents, yet existing methods rely on expensive human annotations or computationally intensive Large Vision-Language Models (LVLMs). These approaches are impractical for large-scale deployment due to cost and efficiency constraints. To address these shortcomings, we propose SCOUT (Scenario Coverage Oversight and Understanding Tool), a lightweight surrogate model designed to predict scenario coverage labels directly from an agent's latent sensor representations. SCOUT is trained through a distillation process, learning to approximate LVLM-generated coverage labels while eliminating the need for continuous LVLM inference or human annotation. By leveraging precomputed perception features, SCOUT avoids redundant computations and enables fast, scalable scenario coverage estimation. We evaluate our method across a large dataset of real-life autonomous navigation scenarios, demonstrating that it maintains high accuracy while significantly reducing computational cost. Our results show that SCOUT provides an effective and practical alternative for large-scale coverage analysis. While its performance depends on the quality of LVLM-generated training labels, SCOUT represents a major step toward efficient scenario coverage oversight in autonomous systems.
☆ IBIS: A Powerful Hybrid Architecture for Human Activity Recognition
The increasing interest in Wi-Fi sensing stems from its potential to capture environmental data in a low-cost, non-intrusive way, making it ideal for applications like healthcare, space occupancy analysis, and gesture-based IoT control. However, a major limitation in this field is the common problem of overfitting, where models perform well on training data but fail to generalize to new data. To overcome this, we introduce a novel hybrid architecture that integrates Inception-BiLSTM with a Support Vector Machine (SVM), which we refer to as IBIS. Our IBIS approach is uniquely engineered to improve model generalization and create more robust classification boundaries. By applying this method to Doppler-derived data, we achieve a movement recognition accuracy of nearly 99%. Comprehensive performance metrics and confusion matrices confirm the significant effectiveness of our proposed solution.
comment: 8 pages. 8 figures. Wireless Days Conference, December 2025
☆ Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning
In multimodal learning, dominant modalities often overshadow others, limiting generalization. We propose Modality-Aware Sharpness-Aware Minimization (M-SAM), a model-agnostic framework that applies to many modalities and supports early and late fusion scenarios. In every iteration, M-SAM in three steps optimizes learning. \textbf{First, it identifies the dominant modality} based on modalities' contribution in the accuracy using Shapley. \textbf{Second, it decomposes the loss landscape}, or in another language, it modulates the loss to prioritize the robustness of the model in favor of the dominant modality, and \textbf{third, M-SAM updates the weights} by backpropagation of modulated gradients. This ensures robust learning for the dominant modality while enhancing contributions from others, allowing the model to explore and exploit complementary features that strengthen overall performance. Extensive experiments on four diverse datasets show that M-SAM outperforms the latest state-of-the-art optimization and gradient manipulation methods and significantly balances and improves multimodal learning.
☆ Understanding Multi-View Transformers ICCV 2025
Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. However, contrary to previous optimization-based pipelines, the inner mechanisms of multi-view transformers are unclear. Their black-box nature makes further improvements beyond data scaling challenging and complicates usage in safety- and reliability-critical applications. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers' layers. In this manner, we investigate a variant of the DUSt3R model, shedding light on the development of its latent state across blocks, the role of the individual layers, and suggest how it differs from methods with stronger inductive biases of explicit global pose. Finally, we show that the investigated variant of DUSt3R estimates correspondences that are refined with reconstructed geometry. The code used for the analysis is available at https://github.com/JulienGaubil/und3rstand .
comment: Presented at the ICCV 2025 E2E3D Workshop
☆ VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos
Although recent text-to-video generative models are getting more capable of following external camera controls, imposed by either text descriptions or camera trajectories, they still struggle to generalize to unconventional camera motions, which is crucial in creating truly original and artistic videos. The challenge lies in the difficulty of finding sufficient training videos with the intended uncommon camera motions. To address this challenge, we propose VividCam, a training paradigm that enables diffusion models to learn complex camera motions from synthetic videos, releasing the reliance on collecting realistic training videos. VividCam incorporates multiple disentanglement strategies that isolates camera motion learning from synthetic appearance artifacts, ensuring more robust motion representation and mitigating domain shift. We demonstrate that our design synthesizes a wide range of precisely controlled and complex camera motions using surprisingly simple synthetic data. Notably, this synthetic data often consists of basic geometries within a low-poly 3D scene and can be efficiently rendered by engines like Unity. Our video results can be found in https://wuqiuche.github.io/VividCamDemoPage/ .
comment: 19 pages, 9 figures
☆ Pixels to Signals: A Real-Time Framework for Traffic Demand Estimation
Traffic congestion is becoming a challenge in the rapidly growing urban cities, resulting in increasing delays and inefficiencies within urban transportation systems. To address this issue a comprehensive methodology is designed to optimize traffic flow and minimize delays. The framework is structured with three primary components: (a) vehicle detection, (b) traffic prediction, and (c) traffic signal optimization. This paper presents the first component, vehicle detection. The methodology involves analyzing multiple sequential frames from a camera feed to compute the background, i.e. the underlying roadway, by averaging pixel values over time. The computed background is then utilized to extract the foreground, where the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is applied to detect vehicles. With its computational efficiency and minimal infrastructure modification requirements, the proposed methodology offers a practical and scalable solution for real-world deployment.
☆ Proper Body Landmark Subset Enables More Accurate and 5X Faster Recognition of Isolated Signs in LIBRAS
This paper investigates the feasibility of using lightweight body landmark detection for the recognition of isolated signs in Brazilian Sign Language (LIBRAS). Although the skeleton-based approach by Alves et al. (2024) enabled substantial improvements in recognition performance, the use of OpenPose for landmark extraction hindered time performance. In a preliminary investigation, we observed that simply replacing OpenPose with the lightweight MediaPipe, while improving processing speed, significantly reduced accuracy. To overcome this limitation, we explored landmark subset selection strategies aimed at optimizing recognition performance. Experimental results showed that a proper landmark subset achieves comparable or superior performance to state-of-the-art methods while reducing processing time by more than 5X compared to Alves et al. (2024). As an additional contribution, we demonstrated that spline-based imputation effectively mitigates missing landmark issues, leading to substantial accuracy gains. These findings highlight that careful landmark selection, combined with simple imputation techniques, enables efficient and accurate isolated sign recognition, paving the way for scalable Sign Language Recognition systems.
comment: Submitted to Int. Conf. on Computer Vision Theory and Applications (VISAPP 2026)
☆ FruitProm: Probabilistic Maturity Estimation and Detection of Fruits and Vegetables
Maturity estimation of fruits and vegetables is a critical task for agricultural automation, directly impacting yield prediction and robotic harvesting. Current deep learning approaches predominantly treat maturity as a discrete classification problem (e.g., unripe, ripe, overripe). This rigid formulation, however, fundamentally conflicts with the continuous nature of the biological ripening process, leading to information loss and ambiguous class boundaries. In this paper, we challenge this paradigm by reframing maturity estimation as a continuous, probabilistic learning task. We propose a novel architectural modification to the state-of-the-art, real-time object detector, RT-DETRv2, by introducing a dedicated probabilistic head. This head enables the model to predict a continuous distribution over the maturity spectrum for each detected object, simultaneously learning the mean maturity state and its associated uncertainty. This uncertainty measure is crucial for downstream decision-making in robotics, providing a confidence score for tasks like selective harvesting. Our model not only provides a far richer and more biologically plausible representation of plant maturity but also maintains exceptional detection performance, achieving a mean Average Precision (mAP) of 85.6\% on a challenging, large-scale fruit dataset. We demonstrate through extensive experiments that our probabilistic approach offers more granular and accurate maturity assessments than its classification-based counterparts, paving the way for more intelligent, uncertainty-aware automated systems in modern agriculture
comment: Sidharth Rai, Rahul Harsha Cheppally contributed equally to this work
☆ Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation
We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning intensive settings because they don't verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics -- ACLE, ARGUE, and RAGAS -- demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.
comment: https://github.com/alexmartin1722/mirage
☆ The Generation Phases of Flow Matching: a Denoising Perspective
Flow matching has achieved remarkable success, yet the factors influencing the quality of its generation process remain poorly understood. In this work, we adopt a denoising perspective and design a framework to empirically probe the generation process. Laying down the formal connections between flow matching models and denoisers, we provide a common ground to compare their performances on generation and denoising. This enables the design of principled and controlled perturbations to influence sample generation: noise and drift. This leads to new insights on the distinct dynamical phases of the generative process, enabling us to precisely characterize at which stage of the generative process denoisers succeed or fail and why this matters.
☆ MCIHN: A Hybrid Network Model Based on Multi-path Cross-modal Interaction for Multimodal Emotion Recognition
Multimodal emotion recognition is crucial for future human-computer interaction. However, accurate emotion recognition still faces significant challenges due to differences between different modalities and the difficulty of characterizing unimodal emotional information. To solve these problems, a hybrid network model based on multipath cross-modal interaction (MCIHN) is proposed. First, adversarial autoencoders (AAE) are constructed separately for each modality. The AAE learns discriminative emotion features and reconstructs the features through a decoder to obtain more discriminative information about the emotion classes. Then, the latent codes from the AAE of different modalities are fed into a predefined Cross-modal Gate Mechanism model (CGMM) to reduce the discrepancy between modalities, establish the emotional relationship between interacting modalities, and generate the interaction features between different modalities. Multimodal fusion using the Feature Fusion module (FFM) for better emotion recognition. Experiments were conducted on publicly available SIMS and MOSI datasets, demonstrating that MCIHN achieves superior performance.
comment: The paper will be published in the MMAsia2025 conference proceedings
☆ Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-Flash-Omni introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.
comment: 18 pages, 5 figures
☆ SafeEditor: Unified MLLM for Efficient Post-hoc T2I Safety Editing
With the rapid advancement of text-to-image (T2I) models, ensuring their safety has become increasingly critical. Existing safety approaches can be categorized into training-time and inference-time methods. While inference-time methods are widely adopted due to their cost-effectiveness, they often suffer from limitations such as over-refusal and imbalance between safety and utility. To address these challenges, we propose a multi-round safety editing framework that functions as a model-agnostic, plug-and-play module, enabling efficient safety alignment for any text-to-image model. Central to this framework is MR-SafeEdit, a multi-round image-text interleaved dataset specifically constructed for safety editing in text-to-image generation. We introduce a post-hoc safety editing paradigm that mirrors the human cognitive process of identifying and refining unsafe content. To instantiate this paradigm, we develop SafeEditor, a unified MLLM capable of multi-round safety editing on generated images. Experimental results show that SafeEditor surpasses prior safety approaches by reducing over-refusal while achieving a more favorable safety-utility balance.
☆ Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection
The advent of multi-modal large language models (MLLMs) has greatly advanced research into applications for Video fake news detection (VFND) tasks. Traditional video-based FND benchmarks typically focus on the accuracy of the final decision, often failing to provide fine-grained assessments for the entire detection process, making the detection process a black box. Therefore, we introduce the MVFNDB (Multi-modal Video Fake News Detection Benchmark) based on the empirical analysis, which provides foundation for tasks definition. The benchmark comprises 10 tasks and is meticulously crafted to probe MLLMs' perception, understanding, and reasoning capacities during detection, featuring 9730 human-annotated video-related questions based on a carefully constructed taxonomy ability of VFND. To validate the impact of combining multiple features on the final results, we design a novel framework named MVFND-CoT, which incorporates both creator-added content and original shooting footage reasoning. Building upon the benchmark, we conduct an in-depth analysis of the deeper factors influencing accuracy, including video processing strategies and the alignment between video features and model capabilities. We believe this benchmark will lay a solid foundation for future evaluations and advancements of MLLMs in the domain of video fake news detection.
♻ ☆ ADMN: A Layer-Wise Adaptive Multimodal Network for Dynamic Input Noise and Compute Resources
Multimodal deep learning systems are deployed in dynamic scenarios due to the robustness afforded by multiple sensing modalities. Nevertheless, they struggle with varying compute resource availability (due to multi-tenancy, device heterogeneity, etc.) and fluctuating quality of inputs (from sensor feed corruption, environmental noise, etc.). Statically provisioned multimodal systems cannot adapt when compute resources change over time, while existing dynamic networks struggle with strict compute budgets. Additionally, both systems often neglect the impact of variations in modality quality. Consequently, modalities suffering substantial corruption may needlessly consume resources better allocated towards other modalities. We propose ADMN, a layer-wise Adaptive Depth Multimodal Network capable of tackling both challenges: it adjusts the total number of active layers across all modalities to meet strict compute resource constraints and continually reallocates layers across input modalities according to their modality quality. Our evaluations showcase ADMN can match the accuracy of state-of-the-art networks while reducing up to 75% of their floating-point operations.
comment: Accepted to Neurips 2025
♻ ☆ MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness NeurIPS 2025
Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, invariance to perspective-preserving transformations, etc. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/
comment: Accepted to NeurIPS 2025 DB Track
♻ ☆ VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree NeurIPS 2025
Video anomaly detection (VAD) focuses on identifying anomalies in videos. Supervised methods demand substantial in-domain training data and fail to deliver clear explanations for anomalies. In contrast, training-free methods leverage the knowledge reserves and language interactivity of large pre-trained models to detect anomalies. However, the current fixed-length temporal window sampling approaches struggle to accurately capture anomalies with varying temporal spans. Therefore, we propose VADTree that utilizes a Hierarchical Granularityaware Tree (HGTree) structure for flexible sampling in VAD. VADTree leverages the knowledge embedded in a pre-trained Generic Event Boundary Detection (GEBD) model to characterize potential anomaly event boundaries. Specifically, VADTree decomposes the video into generic event nodes based on boundary confidence, and performs adaptive coarse-fine hierarchical structuring and redundancy removal to construct the HGTree. Then, the multi-dimensional priors are injected into the visual language models (VLMs) to enhance the node-wise anomaly perception, and anomaly reasoning for generic event nodes is achieved via large language models (LLMs). Finally, an inter-cluster node correlation method is used to integrate the multi-granularity anomaly scores. Extensive experiments on three challenging datasets demonstrate that VADTree achieves state-of-the-art performance in training-free settings while drastically reducing the number of sampled video segments. The code will be available at https://github.com/wenlongli10/VADTree.
comment: NeurIPS 2025 poster
♻ ☆ DWaste: Greener AI for Waste Sorting using Mobile and Edge Devices
The rise of convenience packaging has led to generation of enormous waste, making efficient waste sorting crucial for sustainable waste management. To address this, we developed DWaste, a computer vision-powered platform designed for real-time waste sorting on resource-constrained smartphones and edge devices, including offline functionality. We benchmarked various image classification models (EfficientNetV2S/M, ResNet50/101, MobileNet) and object detection (YOLOv8n, YOLOv11n) including our purposed YOLOv8n-CBAM model using our annotated dataset designed for recycling. We found a clear trade-off between accuracy and resource consumption: the best classifier, EfficientNetV2S, achieved high accuracy(~ 96%) but suffered from high latency (~ 0.22s) and elevated carbon emissions. In contrast, lightweight object detection models delivered strong performance (up to 80% mAP) with ultra-fast inference (~ 0.03s) and significantly smaller model sizes (< 7MB ), making them ideal for real-time, low-power use. Model quantization further maximized efficiency, substantially reducing model size and VRAM usage by up to 75%. Our work demonstrates the successful implementation of "Greener AI" models to support real-time, sustainable waste sorting on edge devices.
comment: 8 pages, 8 figures
♻ ☆ RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning
Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA), which takes advantage of existing pretrained large-scale vision and language models to directly generate captions with test-time adaptation. Specifically, we bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2, due to their source-code availability. The main challenge is how to enable the text generation model to be sufficiently aware of the content in a given video so as to generate corresponding captions. To address this problem, we propose using learnable tokens as a communication medium among these four frozen models GPT-2, XCLIP, CLIP, and AnglE. Different from the conventional way that trains these tokens with training data, we propose to learn these tokens with soft targets of the inference data under several carefully crafted loss functions, which enable the tokens to absorb video information catered for GPT-2. This procedure can be efficiently done in just a few iterations (we use 16 iterations in the experiments) and does not require ground truth data. Extensive experimental results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show absolute 5.1%-32.4% improvements in terms of the main metric CIDEr compared to several state-of-the-art zero-shot video captioning methods.
comment: Published in Pattern Recognition
♻ ☆ Frequency-Aware Vision Transformers for High-Fidelity Super-Resolution of Earth System Models
Super-resolution (SR) is crucial for enhancing the spatial fidelity of Earth System Model (ESM) outputs, allowing fine-scale structures vital to climate science to be recovered from coarse simulations. However, traditional deep super-resolution methods, including convolutional and transformer-based models, tend to exhibit spectral bias, reconstructing low-frequency content more readily than valuable high-frequency details. In this work, we introduce two frequency-aware frameworks: the Vision Transformer-Tuned Sinusoidal Implicit Representation (ViSIR), combining Vision Transformers and sinusoidal activations to mitigate spectral bias, and the Vision Transformer Fourier Representation Network (ViFOR), which integrates explicit Fourier-based filtering for independent low- and high-frequency learning. Evaluated on the E3SM-HR Earth system dataset across surface temperature, shortwave, and longwave fluxes, these models outperform leading CNN, GAN, and vanilla transformer baselines, with ViFOR demonstrating up to 2.6~dB improvements in PSNR and significantly higher SSIM. Detailed ablation and scaling studies highlight the benefit of full-field training, the impact of frequency hyperparameters, and the potential for generalization. The results establish ViFOR as a state-of-the-art, scalable solution for climate data downscaling. Future extensions will address temporal super-resolution, multimodal climate variables, automated parameter selection, and integration of physical conservation constraints to broaden scientific applicability.
♻ ☆ Polygonal network disorder and the turning distance
The turning distance is a well-studied metric for measuring the similarity between two polygons. This metric is constructed by taking an $L^p$ distance between step functions which track each shape's tangent angle of a path tracing its boundary. In this study, we introduce \textit{turning disorders} for polygonal planar networks, defined by averaging turning distances between network faces and "ordered" shapes (regular polygons or circles). We derive closed-form expressions of turning distances for special classes of regular polygons, related to the divisibility of $m$ and $n$, and also between regular polygons and circles. These formulas are used to show that the time for computing the 2-turning distances reduces to $O((m+n) \log(m+n))$ when both shapes are regular polygons, an improvement from $O(mn\log(mn))$ operations needed to compute distances between general polygons of $n$ and $m$ sides. We also apply these formulas to several examples of network microstructure with varying disorder. For Archimedean lattices, a class of regular tilings, we can express turning disorders with exact expressions. We also consider turning disorders applied to two examples of stochastic processes on networks: spring networks evolving under T1 moves and polygonal rupture processes. We find that the two aspects of defining different turning disorders, the choice of ordered shape and whether to apply area-weighting, can capture different notions of network disorder.
♻ ☆ Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing modalities of image, video, and text over 4K frames or 1M tokens while delivering advanced performances on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large language models and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallelism distributed inference and logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and texts during model inference. Regarding training data, Long-VITA is built on a mix of 17M samples from public datasets only and demonstrates state-of-the-art performance on various multi-modal benchmarks, compared against recent cutting-edge models with internal data. Long-VITA is fully open-source and reproducible.. By leveraging our inference designs, Long-VITA models achieve a remarkable 2x prefill speedup and 4x context length extension in a single node with 8 GPUs. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multi-modal understanding.
comment: https://github.com/VITA-MLLM/Long-VITA
♻ ☆ Towards Real Unsupervised Anomaly Detection Via Confident Meta-Learning ICCV2025
So-called unsupervised anomaly detection is better described as semi-supervised, as it assumes all training data are nominal. This assumption simplifies training but requires manual data curation, introducing bias and limiting adaptability. We propose Confident Meta-learning (CoMet), a novel training strategy that enables deep anomaly detection models to learn from uncurated datasets where nominal and anomalous samples coexist, eliminating the need for explicit filtering. Our approach integrates Soft Confident Learning, which assigns lower weights to low-confidence samples, and Meta-Learning, which stabilizes training by regularizing updates based on training validation loss covariance. This prevents overfitting and enhances robustness to noisy data. CoMet is model-agnostic and can be applied to any anomaly detection method trainable via gradient descent. Experiments on MVTec-AD, VIADUCT, and KSDD2 with two state-of-the-art models demonstrate the effectiveness of our approach, consistently improving over the baseline methods, remaining insensitive to anomalies in the training set, and setting a new state-of-the-art across all datasets. Code is available at https://github.com/aqeeelmirza/CoMet
comment: Accepted to IEEE/CVF International Conference on Computer Vision (ICCV2025)
♻ ☆ Superpowering Open-Vocabulary Object Detectors for X-ray Vision ICCV 2025
Open-vocabulary object detection (OvOD) is set to revolutionize security screening by enabling systems to recognize any item in X-ray scans. However, developing effective OvOD models for X-ray imaging presents unique challenges due to data scarcity and the modality gap that prevents direct adoption of RGB-based solutions. To overcome these limitations, we propose RAXO, a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. RAXO builds high-quality X-ray class descriptors using a dual-source retrieval strategy. It gathers relevant RGB images from the web and enriches them via a novel X-ray material transfer mechanism, eliminating the need for labeled databases. These visual descriptors replace text-based classification in OvOD, leveraging intra-modal feature distances for robust detection. Extensive experiments demonstrate that RAXO consistently improves OvOD performance, providing an average mAP increase of up to 17.0 points over base detectors. To further support research in this emerging field, we also introduce DET-COMPASS, a new benchmark featuring bounding box annotations for over 300 object categories, enabling large-scale evaluation of OvOD in X-ray. Code and dataset available at: https://github.com/PAGF188/RAXO.
comment: Accepted at ICCV 2025
♻ ☆ DRBD-Mamba for Robust and Efficient Brain Tumor Segmentation with Analytical Insights
Accurate brain tumor segmentation is significant for clinical diagnosis and treatment but remains challenging due to tumor heterogeneity. Mamba-based State Space Models have demonstrated promising performance. However, despite their computational efficiency over other neural architectures, they incur considerable overhead for this task due to their sequential feature computation across multiple spatial axes. Moreover, their robustness across diverse BraTS data partitions remains largely unexplored, leaving a critical gap in reliable evaluation. To address this, we first propose a dual-resolution bi-directional Mamba (DRBD-Mamba), an efficient 3D segmentation model that captures multi-scale long-range dependencies with minimal computational overhead. We leverage a space-filling curve to preserve spatial locality during 3D-to-1D feature mapping, thereby reducing reliance on computationally expensive multi-axial feature scans. To enrich feature representation, we propose a gated fusion module that adaptively integrates forward and reverse contexts, along with a quantization block that improves robustness. We further propose five systematic folds on BraTS2023 for rigorous evaluation of segmentation techniques under diverse conditions and present analysis of common failure scenarios. On the 20% test set used by recent methods, our model achieves Dice improvements of 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor. Evaluations on the proposed systematic folds demonstrate that our model maintains competitive whole tumor accuracy while achieving clear average Dice gains of 1.16% for tumor core and 1.68% for enhancing tumor over existing state-of-the-art. Furthermore, our model achieves a 15x efficiency improvement while maintaining high segmentation accuracy, highlighting its robustness and computational advantage over existing methods.
♻ ☆ PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer NeurIPS 2025
Video anomaly detection (VAD) is a critical yet challenging task due to the complex and diverse nature of real-world scenarios. Previous methods typically rely on domain-specific training data and manual adjustments when applying to new scenarios and unseen anomaly types, suffering from high labor costs and limited generalization. Therefore, we aim to achieve generalist VAD, \ie, automatically handle any scene and any anomaly types without training data or human involvement. In this work, we propose PANDA, an agentic AI engineer based on MLLMs. Specifically, we achieve PANDA by comprehensively devising four key capabilities: (1) self-adaptive scene-aware strategy planning, (2) goal-driven heuristic reasoning, (3) tool-augmented self-reflection, and (4) self-improving chain-of-memory. Concretely, we develop a self-adaptive scene-aware RAG mechanism, enabling PANDA to retrieve anomaly-specific knowledge for anomaly detection strategy planning. Next, we introduce a latent anomaly-guided heuristic prompt strategy to enhance reasoning precision. Furthermore, PANDA employs a progressive reflection mechanism alongside a suite of context-aware tools to iteratively refine decision-making in complex scenarios. Finally, a chain-of-memory mechanism enables PANDA to leverage historical experiences for continual performance improvement. Extensive experiments demonstrate that PANDA achieves state-of-the-art performance in multi-scenario, open-set, and complex scenario settings without training and manual involvement, validating its generalizable and robust anomaly detection capability. Code is released at https://github.com/showlab/PANDA.
comment: Accepted by NeurIPS 2025
♻ ☆ Mano Technical Report
Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decisionmaking capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
♻ ☆ OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions
In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task designed to produce synchronized verbal and non-verbal listener feedback online, based on the speaker's multimodal inputs. OMCRG captures natural dyadic interactions and introduces new challenges in aligning generated audio with listeners' facial responses. To tackle these challenges, we incorporate text as an intermediate modality to connect audio and facial responses. We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. OmniResponse leverages a pretrained LLM enhanced with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module that outputs speech synchronized with facial responses. To advance OMCRG research, we offer ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors. Comprehensive evaluations on ResponseNet demonstrate that OmniResponse outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality. Our dataset, code, and models are publicly available.
comment: 25 pages, 9 figures
♻ ☆ Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions auto-regressively in a fixed left-to-right order or attach separate MLP or diffusion heads outside the backbone, leading to fragmented information pathways and specialized training requirements that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a unified-transformer policy that models discretized action chunks with discrete diffusion. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary re-masking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pre-trained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves 96.3% avg. success rates on LIBERO, 71.2% visual matching on SimplerEnv-Fractal and 54.2% overall on SimplerEnv-Bridge, improving over autoregressive, MLP decoder and continuous diffusion baselines. These findings indicate that discrete-diffusion VLA supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets. Our project page is https://github.com/Liang-ZX/DiscreteDiffusionVLA
comment: 16 pages
♻ ☆ LongCat-Video Technical Report
Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include: Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model; Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence in the generation of minutes-long videos; Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions; Strong performance with multi-reward RLHF: Multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models. Code and model weights are publicly available to accelerate progress in the field.
♻ ☆ Multispectral State-Space Feature Fusion: Bridging Shared and Cross-Parametric Interactions for Object Detection
Modern multispectral feature fusion for object detection faces two critical limitations: (1) Excessive preference for local complementary features over cross-modal shared semantics adversely affects generalization performance; and (2) The trade-off between the receptive field size and computational complexity present critical bottlenecks for scalable feature modeling. Addressing these issues, a novel Multispectral State-Space Feature Fusion framework, dubbed MS2Fusion, is proposed based on the state space model (SSM), achieving efficient and effective fusion through a dual-path parametric interaction mechanism. More specifically, the first cross-parameter interaction branch inherits the advantage of cross-attention in mining complementary information with cross-modal hidden state decoding in SSM. The second shared-parameter branch explores cross-modal alignment with joint embedding to obtain cross-modal similar semantic features and structures through parameter sharing in SSM. Finally, these two paths are jointly optimized with SSM for fusing multispectral features in a unified framework, allowing our MS2Fusion to enjoy both functional complementarity and shared semantic space. In our extensive experiments on mainstream benchmarks including FLIR, M3FD and LLVIP, our MS2Fusion significantly outperforms other state-of-the-art multispectral object detection methods, evidencing its superiority. Moreover, MS2Fusion is general and applicable to other multispectral perception tasks. We show that, even without specific design, MS2Fusion achieves state-of-the-art results on RGB-T semantic segmentation and RGBT salient object detection, showing its generality. The source code will be available at https://github.com/61s61min/MS2Fusion.git.
comment: submitted on 30/4/2025, Accepted by Information Fusion
♻ ☆ DArFace: Deformation Aware Robustness for Low Quality Face Recognition
Facial recognition systems have achieved remarkable success by leveraging deep neural networks, advanced loss functions, and large-scale datasets. However, their performance often deteriorates in real-world scenarios involving low-quality facial images. Such degradations, common in surveillance footage or standoff imaging include low resolution, motion blur, and various distortions, resulting in a substantial domain gap from the high-quality data typically used during training. While existing approaches attempt to address robustness by modifying network architectures or modeling global spatial transformations, they frequently overlook local, non-rigid deformations that are inherently present in real-world settings. In this work, we introduce \textbf{DArFace}, a \textbf{D}eformation-\textbf{A}ware \textbf{r}obust \textbf{Face} recognition framework that enhances robustness to such degradations without requiring paired high- and low-quality training samples. Our method adversarially integrates both global transformations (e.g., rotation, translation) and local elastic deformations during training to simulate realistic low-quality conditions. Moreover, we introduce a contrastive objective to enforce identity consistency across different deformed views. Extensive evaluations on low-quality benchmarks including TinyFace, IJB-B, and IJB-C demonstrate that DArFace surpasses state-of-the-art methods, with significant gains attributed to the inclusion of local deformation modeling.
♻ ☆ MTFL: Multi-Timescale Feature Learning for Weakly-Supervised Anomaly Detection in Surveillance Videos
Detection of anomaly events is relevant for public safety and requires a combination of fine-grained motion information and contextual events at variable time-scales. To this end, we propose a Multi-Timescale Feature Learning (MTFL) method to enhance the representation of anomaly features. Short, medium, and long temporal tubelets are employed to extract spatio-temporal video features using a Video Swin Transformer. Experimental results demonstrate that MTFL outperforms state-of-the-art methods on the UCF-Crime dataset, achieving an anomaly detection performance 89.78% AUC. Moreover, it performs complementary to SotA with 95.32% AUC on the ShanghaiTech and 84.57% AP on the XD-Violence dataset. Furthermore, we generate an extended dataset of the UCF-Crime for development and evaluation on a wider range of anomalies, namely Video Anomaly Detection Dataset (VADD), involving 2,591 videos in 18 classes with extensive coverage of realistic anomalies.
♻ ☆ Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
This paper presents a novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary models is achievable. Our study provides the first comprehensive analysis of these models' capabilities and limits for this task.
comment: Under review. Both the dataset and code will be published
♻ ☆ Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
General world models represent a crucial pathway toward achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual environments to decision-making systems. Recently, the emergence of the Sora model has attained significant attention due to its remarkable simulation capabilities, which exhibits an incipient comprehension of physical laws. In this survey, we embark on a comprehensive exploration of the latest advancements in world models. Our analysis navigates through the forefront of generative methodologies in video generation, where world models stand as pivotal constructs facilitating the synthesis of highly realistic visual content. Additionally, we scrutinize the burgeoning field of autonomous-driving world models, meticulously delineating their indispensable role in reshaping transportation and urban mobility. Furthermore, we delve into the intricacies inherent in world models deployed within autonomous agents, shedding light on their profound significance in enabling intelligent interactions within dynamic environmental contexts. At last, we examine challenges and limitations of world models, and discuss their potential future directions. We hope this survey can serve as a foundational reference for the research community and inspire continued innovation. This survey will be regularly updated at: https://github.com/GigaAI-research/General-World-Models-Survey.
comment: This survey will be regularly updated at: https://github.com/GigaAI-research/General-World-Models-Survey
♻ ☆ Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs NeurIPS 2025
The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.
comment: Accepted by NeurIPS 2025 Dataset and Benchmark Track, Project page: https://liuxuannan.github.io/Video-SafetyBench.github.io/
♻ ☆ Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning NeurIPS 2025
Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities in classification and retrieval. However, these models often struggle with compositional reasoning - the ability to understand the relationships between concepts. A recent benchmark, SugarCrepe++, reveals that previous works on improving compositionality have mainly improved lexical sensitivity but neglected semantic understanding. In addition, downstream retrieval performance often deteriorates, although one would expect that improving compositionality should enhance retrieval. In this work, we introduce CLIC (Compositionally-aware Learning in CLIP), a fine-tuning method based on a novel training technique combining multiple images and their associated captions. CLIC improves compositionality across architectures as well as differently pre-trained CLIP models, both in terms of lexical and semantic understanding, and achieves consistent gains in retrieval performance. This even applies to the recent CLIPS, which achieves SOTA retrieval performance. Nevertheless, the short fine-tuning with CLIC leads to an improvement in retrieval and to the best compositional CLIP model on SugarCrepe++. All our models and code are available at https://clic-compositional-clip.github.io
comment: Accepted at NeurIPS 2025
♻ ☆ ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression NeurIPS 2025
The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
comment: Accepted at NeurIPS 2025 (oral)
♻ ☆ VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
Training vision-language models (VLMs) for complex reasoning remains a challenging task, i.a. due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leveraging them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD outperforms the baseline model significantly and improves over the state of the art by a margin. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
comment: www.walidbousselham.com/VOLD/
♻ ☆ FRBNet: Revisiting Low-Light Vision through Frequency-Domain Radial Basis Network
Low-light vision remains a fundamental challenge in computer vision due to severe illumination degradation, which significantly affects the performance of downstream tasks such as detection and segmentation. While recent state-of-the-art methods have improved performance through invariant feature learning modules, they still fall short due to incomplete modeling of low-light conditions. Therefore, we revisit low-light image formation and extend the classical Lambertian model to better characterize low-light conditions. By shifting our analysis to the frequency domain, we theoretically prove that the frequency-domain channel ratio can be leveraged to extract illumination-invariant features via a structured filtering process. We then propose a novel and end-to-end trainable module named \textbf{F}requency-domain \textbf{R}adial \textbf{B}asis \textbf{Net}work (\textbf{FRBNet}), which integrates the frequency-domain channel ratio operation with a learnable frequency domain filter for the overall illumination-invariant feature enhancement. As a plug-and-play module, FRBNet can be integrated into existing networks for low-light downstream tasks without modifying loss functions. Extensive experiments across various downstream tasks demonstrate that FRBNet achieves superior performance, including +2.2 mAP for dark object detection and +2.9 mIoU for nighttime segmentation. Code is available at: https://github.com/Sing-Forevet/FRBNet.
♻ ☆ Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation NeurIPS 2025
Recent progress in Sign Language Translation (SLT) has focussed primarily on improving the representational capacity of large language models to incorporate Sign Language features. This work explores an alternative direction: enhancing the geometric properties of skeletal representations themselves. We propose Geo-Sign, a method that leverages the properties of hyperbolic geometry to model the hierarchical structure inherent in sign language kinematics. By projecting skeletal features derived from Spatio-Temporal Graph Convolutional Networks (ST-GCNs) into the Poincar\'e ball model, we aim to create more discriminative embeddings, particularly for fine-grained motions like finger articulations. We introduce a hyperbolic projection layer, a weighted Fr\'echet mean aggregation scheme, and a geometric contrastive loss operating directly in hyperbolic space. These components are integrated into an end-to-end translation framework as a regularisation function, to enhance the representations within the language model. This work demonstrates the potential of hyperbolic geometry to improve skeletal representations for Sign Language Translation, improving on SOTA RGB methods while preserving privacy and improving computational efficiency. Code available here: https://github.com/ed-fish/geo-sign.
comment: Accepted to NeurIPS 2025
♻ ☆ Federated Learning with Partially Labeled Data: A Conditional Distillation Approach
In medical imaging, developing generalized segmentation models that can handle multiple organs and lesions is crucial. However, the scarcity of fully annotated datasets and strict privacy regulations present significant barriers to data sharing. Federated Learning (FL) allows decentralized model training, but existing FL methods often struggle with partial labeling, leading to model divergence and catastrophic forgetting. We propose ConDistFL, a novel FL framework incorporating conditional distillation to address these challenges. ConDistFL enables effective learning from partially labeled datasets, significantly improving segmentation accuracy across distributed and non-uniform datasets. In addition to its superior segmentation performance, ConDistFL maintains computational and communication efficiency, ensuring its scalability for real-world applications. Furthermore, ConDistFL demonstrates remarkable generalizability, significantly outperforming existing FL methods in out-of-federation tests, even adapting to unseen contrast phases (e.g., non-contrast CT images) in our experiments. Extensive evaluations on 3D CT and 2D chest X-ray datasets show that ConDistFL is an efficient, adaptable solution for collaborative medical image segmentation in privacy-constrained settings.
comment: This manuscript was submitted to IEEE JBHI and is currently under peer review
♻ ☆ Through the Lens: Benchmarking Deepfake Detectors Against Moiré-Induced Distortions
Deepfake detection remains a pressing challenge, particularly in real-world settings where smartphone-captured media from digital screens often introduces Moir\'e artifacts that can distort detection outcomes. This study systematically evaluates state-of-the-art (SOTA) deepfake detectors on Moir\'e-affected videos, an issue that has received little attention. We collected a dataset of 12,832 videos, spanning 35.64 hours, from the Celeb-DF, DFD, DFDC, UADFV, and FF++ datasets, capturing footage under diverse real-world conditions, including varying screens, smartphones, lighting setups, and camera angles. To further examine the influence of Moir\'e patterns on deepfake detection, we conducted additional experiments using our DeepMoir\'eFake, referred to as (DMF) dataset and two synthetic Moir\'e generation techniques. Across 15 top-performing detectors, our results show that Moir\'e artifacts degrade performance by as much as 25.4%, while synthetically generated Moir\'e patterns lead to a 21.4% drop in accuracy. Surprisingly, demoir\'eing methods, intended as a mitigation approach, instead worsened the problem, reducing accuracy by up to 17.2%. These findings underscore the urgent need for detection models that can robustly handle Moir\'e distortions alongside other realworld challenges, such as compression, sharpening, and blurring. By introducing the DMF dataset, we aim to drive future research toward closing the gap between controlled experiments and practical deepfake detection.
comment: 48 Pages, 29 Figures, 15 Tables
♻ ☆ On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
In Vision-Language-Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find (1) actions as the most fragile modality, (2) Existing visual-robust VLA do not gain robustness in other modality, and (3) pi0 demonstrates superior robustness with a diffusion-based action head. To build multi-modal robust VLAs, we propose RobustVLA against perturbations in VLA inputs and outputs. For output robustness, we perform offline robust optimization against worst-case action noise that maximizes mismatch in flow matching objective. This can be seen as adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper confidence bound algorithm to automatically identify the most harmful noise. Experiments on LIBERO demonstrate our RobustVLA delivers absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone across all 17 perturbations, achieving 50.6x faster inference than existing visual-robust VLAs, and a 10.4% gain under mixed perturbations. Our RobustVLA is particularly effective on real-world FR5 robot with limited demonstrations, showing absolute gains by 65.6% under perturbations of four modalities.
♻ ☆ CPathAgent: An Agent-based Foundation Model for Interpretable High-Resolution Pathology Image Analysis Mimicking Pathologists' Diagnostic Logic
Recent advances in computational pathology have led to the emergence of numerous foundation models. These models typically rely on general-purpose encoders with multi-instance learning for whole slide image (WSI) classification or apply multimodal approaches to generate reports directly from images. However, these models cannot emulate the diagnostic approach of pathologists, who systematically examine slides at low magnification to obtain an overview before progressively zooming in on suspicious regions to formulate comprehensive diagnoses. Instead, existing models directly output final diagnoses without revealing the underlying reasoning process. To address this gap, we introduce CPathAgent, an innovative agent-based approach that mimics pathologists' diagnostic workflow by autonomously navigating across WSI based on observed visual features, thereby generating substantially more transparent and interpretable diagnostic summaries. To achieve this, we develop a multi-stage training strategy that unifies patch-level, region-level, and WSI-level capabilities within a single model, which is essential for replicating how pathologists understand and reason across diverse image scales. Additionally, we construct PathMMU-HR2, the first expert-validated benchmark for large region analysis. This represents a critical intermediate scale between patches and whole slides, reflecting a key clinical reality where pathologists typically examine several key large regions rather than entire slides at once. Extensive experiments demonstrate that CPathAgent consistently outperforms existing approaches across benchmarks at three different image scales, validating the effectiveness of our agent-based diagnostic approach and highlighting a promising direction for computational pathology.
comment: 52 pages, 34 figures
♻ ☆ Is It Certainly a Deepfake? Reliability Analysis in Detection & Generation Ecosystem ICCV 2025
As generative models are advancing in quality and quantity for creating synthetic content, deepfakes begin to cause online mistrust. Deepfake detectors are proposed to counter this effect, however, misuse of detectors claiming fake content as real or vice versa further fuels this misinformation problem. We present the first comprehensive uncertainty analysis of deepfake detectors, systematically investigating how generative artifacts influence prediction confidence. As reflected in detectors' responses, deepfake generators also contribute to this uncertainty as their generative residues vary, so we cross the uncertainty analysis of deepfake detectors and generators. Based on our observations, the uncertainty manifold holds enough consistent information to leverage uncertainty for deepfake source detection. Our approach leverages Bayesian Neural Networks and Monte Carlo dropout to quantify both aleatoric and epistemic uncertainties across diverse detector architectures. We evaluate uncertainty on two datasets with nine generators, with four blind and two biological detectors, compare different uncertainty methods, explore region- and pixel-based uncertainty, and conduct ablation studies. We conduct and analyze binary real/fake, multi-class real/fake, source detection, and leave-one-out experiments between the generator/detector combinations to share their generalization capability, model calibration, uncertainty, and robustness against adversarial attacks. We further introduce uncertainty maps that localize prediction confidence at the pixel level, revealing distinct patterns correlated with generator-specific artifacts. Our analysis provides critical insights for deploying reliable deepfake detection systems and establishes uncertainty quantification as a fundamental requirement for trustworthy synthetic media detection.
comment: Accepted for publication at the ICCV 2025 workshop - STREAM
♻ ☆ AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300\,k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o\'s content scores by 45\% and style scores by 12\%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.
♻ ☆ Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views NeurIPS 2025
We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
comment: 10 pages, 6 figures, 2 tables. Accepted to the NeurIPS 2025 Workshop on SPACE in Vision, Language, and Embodied AI (SpaVLE). Dataset: https://huggingface.co/datasets/annadeichler/KTH-ARIA-referential
♻ ☆ A Generalized Label Shift Perspective for Cross-Domain Gaze Estimation NeurIPS 2025
Aiming to generalize the well-trained gaze estimation model to new target domains, Cross-domain Gaze Estimation (CDGE) is developed for real-world application scenarios. Existing CDGE methods typically extract the domain-invariant features to mitigate domain shift in feature space, which is proved insufficient by Generalized Label Shift (GLS) theory. In this paper, we introduce a novel GLS perspective to CDGE and modelize the cross-domain problem by label and conditional shift problem. A GLS correction framework is presented and a feasible realization is proposed, in which a importance reweighting strategy based on truncated Gaussian distribution is introduced to overcome the continuity challenges in label shift correction. To embed the reweighted source distribution to conditional invariant learning, we further derive a probability-aware estimation of conditional operator discrepancy. Extensive experiments on standard CDGE tasks with different backbone models validate the superior generalization capability across domain and applicability on various models of proposed method.
comment: NeurIPS 2025
♻ ☆ Acoustic Neural 3D Reconstruction Under Pose Drift IROS
We consider the problem of optimizing neural implicit surfaces for 3D reconstruction using acoustic images collected with drifting sensor poses. The accuracy of current state-of-the-art 3D acoustic modeling algorithms is highly dependent on accurate pose estimation; small errors in sensor pose can lead to severe reconstruction artifacts. In this paper, we propose an algorithm that jointly optimizes the neural scene representation and sonar poses. Our algorithm does so by parameterizing the 6DoF poses as learnable parameters and backpropagating gradients through the neural renderer and implicit representation. We validated our algorithm on both real and simulated datasets. It produces high-fidelity 3D reconstructions even under significant pose drift.
comment: 8 pages, 8 figures. This paper is accepted by 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
♻ ☆ Stealthy Patch-Wise Backdoor Attack in 3D Point Cloud via Curvature Awareness
Backdoor attacks pose a severe threat to deep neural networks (DNNs) by implanting hidden backdoors that can be activated with predefined triggers to manipulate model behaviors maliciously. Existing 3D point cloud backdoor attacks primarily rely on sample-wise global modifications, which suffer from low imperceptibility. Although optimization can improve stealthiness, optimizing sample-wise triggers significantly increases computational cost. To address these limitations, we propose the Stealthy Patch-Wise Backdoor Attack (SPBA), the first patch-wise backdoor attack framework for 3D point clouds. Specifically, SPBA decomposes point clouds into local patches and employs a curvature-based imperceptibility score to guide trigger injection into visually less sensitive patches. By optimizing a unified patch-wise trigger that perturbs spectral features of selected patches, SPBA significantly enhances optimization efficiency while maintaining high stealthiness. Extensive experiments on ModelNet40 and ShapeNetPart further demonstrate that SPBA surpasses prior state-of-the-art backdoor attacks in both attack effectiveness and resistance to defense methods. The code is available at https://github.com/HazardFY/SPBA.
comment: 13 pages, 6 figures, 11 tables
♻ ☆ Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images ICCV 2025
Anomaly detection in computational pathology aims to identify rare and scarce anomalies where disease-related data are often limited or missing. Existing anomaly detection methods, primarily designed for industrial settings, face limitations in pathology due to computational constraints, diverse tissue structures, and lack of interpretability. To address these challenges, we propose Ano-NAViLa, a Normal and Abnormal pathology knowledge-augmented Vision-Language model for Anomaly detection in pathology images. Ano-NAViLa is built on a pre-trained vision-language model with a lightweight trainable MLP. By incorporating both normal and abnormal pathology knowledge, Ano-NAViLa enhances accuracy and robustness to variability in pathology images and provides interpretability through image-text associations. Evaluated on two lymph node datasets from different organs, Ano-NAViLa achieves the state-of-the-art performance in anomaly detection and localization, outperforming competing models.
comment: Accepted to ICCV 2025. Code is available at: https://github.com/QuIIL/ICCV2025_Ano-NAViLa
♻ ☆ Does CLIP perceive art the same way we do?
CLIP has emerged as a powerful multimodal model capable of connecting images and text through joint embeddings, but to what extent does it 'see' the same way humans do - especially when interpreting artworks? In this paper, we investigate CLIP's ability to extract high-level semantic and stylistic information from paintings, including both human-created and AI-generated imagery. We evaluate its perception across multiple dimensions: content, scene understanding, artistic style, historical period, and the presence of visual deformations or artifacts. By designing targeted probing tasks and comparing CLIP's responses to human annotations and expert benchmarks, we explore its alignment with human perceptual and contextual understanding. Our findings reveal both strengths and limitations in CLIP's visual representations, particularly in relation to aesthetic cues and artistic intent. We further discuss the implications of these insights for using CLIP as a guidance mechanism during generative processes, such as style transfer or prompt-based image synthesis. Our work highlights the need for deeper interpretability in multimodal systems, especially when applied to creative domains where nuance and subjectivity play a central role.
♻ ☆ Caption-Driven Explainability: Probing CNNs for Bias via CLIP ICIP 2025
Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the models prediction. This explanation minimizes the risk of the standalone model falling for a covariate shift and contributes significantly towards developing robust ML models. Our code is available at .
comment: Accepted and presented at the IEEE ICIP 2025 Satellite Workshop "Generative AI for World Simulations and Communications & Celebrating 40 Years of Excellence in Education: Honoring Professor Aggelos Katsaggelos", Anchorage, Alaska, USA, September 14, 2025. Camera-ready preprint; the official IEEE Xplore publication will follow. Code is available at
♻ ☆ CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects
Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches for personalizing text-to-video generation suffer from tackling multiple subjects, which is a more challenging and practical scenario. In this work, our aim is to promote multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects. To be specific, firstly, we encourage the co-occurrence of multiple subjects via composing them in a single image. Further, upon a basic text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of diffusion model. Moreover, to help the model focus on the specific area of the object, we segment the object from given reference images and provide a corresponding object mask for attention learning. Also, we collect a multi-subject text-to-video generation dataset as a comprehensive benchmark. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method compared to previous state-of-the-art approaches. The project page is https://kyfafyd.wang/projects/customvideo.
comment: IEEE TMM 2025
♻ ☆ LiDAR Remote Sensing Meets Weak Supervision: Concepts, Methods, and Perspectives
Light detection and ranging (LiDAR) remote sensing encompasses two major directions: data interpretation and parameter inversion. However, both directions rely heavily on costly and labor-intensive labeled data and field measurements, which constrains their scalability and spatiotemporal adaptability. Weakly Supervised Learning (WSL) provides a unified framework to address these limitations. This paper departs from the traditional view that treats interpretation and inversion as separate tasks and offers a systematic review of recent advances in LiDAR remote sensing from a unified WSL perspective. We cover typical WSL settings including incomplete supervision(e.g., sparse point labels), inexact supervision (e.g., scene-level tags), inaccurate supervision (e.g., noisy labels), and cross-domain supervision (e.g., domain adaptation/generalization) and corresponding techniques such as pseudo-labeling, consistency regularization, self-training, and label refinement, which collectively enable robust learning from limited and weak annotations.We further analyze LiDAR-specific challenges (e.g., irregular geometry, data sparsity, domain heterogeneity) that require tailored weak supervision, and examine how sparse LiDAR observations can guide joint learning with other remote-sensing data for continuous surface-parameter retrieval. Finally, we highlight future directions where WSL acts as a bridge between LiDAR and foundation models to leverage large-scale multimodal datasets and reduce labeling costs, while also enabling broader WSL-driven advances in generalization, open-world adaptation, and scalable LiDAR remote sensing.
♻ ☆ Real-Time Neural Video Compression with Unified Intra and Inter Coding
Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 10.7\% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.
comment: 10 pages
♻ ☆ UMCFuse: A Unified Multiple Complex Scenes Infrared and Visible Image Fusion Framework
Infrared and visible image fusion has emerged as a prominent research area in computer vision. However, little attention has been paid to the fusion task in complex scenes, leading to sub-optimal results under interference. To fill this gap, we propose a unified framework for infrared and visible images fusion in complex scenes, termed UMCFuse. Specifically, we classify the pixels of visible images from the degree of scattering of light transmission, allowing us to separate fine details from overall intensity. Maintaining a balance between interference removal and detail preservation is essential for the generalization capacity of the proposed method. Therefore, we propose an adaptive denoising strategy for the fusion of detail layers. Meanwhile, we fuse the energy features from different modalities by analyzing them from multiple directions. Extensive fusion experiments on real and synthetic complex scenes datasets cover adverse weather conditions, noise, blur, overexposure, fire, as well as downstream tasks including semantic segmentation, object detection, salient object detection, and depth estimation, consistently indicate the superiority of the proposed method compared with the recent representative methods. Our code is available at https://github.com/ixilai/UMCFuse.
comment: Published in IEEE-TIP 2025
♻ ☆ GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning ACM MM 2025
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multi-modal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model outputs. To address these limitations, this work first proposes a Region-Aware Multimodal Chain-of-Thought (RMCoT) dataset, in which the process of producing an answer is preceded by a sequence of intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposal. The dataset is available at https://www.med-vqa.com/GEMeX/.
comment: Accepted at ACM MM 2025 (also known as GEMeX-ThinkVG)
♻ ☆ Global urban visual perception varies across demographics and personalities
Understanding people's preferences is crucial for urban planning, yet current approaches often combine responses from multi-cultural populations, obscuring demographic differences and risking amplifying biases. We conducted a largescale urban visual perception survey of streetscapes worldwide using street view imagery, examining how demographics -- including gender, age, income, education, race and ethnicity, and personality traits -- shape perceptions among 1,000 participants with balanced demographics from five countries and 45 nationalities. This dataset, Street Perception Evaluation Considering Socioeconomics (SPECS), reveals demographic- and personality-based differences across six traditional indicators -- safe, lively, wealthy, beautiful, boring, depressing -- and four new ones -- live nearby, walk, cycle, green. Location-based sentiments further shape these preferences. Machine learning models trained on existing global datasets tend to overestimate positive indicators and underestimate negative ones compared to human responses, underscoring the need for local context. Our study aspires to rectify the myopic treatment of street perception, which rarely considers demographics or personality traits.
♻ ☆ Task-Agnostic Fusion of Time Series and Imagery for Earth Observation
We propose a task-agnostic framework for multimodal fusion of time series and single timestamp images, enabling cross-modal generation and robust downstream performance. Our approach explores deterministic and learned strategies for time series quantization and then leverages a masked correlation learning objective, aligning discrete image and time series tokens in a unified representation space. Instantiated in the Earth observation domain, the pretrained model generates consistent global temperature profiles from satellite imagery and is validated through counterfactual experiments. Across downstream tasks, our task-agnostic pretraining outperforms task-specific fusion by 6% in R^2 and 2% in RMSE on average, and exceeds baseline methods by 50\% in R$^2$ and 12\% in RMSE. Finally, we analyze gradient sensitivity across modalities, providing insights into model robustness. Code, data, and weights will be released under a permissive license.
♻ ☆ GRASP: Geospatial pixel Reasoning viA Structured Policy learning
Geospatial pixel reasoning aims to generate segmentation masks in remote sensing imagery directly from natural-language instructions. Most existing approaches follow a paradigm that fine-tunes multimodal large language models under supervision with dense pixel-level masks as ground truth. While effective within the training data distribution, this design suffers from two main drawbacks: (1) the high cost of large-scale dense mask annotation, and (2) the limited generalization capability of supervised fine-tuning in out-of-domain scenarios. To address these issues, we propose GRASP, a structured policy-learning framework that integrates a multimodal large language model with a pretrained segmentation model in a cascaded manner. To enhance generalization, we introduce PRIME, a training paradigm that replaces supervised fine-tuning with reinforcement learning to better align reasoning and grounding behaviors with task objectives. To reduce annotation costs, we design BoP-Rewards, which substitutes dense mask labels with bounding box and positive points. It further verifies outputs through two complementary signals: format, which constrains the reasoning and grounding structure to remain syntactically parsable, and accuracy, which evaluates the quality of predicted boxes and points. For evaluation, we train our method and all baselines on EarthReason and GeoPixInstruct, constructing an in-domain benchmark by merging their test sets. We further release GRASP-1k, a fully out-of-domain benchmark with reasoning-intensive queries, reasoning traces, and fine-grained masks. Experimental results demonstrate state-of-the-art (SOTA) in-domain performance and up to 54\% improvement in out-of-domain scenarios, confirming that reinforcement learning with cost-aware rewards provides a robust and scalable paradigm for geospatial pixel reasoning. All code and datasets will be released publicly.
comment: 15 pages, 9 figures
♻ ☆ Switchable Token-Specific Codebook Quantization For Face Image Compression NeurIPS 2025
With the ever-increasing volume of visual data, the efficient and lossless transmission, along with its subsequent interpretation and understanding, has become a critical bottleneck in modern information systems. The emerged codebook-based solution utilize a globally shared codebook to quantize and dequantize each token, controlling the bpp by adjusting the number of tokens or the codebook size. However, for facial images, which are rich in attributes, such global codebook strategies overlook both the category-specific correlations within images and the semantic differences among tokens, resulting in suboptimal performance, especially at low bpp. Motivated by these observations, we propose a Switchable Token-Specific Codebook Quantization for face image compression, which learns distinct codebook groups for different image categories and assigns an independent codebook to each token. By recording the codebook group to which each token belongs with a small number of bits, our method can reduce the loss incurred when decreasing the size of each codebook group. This enables a larger total number of codebooks under a lower overall bpp, thereby enhancing the expressive capability and improving reconstruction performance. Owing to its generalizable design, our method can be integrated into any existing codebook-based representation learning approach and has demonstrated its effectiveness on face recognition datasets, achieving an average accuracy of 93.51% for reconstructed images at 0.05 bpp.
comment: NeurIPS 2025 accepted
♻ ☆ Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation
Referring video object segmentation (RVOS) aims to identify, track and segment the objects in a video based on language descriptions, which has received great attention in recent years. However, existing datasets remain focus on short video clips within several seconds, with salient objects visible in most frames. To advance the task towards more practical scenarios, we introduce \textbf{Long-RVOS}, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos of an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance-reappearance and shot changing. The objects are manually annotated with three different types of descriptions to individually evaluate the understanding of static attributes, motion patterns and spatiotemporal relationships. Moreover, unlike previous benchmarks that rely solely on the per-frame spatial evaluation, we introduce two new metrics to assess the temporal and spatiotemporal consistency. We benchmark 6 state-of-the-art methods on Long-RVOS. The results show that current approaches struggle severely with the long-video challenges. To address this, we further propose ReferMo, a promising baseline method that integrates motion information to expand the temporal receptive field, and employs a local-to-global architecture to capture both short-term dynamics and long-term dependencies. Despite simplicity, ReferMo achieves significant improvements over current methods in long-term scenarios. We hope that Long-RVOS and our baseline can drive future RVOS research towards tackling more realistic and long-form videos.
comment: Project Page: \url{https://isee-laboratory.github.io/Long-RVOS}
♻ ☆ GS4: Generalizable Sparse Splatting Semantic SLAM
Traditional SLAM algorithms excel at camera tracking, but typically produce incomplete and low-resolution maps that are not tightly integrated with semantics prediction. Recent work integrates Gaussian Splatting (GS) into SLAM to enable dense, photorealistic 3D mapping, yet existing GS-based SLAM methods require per-scene optimization that is slow and consumes an excessive number of Gaussians. We present GS4, the first generalizable GS-based semantic SLAM system. Compared with prior approaches, GS4 runs 10x faster, uses 10x fewer Gaussians, and achieves state-of-the-art performance across color, depth, semantic mapping and camera tracking. From an RGB-D video stream, GS4 incrementally builds and updates a set of 3D Gaussians using a feed-forward network. First, the Gaussian Prediction Model estimates a sparse set of Gaussian parameters from input frame, which integrates both color and semantic prediction with the same backbone. Then, the Gaussian Refinement Network merges new Gaussians with the existing set while avoiding redundancy. Finally, we propose to optimize GS for only 1-5 iterations that corrects drift and floaters when significant pose changes are detected. Experiments on the real-world ScanNet and ScanNet++ benchmarks demonstrate state-of-the-art semantic SLAM performance, with strong generalization capability shown through zero-shot transfer to the NYUv2 and TUM RGB-D datasets.
comment: 17 pages, 6 figures
♻ ☆ Bridging the gap to real-world language-grounded visual concept learning
Human intelligence effortlessly interprets visual scenes along a rich spectrum of semantic dimensions. However, existing approaches to language-grounded visual concept learning are limited to a few predefined primitive axes, such as color and shape, and are typically explored in synthetic datasets. In this work, we propose a scalable framework that adaptively identifies image-related concept axes and grounds visual concepts along these axes in real-world scenes. Leveraging a pretrained vision-language model and our universal prompting strategy, our framework identifies a diverse image-related axes without any prior knowledge. Our universal concept encoder adaptively binds visual features to the discovered axes without introducing additional model parameters for each concept. To ground visual concepts along the discovered axes, we optimize a compositional anchoring objective, which ensures that each axis can be independently manipulated without affecting others. We demonstrate the effectiveness of our framework on subsets of ImageNet, CelebA-HQ, and AFHQ, showcasing superior editing capabilities across diverse real-world concepts that are too varied to be manually predefined. Our method also exhibits strong compositional generalization, outperforming existing visual concept learning and text-based editing methods. The code is available at https://github.com/whieya/Language-grounded-VCL.
♻ ☆ CalFuse: Multi-Modal Continual Learning via Feature Calibration and Parameter Fusion
With the proliferation of multi-modal data in large-scale visual recognition systems, enabling models to continuously acquire knowledge from evolving data streams while preserving prior information has become increasingly critical. Class-Continual Learning (CCL) addresses this challenge by incrementally incorporating new class knowledge without revisiting historical data, making it essential for real-world big data applications. While traditional CCL methods rely solely on visual features, recent advances in Vision-Language Models (VLMs) such as CLIP demonstrate significant potential for CCL by leveraging pre-trained multi-modal knowledge. However, existing approaches face challenges in mitigating catastrophic forgetting while maintaining the cross-modal generalization capabilities of VLMs. To address these limitations, we propose CalFuse, a framework that synergizes feature Calibration with parameter Fusion to enable effective multi-modal knowledge integration in continual learning scenarios. CalFuse introduces a dynamic feature calibration mechanism that adaptively balances original CLIP visual representations with task-specific features, preserving the model's intrinsic cross-modal generalization while adapting to new classes. Concurrently, a QR decomposition-based parameter fusion strategy progressively integrates newly acquired knowledge with historical task parameters, maintaining equilibrium between learning new class representations and retaining prior knowledge across sequential tasks. Extensive experiments on benchmark datasets validate the effectiveness of our approach in large-scale multi-modal continual learning settings, demonstrating superior performance over state-of-the-art methods in both average accuracy and final task retention.
♻ ☆ MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning
Remote sensing images contain complex spatial patterns and semantic structures, which makes the captioning model difficult to accurately describe. Encoder-decoder architectures have become the widely used approach for RSIC by translating visual content into descriptive text. However, many existing methods rely on a single-stream architecture, which weakens the model to accurately describe the image. Such single-stream architectures typically struggle to extract diverse spatial features or capture complex semantic relationships, limiting their effectiveness in scenes with high intraclass similarity or contextual ambiguity. In this work, we propose a novel Multi-stream Encoder-decoder Framework (MsEdF) which improves the performance of RSIC by optimizing both the spatial representation and language generation of encoder-decoder architecture. The encoder fuses information from two complementary image encoders, thereby promoting feature diversity through the integration of multiscale and structurally distinct cues. To improve the capture of context-aware descriptions, we refine the input sequence's semantic modeling on the decoder side using a stacked GRU architecture with an element-wise aggregation scheme. Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.
♻ ☆ MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs
Video large language models (Video-LLMs) have made significant progress in understanding videos. However, processing multiple frames leads to lengthy visual token sequences, presenting challenges such as the limited context length cannot accommodate the entire video, and the inclusion of irrelevant frames hinders visual perception. Hence, effective frame selection is crucial. This paper emphasizes that frame selection should follow three key principles: query relevance, list-wise diversity, and sequentiality. Existing methods, such as uniform frame sampling and query-frame matching, do not capture all of these principles. Thus, we propose Markov decision determinantal point process with dynamic programming (MDP3) for frame selection, a training-free and model-agnostic method that can be seamlessly integrated into existing Video-LLMs. Our method first estimates frame similarities conditioned on the query using a conditional Gaussian kernel within the reproducing kernel Hilbert space~(RKHS). We then apply the determinantal point process~(DPP) to the similarity matrix to capture both query relevance and list-wise diversity. To incorporate sequentiality, we segment the video and apply DPP within each segment, conditioned on the preceding segment selection, modeled as a Markov decision process~(MDP) for allocating selection sizes across segments. Theoretically, MDP3 provides a \((1 - 1/e)\)-approximate solution to the NP-hard list-wise frame selection problem with pseudo-polynomial time complexity, demonstrating its efficiency. Empirically, MDP3 significantly outperforms existing methods, verifying its effectiveness and robustness.
comment: 26 pages, 14 figures
♻ ☆ IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model's capacity and limiting adaptability to downstream tasks. In this paper, we propose InstanceGrounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering through only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations with a novel data curation pipeline.
comment: https://github.com/lifuguan/IGGT_official
♻ ☆ VSA: Faster Video Diffusion with Trainable Sparse Attention
Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at \emph{both} training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight \emph{critical tokens}; a fine stage computes token-level attention only inside those tiles subjecting to block computing layout to ensure hard efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85\% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53$\times$ with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6$\times$ and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models. Code will be available at https://github.com/hao-ai-lab/FastVideo.
comment: Accepted by Neurips 2025
♻ ☆ Riemannian-Geometric Fingerprints of Generative Models ICCV 2025
Recent breakthroughs and rapid integration of generative models (GMs) have sparked interest in the problem of model attribution and their fingerprints. For instance, service providers need reliable methods of authenticating their models to protect their IP, while users and law enforcement seek to verify the source of generated content for accountability and trust. In addition, a growing threat of model collapse is arising, as more model-generated data are being fed back into sources (e.g., YouTube) that are often harvested for training ("regurgitative training"), heightening the need to differentiate synthetic from human data. Yet, a gap still exists in understanding generative models' fingerprints, we believe, stemming from the lack of a formal framework that can define, represent, and analyze the fingerprints in a principled way. To address this gap, we take a geometric approach and propose a new definition of artifact and fingerprint of GMs using Riemannian geometry, which allows us to leverage the rich theory of differential geometry. Our new definition generalizes previous work (Song et al., 2024) to non-Euclidean manifolds by learning Riemannian metrics from data and replacing the Euclidean distances and nearest-neighbor search with geodesic distances and kNN-based Riemannian center of mass. We apply our theory to a new gradient-based algorithm for computing the fingerprints in practice. Results show that it is more effective in distinguishing a large array of GMs, spanning across 4 different datasets in 2 different resolutions (64 by 64, 256 by 256), 27 model architectures, and 2 modalities (Vision, Vision-Language). Using our proposed definition significantly improves the performance on model attribution, as well as a generalization to unseen datasets, model types, and modalities, suggesting its practical efficacy.
comment: ICCV 2025 Highlight paper
♻ ☆ InstanceAssemble: Layout-Aware Image Generation via Instance Assembling Attention NeurIPS 2025
Diffusion models have demonstrated remarkable capabilities in generating high-quality images. Recent advancements in Layout-to-Image (L2I) generation have leveraged positional conditions and textual descriptions to facilitate precise and controllable image synthesis. Despite overall progress, current L2I methods still exhibit suboptimal performance. Therefore, we propose InstanceAssemble, a novel architecture that incorporates layout conditions via instance-assembling attention, enabling position control with bounding boxes (bbox) and multimodal content control including texts and additional visual content. Our method achieves flexible adaption to existing DiT-based T2I models through light-weighted LoRA modules. Additionally, we propose a Layout-to-Image benchmark, Denselayout, a comprehensive benchmark for layout-to-image generation, containing 5k images with 90k instances in total. We further introduce Layout Grounding Score (LGS), an interpretable evaluation metric to more precisely assess the accuracy of L2I generation. Experiments demonstrate that our InstanceAssemble method achieves state-of-the-art performance under complex layout conditions, while exhibiting strong compatibility with diverse style LoRA modules. The code and pretrained models are publicly available at https://github.com/FireRedTeam/InstanceAssemble.
comment: Accepted in NeurIPS 2025
♻ ☆ MoPFormer: Motion-Primitive Transformer for Wearable-Sensor Activity Recognition NeurIPS 2025
Human Activity Recognition (HAR) with wearable sensors is challenged by limited interpretability, which significantly impacts cross-dataset generalization. To address this challenge, we propose Motion-Primitive Transformer (MoPFormer), a novel self-supervised framework that enhances interpretability by tokenizing inertial measurement unit signals into semantically meaningful motion primitives and leverages a Transformer architecture to learn rich temporal representations. MoPFormer comprises two stages. The first stage is to partition multi-channel sensor streams into short segments and quantize them into discrete ``motion primitive'' codewords, while the second stage enriches those tokenized sequences through a context-aware embedding module and then processes them with a Transformer encoder. The proposed MoPFormer can be pre-trained using a masked motion-modeling objective that reconstructs missing primitives, enabling it to develop robust representations across diverse sensor configurations. Experiments on six HAR benchmarks demonstrate that MoPFormer not only outperforms state-of-the-art methods but also successfully generalizes across multiple datasets. More importantly, the learned motion primitives significantly enhance both interpretability and cross-dataset performance by capturing fundamental movement patterns that remain consistent across similar activities, regardless of dataset origin.
comment: Accepted by NeurIPS 2025
♻ ☆ Learning to See and Act: Task-Aware View Planning for Robotic Manipulation
Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-Aware View Planning (TAVP), a framework designed to overcome these challenges by integrating active view planning with task-specific representation learning. TAVP employs an efficient exploration policy, accelerated by a novel pseudo-environment, to actively acquire informative views. Furthermore, we introduce a Mixture-of-Experts (MoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization. By learning to see the world in a task-aware way, TAVP generates more complete and discriminative visual representations, demonstrating significantly enhanced action prediction across a wide array of manipulation challenges. Extensive experiments on RLBench tasks show that our proposed TAVP model achieves superior performance over state-of-the-art fixed-view approaches. Visual results and code are provided at: https://hcplab-sysu.github.io/TAVP.
comment: 14 pages, 8 figures, project page: https://hcplab-sysu.github.io/TAVP
♻ ☆ Unveiling Concept Attribution in Diffusion Models
Diffusion models have shown remarkable abilities in generating realistic and high-quality images from text prompts. However, a trained model remains largely black-box; little do we know about the roles of its components in exhibiting a concept such as objects or styles. Recent works employ causal tracing to localize knowledge-storing layers in generative models without showing how other layers contribute to the target concept. In this work, we approach diffusion models' interpretability problem from a more general perspective and pose a question: \textit{``How do model components work jointly to demonstrate knowledge?''}. To answer this question, we decompose diffusion models using component attribution, systematically unveiling the importance of each component (specifically the model parameter) in generating a concept. The proposed framework, called \textbf{C}omponent \textbf{A}ttribution for \textbf{D}iffusion Model (CAD), discovers the localization of concept-inducing (positive) components, while interestingly uncovers another type of components that contribute negatively to generating a concept, which is missing in the previous knowledge localization work. Based on this holistic understanding of diffusion models, we introduce two fast, inference-time model editing algorithms, CAD-Erase and CAD-Amplify; in particular, CAD-Erase enables erasure and CAD-Amplify allows amplification of a generated concept by ablating the positive and negative components, respectively, while retaining knowledge of other concepts. Extensive experimental results validate the significance of both positive and negative components pinpointed by our framework, demonstrating the potential of providing a complete view of interpreting generative models. Our code is available \href{https://github.com/mail-research/CAD-attribution4diffusion}{here}.
♻ ☆ TraceTrans: Translation and Spatial Tracing for Surgical Prediction
Image-to-image translation models have achieved notable success in converting images across visual domains and are increasingly used for medical tasks such as predicting post-operative outcomes and modeling disease progression. However, most existing methods primarily aim to match the target distribution and often neglect spatial correspondences between the source and translated images. This limitation can lead to structural inconsistencies and hallucinations, undermining the reliability and interpretability of the predictions. These challenges are accentuated in clinical applications by the stringent requirement for anatomical accuracy. In this work, we present TraceTrans, a novel deformable image translation model designed for post-operative prediction that generates images aligned with the target distribution while explicitly revealing spatial correspondences with the pre-operative input. The framework employs an encoder for feature extraction and dual decoders for predicting spatial deformations and synthesizing the translated image. The predicted deformation field imposes spatial constraints on the generated output, ensuring anatomical consistency with the source. Extensive experiments on medical cosmetology and brain MRI datasets demonstrate that TraceTrans delivers accurate and interpretable post-operative predictions, highlighting its potential for reliable clinical deployment.
♻ ☆ From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes NeurIPS 2025
3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,886 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, individual objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require a more comprehensive spatial reasoning ability, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best-performing models, Google Gemini-2.5-Pro and OpenAI o3, achieve just around 30% accuracy on space-level tasks and around 40% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models' capacity to understand and reason about 3D scenes beyond object-level semantics.
comment: Update v3 of the NeurIPS 2025 Datasets and Benchmarks paper (v2), including additional evaluations of state-of-the-art multimodal large language models. Project page: https://anywhere-3d.github.io/
♻ ☆ Unsupervised Monocular Depth Estimation Based on Hierarchical Feature-Guided Diffusion
Unsupervised monocular depth estimation has received widespread attention because of its capability to train without ground truth. In real-world scenarios, the images may be blurry or noisy due to the influence of weather conditions and inherent limitations of the camera. Therefore, it is particularly important to develop a robust depth estimation model. Benefiting from the training strategies of generative networks, generative-based methods often exhibit enhanced robustness. In light of this, we employ a well-converging diffusion model among generative networks for unsupervised monocular depth estimation. Additionally, we propose a hierarchical feature-guided denoising module. This model significantly enriches the model's capacity for learning and interpreting depth distribution by fully leveraging image features to guide the denoising process. Furthermore, we explore the implicit depth within reprojection and design an implicit depth consistency loss. This loss function serves to enhance the performance of the model and ensure the scale consistency of depth within a video sequence. We conduct experiments on the KITTI, Make3D, and our self-collected SIMIT datasets. The results indicate that our approach stands out among generative-based models, while also showcasing remarkable robustness.
♻ ☆ Navigation with VLM framework: Towards Going to Any Language
Navigating towards fully open language goals and exploring open scenes in an intelligent way have always raised significant challenges. Recently, Vision Language Models (VLMs) have demonstrated remarkable capabilities to reason with both language and visual data. Although many works have focused on leveraging VLMs for navigation in open scenes, they often require high computational cost, rely on object-centric approaches, or depend on environmental priors in detailed human instructions. We introduce Navigation with VLM (NavVLM), a training-free framework that harnesses open-source VLMs to enable robots to navigate effectively, even for human-friendly language goal such as abstract places, actions, or specific objects in open scenes. NavVLM leverages the VLM as its cognitive core to perceive environmental information and constantly provides exploration guidance achieving intelligent navigation with only a neat target rather than a detailed instruction with environment prior. We evaluated and validated NavVLM in both simulation and real-world experiments. In simulation, our framework achieves state-of-the-art performance in Success weighted by Path Length (SPL) on object-specifc tasks in richly detailed environments from Matterport 3D (MP3D), Habitat Matterport 3D (HM3D) and Gibson. With navigation episode reported, NavVLM demonstrates the capabilities to navigate towards any open-set languages. In real-world validation, we validated our framework's effectiveness in real-world robot at indoor scene.
comment: under review
♻ ☆ GaussianFusion: Gaussian-Based Multi-Sensor Fusion for End-to-End Autonomous Driving NeurIPS2025
Multi-sensor fusion is crucial for improving the performance and robustness of end-to-end autonomous driving systems. Existing methods predominantly adopt either attention-based flatten fusion or bird's eye view fusion through geometric transformations. However, these approaches often suffer from limited interpretability or dense computational overhead. In this paper, we introduce GaussianFusion, a Gaussian-based multi-sensor fusion framework for end-to-end autonomous driving. Our method employs intuitive and compact Gaussian representations as intermediate carriers to aggregate information from diverse sensors. Specifically, we initialize a set of 2D Gaussians uniformly across the driving scene, where each Gaussian is parameterized by physical attributes and equipped with explicit and implicit features. These Gaussians are progressively refined by integrating multi-modal features. The explicit features capture rich semantic and spatial information about the traffic scene, while the implicit features provide complementary cues beneficial for trajectory planning. To fully exploit rich spatial and semantic information in Gaussians, we design a cascade planning head that iteratively refines trajectory predictions through interactions with Gaussians. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate the effectiveness and robustness of the proposed GaussianFusion framework. The source code will be released at https://github.com/Say2L/GaussianFusion.
comment: Accepted at NeurIPS2025 (Spotlight)
♻ ☆ PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection
Multimodal large language models (MLLMs) have achieved remarkable progress on vision-language tasks, yet their reasoning processes remain sometimes unreliable. We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
♻ ☆ Radar and Event Camera Fusion for Agile Robot Ego-Motion Estimation
Achieving reliable ego motion estimation for agile robots, e.g., aerobatic aircraft, remains challenging because most robot sensors fail to respond timely and clearly to highly dynamic robot motions, often resulting in measurement blurring, distortion, and delays. In this paper, we propose an IMU-free and feature-association-free framework to achieve aggressive ego-motion velocity estimation of a robot platform in highly dynamic scenarios by combining two types of exteroceptive sensors, an event camera and a millimeter wave radar, First, we used instantaneous raw events and Doppler measurements to derive rotational and translational velocities directly. Without a sophisticated association process between measurement frames, the proposed method is more robust in texture-less and structureless environments and is more computationally efficient for edge computing devices. Then, in the back-end, we propose a continuous-time state-space model to fuse the hybrid time-based and event-based measurements to estimate the ego-motion velocity in a fixed-lagged smoother fashion. In the end, we validate our velometer framework extensively in self-collected experiment datasets. The results indicate that our IMU-free and association-free ego motion estimation framework can achieve reliable and efficient velocity output in challenging environments. The source code, illustrative video and dataset are available at https://github.com/ZzhYgwh/TwistEstimator.
comment: 2025.10.28 version v2 for TwistEstimator
♻ ☆ Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model IROS 2025
Omnidirectional depth perception is essential for mobile robotics applications that require scene understanding across a full 360{\deg} field of view. Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps without relying on expensive active sensing. However, existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, due to the scarcity of real-world data. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture. We introduce a dedicated two-stage training strategy to utilize the relative monocular depth features for our omnidirectional stereo matching before scale-invariant fine-tuning. DFI-OmniStereo achieves state-of-the-art results on the real-world Helvipad dataset, reducing disparity MAE by approximately 16% compared to the previous best omnidirectional stereo method.
comment: Accepted at IROS 2025. Project page: https://vita-epfl.github.io/DFI-OmniStereo-website/
♻ ☆ CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data
True intelligence hinges on the ability to uncover and leverage hidden causal relations. Despite significant progress in AI and computer vision (CV), there remains a lack of benchmarks for assessing models' abilities to infer latent causality from complex visual data. In this paper, we introduce \textsc{\textbf{Causal3D}}, a novel and comprehensive benchmark that integrates structured data (tables) with corresponding visual representations (images) to evaluate causal reasoning. Designed within a systematic framework, Causal3D comprises 19 3D-scene datasets capturing diverse causal relations, views, and backgrounds, enabling evaluations across scenes of varying complexity. We assess multiple state-of-the-art methods, including classical causal discovery, causal representation learning, and large/vision-language models (LLMs/VLMs). Our experiments show that as causal structures grow more complex without prior knowledge, performance declines significantly, highlighting the challenges even advanced methods face in complex causal scenarios. Causal3D serves as a vital resource for advancing causal reasoning in CV and fostering trustworthy AI in critical domains.
♻ ☆ GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation
Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning$\unicode{x2014}$a prerequisite for many applications. Empirically, we find that a dataset produced by a current training data generation pipeline has a 57.6% human validation rate. These rates stem from current limitations: single-image 3D reconstruction introduces cascading modeling errors and requires wide answer tolerances, while caption-based methods require hyper-detailed annotations and suffer from generative hallucinations. We present GRAID, built on the key insight that qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. By operating exclusively on 2D bounding boxes from standard object detectors, GRAID avoids both 3D reconstruction errors and generative hallucinations, resulting in datasets that are of higher quality than existing tools that produce similar datasets as validated by human evaluations. We apply our framework to the BDD100k, NuImages, and Waymo datasets, generating over 8.5 million high-quality VQA pairs creating questions spanning spatial relations, counting, ranking, and size comparisons. We evaluate one of the datasets and find it achieves 91.16% human-validated accuracy$\unicode{x2014}$compared to 57.6% on a dataset generated by recent work. Critically, we demonstrate that when trained on GRAID data, models learn spatial reasoning concepts that generalize: models fine-tuned on 6 question types improve on over 10 held-out types, with accuracy gains of 47.5% on BDD and 37.9% on NuImages for Llama 3.2B 11B, and when trained on all questions types, achieve improvements on several existing benchmarks such as BLINK. The GRAID framework, datasets, and additional information can be found $\href{this https URL}{here}$.
comment: 22 pages, 3 figures, 3 tables, project page: https://ke7.github.io/graid/
♻ ☆ MambaCAFU: Hybrid Multi-Scale and Multi-Attention Model with Mamba-Based Fusion for Medical Image Segmentation
In recent years, deep learning has shown near-expert performance in segmenting complex medical tissues and tumors. However, existing models are often task-specific, with performance varying across modalities and anatomical regions. Balancing model complexity and performance remains challenging, particularly in clinical settings where both accuracy and efficiency are critical. To address these issues, we propose a hybrid segmentation architecture featuring a three-branch encoder that integrates CNNs, Transformers, and a Mamba-based Attention Fusion (MAF) mechanism to capture local, global, and long-range dependencies. A multi-scale attention-based CNN decoder reconstructs fine-grained segmentation maps while preserving contextual consistency. Additionally, a co-attention gate enhances feature selection by emphasizing relevant spatial and semantic information across scales during both encoding and decoding, improving feature interaction and cross-scale communication. Extensive experiments on multiple benchmark datasets show that our approach outperforms state-of-the-art methods in accuracy and generalization, while maintaining comparable computational complexity. By effectively balancing efficiency and effectiveness, our architecture offers a practical and scalable solution for diverse medical imaging tasks. Source code and trained models will be publicly released upon acceptance to support reproducibility and further research.
♻ ☆ Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models
Physical reasoning remains a significant challenge for Vision-Language Models (VLMs). This limitation arises from an inability to translate learned knowledge into predictions about physical behavior. Although continual fine-tuning can mitigate this issue, it is expensive for large models and impractical to perform repeatedly for every task. This necessitates the creation of modular and scalable ways to teach VLMs about physical reasoning. To that end, we introduce Physics Context Builders (PCBs), a modular framework where specialized smaller VLMs are fine-tuned to generate detailed physical scene descriptions. These can be used as physical contexts to enhance the reasoning capabilities of larger VLMs. PCBs enable the separation of visual perception from reasoning, allowing us to analyze their relative contributions to physical understanding. We perform experiments on CLEVRER and on Falling Tower, a stability detection dataset with both simulated and real-world scenes, to demonstrate that PCBs provide substantial performance improvements, increasing average accuracy by up to 13.8% on complex physical reasoning tasks. Notably, PCBs also show strong Sim2Real transfer, successfully generalizing from simulated training data to real-world scenes.
♻ ☆ SignMouth: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion
Sign language translation (SLT) aims to translate natural language from sign language videos, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches mainly focus on manual signals (hand gestures) and tend to overlook non-manual cues like mouthing. In fact, mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignClip, a novel framework to improve the accuracy of sign language translation. It fuses manual and non-manual cues, specifically spatial gesture and lip movement features. Besides, SignClip introduces a hierarchical contrastive learning framework with multi-level alignment objectives, ensuring semantic consistency across sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T, in the Gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71, and ROUGE from 46.57 to 48.38.
♻ ☆ Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape
Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference.
comment: author comment: This version was previously removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission. The authors have now obtained the necessary permissions, and the paper is resubmitted accordingly
♻ ☆ Hyperparameters in Continual Learning: A Reality Check
Continual learning (CL) aims to train a model on a sequence of tasks (i.e., a CL scenario) while balancing the trade-off between plasticity (learning new tasks) and stability (retaining prior knowledge). The dominantly adopted conventional evaluation protocol for CL algorithms selects the best hyperparameters (e.g., learning rate, mini-batch size, regularization strengths, etc.) within a given scenario and then evaluates the algorithms using these hyperparameters in the same scenario. However, this protocol has significant shortcomings: it overestimates the CL capacity of algorithms and relies on unrealistic hyperparameter tuning, which is not feasible for real-world applications. From the fundamental principles of evaluation in machine learning, we argue that the evaluation of CL algorithms should focus on assessing the generalizability of their CL capacity to unseen scenarios. Based on this, we propose the Generalizable Two-phase Evaluation Protocol (GTEP) consisting of hyperparameter tuning and evaluation phases. Both phases share the same scenario configuration (e.g., number of tasks) but are generated from different datasets. Hyperparameters of CL algorithms are tuned in the first phase and applied in the second phase to evaluate the algorithms. We apply this protocol to class-incremental learning, both with and without pretrained models. Across more than 8,000 experiments, our results show that most state-of-the-art algorithms fail to replicate their reported performance, highlighting that their CL capacity has been significantly overestimated in the conventional evaluation protocol. Our implementation can be found in https://github.com/csm9493/GTEP.
comment: TMLR 2025 camera ready version
♻ ☆ Cyst-X: A Federated AI System Outperforms Clinical Guidelines to Detect Pancreatic Cancer Precursors and Reduce Unnecessary Surgery
Pancreatic cancer is projected to be the second-deadliest cancer by 2030, making early detection critical. Intraductal papillary mucinous neoplasms (IPMNs), key cancer precursors, present a clinical dilemma, as current guidelines struggle to stratify malignancy risk, leading to unnecessary surgeries or missed diagnoses. Here, we developed Cyst-X, an AI framework for IPMN risk prediction trained on a unique, multi-center dataset of 1,461 MRI scans from 764 patients. Cyst-X achieves significantly higher accuracy (AUC = 0.82) than both the established Kyoto guidelines (AUC = 0.75) and expert radiologists, particularly in correct identification of high-risk lesions. Clinically, this translates to a 20% increase in cancer detection sensitivity (87.8% vs. 64.1%) for high-risk lesions. We demonstrate that this performance is maintained in a federated learning setting, allowing for collaborative model training without compromising patient privacy. To accelerate research in early pancreatic cancer detection, we publicly release the Cyst-X dataset and models, providing the first large-scale, multi-center MRI resource for pancreatic cyst analysis.
♻ ☆ PSScreen V2: Partially Supervised Multiple Retinal Disease Screening
In this work, we propose PSScreen V2, a partially supervised self-training framework for multiple retinal disease screening. Unlike previous methods that rely on fully labelled or single-domain datasets, PSScreen V2 is designed to learn from multiple partially labelled datasets with different distributions, addressing both label absence and domain shift challenges. To this end, PSScreen V2 adopts a three-branch architecture with one teacher and two student networks. The teacher branch generates pseudo labels from weakly augmented images to address missing labels, while the two student branches introduce novel feature augmentation strategies: Low-Frequency Dropout (LF-Dropout), which enhances domain robustness by randomly discarding domain-related low-frequency components, and Low-Frequency Uncertainty (LF-Uncert), which estimates uncertain domain variability via adversarially learned Gaussian perturbations of low-frequency statistics. Extensive experiments on multiple in-domain and out-of-domain fundus datasets demonstrate that PSScreen V2 achieves state-of-the-art performance and superior domain generalization ability. Furthermore, compatibility tests with diverse backbones, including the vision foundation model DINOv2, as well as evaluations on chest X-ray datasets, highlight the universality and adaptability of the proposed framework. The codes are available at https://github.com/boyiZheng99/PSScreen_V2.
♻ ☆ Explicitly Modeling Subcortical Vision with a Neuro-Inspired Front-End Improves CNN Robustness
Convolutional neural networks (CNNs) trained on object recognition achieve high task performance but continue to exhibit vulnerability under a range of visual perturbations and out-of-domain images, when compared with biological vision. Prior work has demonstrated that coupling a standard CNN with a front-end (VOneBlock) that mimics the primate primary visual cortex (V1) can improve overall model robustness. Expanding on this, we introduce Early Vision Networks (EVNets), a new class of hybrid CNNs that combine the VOneBlock with a novel SubcorticalBlock, whose architecture draws from computational models in neuroscience and is parameterized to maximize alignment with subcortical responses reported across multiple experimental studies. Without being optimized to do so, the assembly of the SubcorticalBlock with the VOneBlock improved V1 alignment across most standard V1 benchmarks, and better modeled extra-classical receptive field phenomena. In addition, EVNets exhibit stronger emergent shape bias and outperform the base CNN architecture by 9.3% on an aggregate benchmark of robustness evaluations, including adversarial perturbations, common corruptions, and domain shifts. Finally, we show that EVNets can be further improved when paired with a state-of-the-art data augmentation technique, surpassing the performance of the isolated data augmentation approach by 6.2% on our robustness benchmark. This result reveals complementary benefits between changes in architecture to better mimic biology and training-based machine learning approaches.
♻ ☆ VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
Immediate damage assessment is essential after natural catastrophes; yet, conventional hand evaluation techniques are sluggish and perilous. Although satellite and unmanned aerial vehicle (UAV) photos offer extensive perspectives of impacted regions, current computer vision methodologies generally yield just classification labels or segmentation masks, so constraining their capacity to deliver a thorough situational comprehension. We introduce the Vision Language Caption Enhancer (VLCE), a multimodal system designed to produce comprehensive, contextually-informed explanations of disaster imagery. VLCE employs a dual-architecture approach: a CNN-LSTM model with a ResNet50 backbone pretrained on EuroSat satellite imagery for the xBD dataset, and a Vision Transformer (ViT) model pretrained on UAV pictures for the RescueNet dataset. Both systems utilize external semantic knowledge from ConceptNet and WordNet to expand vocabulary coverage and improve description accuracy. We assess VLCE in comparison to leading vision-language models (LLaVA and QwenVL) utilizing CLIPScore for semantic alignment and InfoMetIC for caption informativeness. Experimental findings indicate that VLCE markedly surpasses baseline models, attaining a maximum of 95.33% on InfoMetIC while preserving competitive semantic alignment. Our dual-architecture system demonstrates significant potential for improving disaster damage assessment by automating the production of actionable, information-dense descriptions from satellite and drone photos.
comment: 29 pages, 40 figures, 3 algorithms
♻ ☆ Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
comment: Version v1.1
Artificial Intelligence 256
☆ Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers? NeurIPS 2025
Object binding, the brain's ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term IsSameObject. We decode IsSameObject from patch embeddings across ViT layers using a similarity probe, which reaches over 90% accuracy. Crucially, this object-binding capability emerges reliably in self-supervised ViTs (DINO, MAE, CLIP), but markedly weaker in ImageNet-supervised models, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. Our findings challenge the view that ViTs lack object binding and highlight how symbolic knowledge of "which parts belong together" emerges naturally in a connectionist system.
comment: Accepted as a Spotlight at NeurIPS 2025
☆ ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?
Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.
☆ Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data protocol (ADP), a light-weight representation language that serves as an "interlingua" between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per-dataset level. In experiments, we unified a broad collection of 13 existing agent training datasets into ADP format, and converted the standardized ADP data into training-ready formats for multiple agent frameworks. We performed SFT on these data, and demonstrated an average performance gain of ~20% over corresponding base models, and delivers state-of-the-art or near-SOTA performance on standard coding, browsing, tool use, and research benchmarks, without domain-specific tuning. All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.
☆ Tongyi DeepResearch Technical Report
We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
comment: https://tongyi-agent.github.io/blog
☆ Greedy Sampling Is Provably Efficient for RLHF NeurIPS 2025
Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for post-training large language models. Despite its empirical success, the theoretical understanding of RLHF is still limited, as learning the KL-regularized target with only preference feedback poses additional challenges compared with canonical RL. Existing works mostly study the reward-based Bradley-Terry (BT) preference model, and extend classical designs utilizing optimism or pessimism. This work, instead, considers the general preference model (whose practical relevance has been observed recently) and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates (i.e., greedy sampling), as opposed to constructing optimistic or pessimistic estimates in previous works. This insight has a deep root in the unique structural property of the optimal policy class under the KL-regularized target, and we further specialize it to the BT model, highlighting the surprising sufficiency of greedy sampling in RLHF.
comment: NeurIPS 2025
☆ ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking
Parallel thinking expands exploration breadth, complementing the deep exploration of information-seeking (IS) agents to further enhance problem-solving capability. However, conventional parallel thinking faces two key challenges in this setting: inefficiency from repeatedly rolling out from scratch, and difficulty in integrating long-horizon reasoning trajectories during answer generation, as limited context capacity prevents full consideration of the reasoning process. To address these issues, we propose ParallelMuse, a two-stage paradigm designed for deep IS agents. The first stage, Functionality-Specified Partial Rollout, partitions generated sequences into functional regions and performs uncertainty-guided path reuse and branching to enhance exploration efficiency. The second stage, Compressed Reasoning Aggregation, exploits reasoning redundancy to losslessly compress information relevant to answer derivation and synthesize a coherent final answer. Experiments across multiple open-source agents and benchmarks demonstrate up to 62% performance improvement with a 10--30% reduction in exploratory token consumption.
☆ AgentFold: Long-Horizon Web Agents with Proactive Context Management
LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management. Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories, while methods that fixedly summarize the full history at each step risk the irreversible loss of critical details. Addressing these, we introduce AgentFold, a novel agent paradigm centered on proactive context management, inspired by the human cognitive process of retrospective consolidation. AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled. At each step, it learns to execute a `folding' operation, which manages its historical trajectory at multiple scales: it can perform granular condensations to preserve vital, fine-grained details, or deep consolidations to abstract away entire multi-step sub-tasks. The results on prominent benchmarks are striking: with simple supervised fine-tuning (without continual pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or matches open-source models of a dramatically larger scale, such as the DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like OpenAI's o4-mini.
comment: 26 pages, 9 figures
☆ Repurposing Synthetic Data for Fine-grained Search Agent Supervision
LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples-those with substantially correct reasoning but a flawed final answer-from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.
☆ Bridging Tool Dependencies and Domain Knowledge: A Graph-Based Framework for In-Context Planning NeurIPS 2025
We present a framework for uncovering and exploiting dependencies among tools and documents to enhance exemplar artifact generation. Our method begins by constructing a tool knowledge graph from tool schemas,including descriptions, arguments, and output payloads, using a DeepResearch-inspired analysis. In parallel, we derive a complementary knowledge graph from internal documents and SOPs, which is then fused with the tool graph. To generate exemplar plans, we adopt a deep-sparse integration strategy that aligns structural tool dependencies with procedural knowledge. Experiments demonstrate that this unified framework effectively models tool interactions and improves plan generation, underscoring the benefits of linking tool graphs with domain knowledge graphs for tool-augmented reasoning and planning.
comment: 4 pages, 2 figures, short paper, NeurIPS 2025 workshop on Bridging Language, Agent, and World Models for Reasoning and Planning
☆ Fast algorithms enabling optimization and deep learning for photoacoustic tomography in a circular detection geometry
The inverse source problem arising in photoacoustic tomography and in several other coupled-physics modalities is frequently solved by iterative algorithms. Such algorithms are based on the minimization of a certain cost functional. In addition, novel deep learning techniques are currently being investigated to further improve such optimization approaches. All such methods require multiple applications of the operator defining the forward problem, and of its adjoint. In this paper, we present new asymptotically fast algorithms for numerical evaluation of the forward and adjoint operators, applicable in the circular acquisition geometry. For an $(n \times n)$ image, our algorithms compute these operators in $\mathcal{O}(n^2 \log n)$ floating point operations. We demonstrate the performance of our algorithms in numerical simulations, where they are used as an integral part of several iterative image reconstruction techniques: classic variational methods, such as non-negative least squares and total variation regularized least squares, as well as deep learning methods, such as learned primal dual. A Python implementation of our algorithms and computational examples is available to the general public.
☆ Dissecting Role Cognition in Medical LLMs via Neuronal Ablation
Large language models (LLMs) have gained significant traction in medical decision support systems, particularly in the context of medical question answering and role-playing simulations. A common practice, Prompt-Based Role Playing (PBRP), instructs models to adopt different clinical roles (e.g., medical students, residents, attending physicians) to simulate varied professional behaviors. However, the impact of such role prompts on model reasoning capabilities remains unclear. This study introduces the RP-Neuron-Activated Evaluation Framework(RPNA) to evaluate whether role prompts induce distinct, role-specific cognitive processes in LLMs or merely modify linguistic style. We test this framework on three medical QA datasets, employing neuron ablation and representation analysis techniques to assess changes in reasoning pathways. Our results demonstrate that role prompts do not significantly enhance the medical reasoning abilities of LLMs. Instead, they primarily affect surface-level linguistic features, with no evidence of distinct reasoning pathways or cognitive differentiation across clinical roles. Despite superficial stylistic changes, the core decision-making mechanisms of LLMs remain uniform across roles, indicating that current PBRP methods fail to replicate the cognitive complexity found in real-world medical practice. This highlights the limitations of role-playing in medical AI and emphasizes the need for models that simulate genuine cognitive processes rather than linguistic imitation.We have released the related code in the following repository:https: //github.com/IAAR-Shanghai/RolePlay_LLMDoctor
comment: 15 pages, 9 figures
☆ Learning to Drive Safely with Hybrid Options
Out of the many deep reinforcement learning approaches for autonomous driving, only few make use of the options (or skills) framework. That is surprising, as this framework is naturally suited for hierarchical control applications in general, and autonomous driving tasks in specific. Therefore, in this work the options framework is applied and tailored to autonomous driving tasks on highways. More specifically, we define dedicated options for longitudinal and lateral manoeuvres with embedded safety and comfort constraints. This way, prior domain knowledge can be incorporated into the learning process and the learned driving behaviour can be constrained more easily. We propose several setups for hierarchical control with options and derive practical algorithms following state-of-the-art reinforcement learning techniques. By separately selecting actions for longitudinal and lateral control, the introduced policies over combined and hybrid options obtain the same expressiveness and flexibility that human drivers have, while being easier to interpret than classical policies over continuous actions. Of all the investigated approaches, these flexible policies over hybrid options perform the best under varying traffic conditions, outperforming the baseline policies over actions.
☆ Multi-Agent Scenario Generation in Roundabouts with a Transformer-enhanced Conditional Variational Autoencoder
With the increasing integration of intelligent driving functions into serial-produced vehicles, ensuring their functionality and robustness poses greater challenges. Compared to traditional road testing, scenario-based virtual testing offers significant advantages in terms of time and cost efficiency, reproducibility, and exploration of edge cases. We propose a Transformer-enhanced Conditional Variational Autoencoder (CVAE-T) model for generating multi-agent traffic scenarios in roundabouts, which are characterized by high vehicle dynamics and complex layouts, yet remain relatively underexplored in current research. The results show that the proposed model can accurately reconstruct original scenarios and generate realistic, diverse synthetic scenarios. Besides, two Key-Performance-Indicators (KPIs) are employed to evaluate the interactive behavior in the generated scenarios. Analysis of the latent space reveals partial disentanglement, with several latent dimensions exhibiting distinct and interpretable effects on scenario attributes such as vehicle entry timing, exit timing, and velocity profiles. The results demonstrate the model's capability to generate scenarios for the validation of intelligent driving functions involving multi-agent interactions, as well as to augment data for their development and iterative improvement.
☆ InteractComp: Evaluating Search Agents With Ambiguous Queries
Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of easy to verify, interact to disambiguate, we construct 210 expert-curated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals striking failure: the best model achieves only 13.73% accuracy despite 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability current strategies fail to engage. Longitudinal analysis shows interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents. The code is available at https://github.com/FoundationAgents/InteractComp.
☆ OrchDAG: Complex Tool Orchestration in Multi-Turn Interactions with Plan DAGs
Agentic tool use has gained traction with the rise of agentic tool calling, yet most existing work overlooks the complexity of multi-turn tool interactions. We introduce OrchDAG, a synthetic data generation pipeline that models tool execution as directed acyclic graphs (DAGs) with controllable complexity. Using this dataset, we benchmark model performance and propose a graph-based reward to enhance RLVR training. Experiments show that the dataset presents a challenging but solvable benchmark, and the proposed reward is effective when combined with GRPO-style algorithms, highlighting the importance of leveraging topological structure and data complexity in multi-turn tool use.
comment: 9 pages, 4 figures
☆ Advancing site-specific disease and pest management in precision agriculture: From reasoning-driven foundation models to adaptive, feedback-based learning
Site-specific disease management (SSDM) in crops has advanced rapidly through machine and deep learning (ML and DL) for real-time computer vision. Research evolved from handcrafted feature extraction to large-scale automated feature learning. With foundation models (FMs), crop disease datasets are now processed in fundamentally new ways. Unlike traditional neural networks, FMs integrate visual and textual data, interpret symptoms in text, reason about symptom-management relationships, and support interactive QA for growers and educators. Adaptive and imitation learning in robotics further enables field-based disease management. This review screened approx. 40 articles on FM applications for SSDM, focusing on large-language models (LLMs) and vision-language models (VLMs), and discussing their role in adaptive learning (AL), reinforcement learning (RL), and digital twin frameworks for targeted spraying. Key findings: (a) FMs are gaining traction with surging literature in 2023-24; (b) VLMs outpace LLMs, with a 5-10x increase in publications; (c) RL and AL are still nascent for smart spraying; (d) digital twins with RL can simulate targeted spraying virtually; (e) addressing the sim-to-real gap is critical for real-world deployment; (f) human-robot collaboration remains limited, especially in human-in-the-loop approaches where robots detect early symptoms and humans validate uncertain cases; (g) multi-modal FMs with real-time feedback will drive next-gen SSDM. For updates, resources, and contributions, visit, https://github.com/nitin-dominic/AgriPathogenDatabase, to submit papers, code, or datasets.
comment: 26 pages, 8 figures, and 2 tables
☆ FunReason-MT Technical Report: Overcoming the Complexity Barrier in Multi-Turn Function Calling
Function calling (FC) empowers large language models (LLMs) and autonomous agents to interface with external tools, a critical capability for solving complex, real-world problems. As this ability becomes increasingly central to advanced AI systems, the need for high-quality, multi-turn training data to develop and refine it cannot be overstated. Existing data synthesis methods, such as random environment sampling or multi-agent role-playing, are not powerful enough to generate high-quality data in real-world environments. Practical challenges come in three folds: targeted model training, isolation of tool architecture, and multi-turn logical dependency. To address these structural deficiencies, we present FunReason-MT, a novel data synthesis framework for real-world multi-turn tool use. FunReason-MT resolves the complexity barrier in multi-turn FC data by employing 1) Environment-API Graph Interactions to gather varied high-quality trajectories, 2) Advanced Tool-Query Synthesis to simplify hard query construction, and 3) Guided Iterative Chain for sophisticated CoT generation. Evaluations on Berkeley Function-Calling Leaderboard (BFCLv3) demonstrate the power of our framework: a 4B model built upon FunReason-MT generated data achieves state-of-the-art performance among comparable-sized models, outperforming most close-source models. Further performance improvements on BFCLv4 confirm that FunReason-MT provides a reliable and robust source for agentic learning.
☆ The Cost of Robustness: Tighter Bounds on Parameter Complexity for Robust Memorization in ReLU Nets NeurIPS 2025
We study the parameter complexity of robust memorization for $\mathrm{ReLU}$ networks: the number of parameters required to interpolate any given dataset with $\epsilon$-separation between differently labeled points, while ensuring predictions remain consistent within a $\mu$-ball around each training sample. We establish upper and lower bounds on the parameter count as a function of the robustness ratio $\rho = \mu / \epsilon$. Unlike prior work, we provide a fine-grained analysis across the entire range $\rho \in (0,1)$ and obtain tighter upper and lower bounds that improve upon existing results. Our findings reveal that the parameter complexity of robust memorization matches that of non-robust memorization when $\rho$ is small, but grows with increasing $\rho$.
comment: Accepted to NeurIPS 2025, 72 pages, 8 figures
☆ Causal Ordering for Structure Learning From Time Series
Predicting causal structure from time series data is crucial for understanding complex phenomena in physiology, brain connectivity, climate dynamics, and socio-economic behaviour. Causal discovery in time series is hindered by the combinatorial complexity of identifying true causal relationships, especially as the number of variables and time points grow. A common approach to simplify the task is the so-called ordering-based methods. Traditional ordering methods inherently limit the representational capacity of the resulting model. In this work, we fix this issue by leveraging multiple valid causal orderings, instead of a single one as standard practice. We propose DOTS (Diffusion Ordered Temporal Structure), using diffusion-based causal discovery for temporal data. By integrating multiple orderings, DOTS effectively recovers the transitive closure of the underlying directed acyclic graph, mitigating spurious artifacts inherent in single-ordering approaches. We formalise the problem under standard assumptions such as stationarity and the additive noise model, and leverage score matching with diffusion processes to enable efficient Hessian estimation. Extensive experiments validate the approach. Empirical evaluations on synthetic and real-world datasets demonstrate that DOTS outperforms state-of-the-art baselines, offering a scalable and robust approach to temporal causal discovery. On synthetic benchmarks ($d{=}\!3-\!6$ variables, $T{=}200\!-\!5{,}000$ samples), DOTS improves mean window-graph $F1$ from $0.63$ (best baseline) to $0.81$. On the CausalTime real-world benchmark ($d{=}20\!-\!36$), while baselines remain the best on individual datasets, DOTS attains the highest average summary-graph $F1$ while halving runtime relative to graph-optimisation methods. These results establish DOTS as a scalable and accurate solution for temporal causal discovery.
comment: 32 pages
☆ All in one timestep: Enhancing Sparsity and Energy efficiency in Multi-level Spiking Neural Networks
Spiking Neural Networks (SNNs) are one of the most promising bio-inspired neural networks models and have drawn increasing attention in recent years. The event-driven communication mechanism of SNNs allows for sparse and theoretically low-power operations on dedicated neuromorphic hardware. However, the binary nature of instantaneous spikes also leads to considerable information loss in SNNs, resulting in accuracy degradation. To address this issue, we propose a multi-level spiking neuron model able to provide both low-quantization error and minimal inference latency while approaching the performance of full precision Artificial Neural Networks (ANNs). Experimental results with popular network architectures and datasets, show that multi-level spiking neurons provide better information compression, allowing therefore a reduction in latency without performance loss. When compared to binary SNNs on image classification scenarios, multi-level SNNs indeed allow reducing by 2 to 3 times the energy consumption depending on the number of quantization intervals. On neuromorphic data, our approach allows us to drastically reduce the inference latency to 1 timestep, which corresponds to a compression factor of 10 compared to previously published results. At the architectural level, we propose a new residual architecture that we call Sparse-ResNet. Through a careful analysis of the spikes propagation in residual connections we highlight a spike avalanche effect, that affects most spiking residual architectures. Using our Sparse-ResNet architecture, we can provide state-of-the-art accuracy results in image classification while reducing by more than 20% the network activity compared to the previous spiking ResNets.
☆ Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation
With the release of new large language models (LLMs) like Llama and Mistral, zero-shot cross-lingual transfer has become increasingly feasible due to their multilingual pretraining and strong generalization capabilities. However, adapting these decoder-only LLMs to new tasks across languages remains challenging. While parameter-efficient fine-tuning (PeFT) techniques like Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as soft prompt tuning, prefix tuning, and Llama Adapter are less explored, especially for zero-shot transfer in decoder-only models. We present a comprehensive study of three prefix-based methods for zero-shot cross-lingual transfer from English to 35+ high- and low-resource languages. Our analysis further explores transfer across linguistic families and scripts, as well as the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix methods outperform LoRA-baselines by up to 6% on the Belebele benchmark. Similar improvements were observed with Mistral v0.3 7B as well. Despite using only 1.23M learning parameters with prefix tuning, we achieve consistent improvements across diverse benchmarks. These findings highlight the potential of prefix-based techniques as an effective and scalable alternative to LoRA, particularly in low-resource multilingual settings.
comment: 12 Pages
☆ DistDF: Time-Series Forecasting Needs Joint-Distribution Wasserstein Alignment
Training time-series forecast models requires aligning the conditional distribution of model forecasts with that of the label sequence. The standard direct forecast (DF) approach resorts to minimize the conditional negative log-likelihood of the label sequence, typically estimated using the mean squared error. However, this estimation proves to be biased in the presence of label autocorrelation. In this paper, we propose DistDF, which achieves alignment by alternatively minimizing a discrepancy between the conditional forecast and label distributions. Because conditional discrepancies are difficult to estimate from finite time-series observations, we introduce a newly proposed joint-distribution Wasserstein discrepancy for time-series forecasting, which provably upper bounds the conditional discrepancy of interest. This discrepancy admits tractable, differentiable estimation from empirical samples and integrates seamlessly with gradient-based training. Extensive experiments show that DistDF improves the performance diverse forecast models and achieves the state-of-the-art forecasting performance. Code is available at https://anonymous.4open.science/r/DistDF-F66B.
☆ LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis
With the widespread adoption of LLMs, LoRA has become a dominant method for PEFT, and its initialization methods have attracted increasing attention. However, existing methods have notable limitations: many methods do not incorporate target-domain data, while gradient-based methods exploit data only at a shallow level by relying on one-step gradient decomposition, which remains unsatisfactory due to the weak empirical performance of the one-step fine-tuning model that serves as their basis, as well as the fact that these methods either lack a rigorous theoretical foundation or depend heavily on restrictive isotropic assumptions. In this paper, we establish a theoretical framework for data-aware LoRA initialization based on asymptotic analysis. Starting from a general optimization objective that minimizes the expectation of the parameter discrepancy between the fine-tuned and target models, we derive an optimization problem with two components: a bias term, which is related to the parameter distance between the fine-tuned and target models, and is approximated using a Fisher-gradient formulation to preserve anisotropy; and a variance term, which accounts for the uncertainty introduced by sampling stochasticity through the Fisher information. By solving this problem, we obtain an optimal initialization strategy for LoRA. Building on this theoretical framework, we develop an efficient algorithm, LoRA-DA, which estimates the terms in the optimization problem from a small set of target domain samples and obtains the optimal LoRA initialization. Empirical results across multiple benchmarks demonstrate that LoRA-DA consistently improves final accuracy over existing initialization methods. Additional studies show faster, more stable convergence, robustness across ranks, and only a small initialization overhead for LoRA-DA. The source code will be released upon publication.
☆ Generative AI for Healthcare: Fundamentals, Challenges, and Perspectives
Generative Artificial Intelligence (GenAI) is taking the world by storm. It promises transformative opportunities for advancing and disrupting existing practices, including healthcare. From large language models (LLMs) for clinical note synthesis and conversational assistance to multimodal systems that integrate medical imaging, electronic health records, and genomic data for decision support, GenAI is transforming the practice of medicine and the delivery of healthcare, such as diagnosis and personalized treatments, with great potential in reducing the cognitive burden on clinicians, thereby improving overall healthcare delivery. However, GenAI deployment in healthcare requires an in-depth understanding of healthcare tasks and what can and cannot be achieved. In this paper, we propose a data-centric paradigm in the design and deployment of GenAI systems for healthcare. Specifically, we reposition the data life cycle by making the medical data ecosystem as the foundational substrate for generative healthcare systems. This ecosystem is designed to sustainably support the integration, representation, and retrieval of diverse medical data and knowledge. With effective and efficient data processing pipelines, such as semantic vector search and contextual querying, it enables GenAI-powered operations for upstream model components and downstream clinical applications. Ultimately, it not only supplies foundation models with high-quality, multimodal data for large-scale pretraining and domain-specific fine-tuning, but also serves as a knowledge retrieval backend to support task-specific inference via the agentic layer. The ecosystem enables the deployment of GenAI for high-quality and effective healthcare delivery.
☆ Quantum-Resistant Networks Using Post-Quantum Cryptography
Quantum networks rely on both quantum and classical channels for coordinated operation. Current architectures employ entanglement distribution and key exchange over quantum channels but often assume that classical communication is sufficiently secure. In practice, classical channels protected by traditional cryptography remain vulnerable to quantum adversaries, since large-scale quantum computers could break widely used public-key schemes and reduce the effective security of symmetric cryptography. This perspective presents a quantum-resistant network architecture that secures classical communication with post-quantum cryptographic techniques while supporting entanglement-based communication over quantum channels. Beyond cryptographic protection, the framework incorporates continuous monitoring of both quantum and classical layers, together with orchestration across heterogeneous infrastructures, to ensure end-to-end security. Collectively, these mechanisms provide a pathway toward scalable, robust, and secure quantum networks that remain dependable against both classical and quantum-era threats.
comment: Submission for 2025 IEEE Workshop on Quantum IntelLigence, Learning & Security (QUILLS), https://sites.google.com/view/quills2025/home
☆ From Cross-Task Examples to In-Task Prompts: A Graph-Based Pseudo-Labeling Framework for In-context Learning
The capability of in-context learning (ICL) enables large language models (LLMs) to perform novel tasks without parameter updates by conditioning on a few input-output examples. However, collecting high-quality examples for new or challenging tasks can be costly and labor-intensive. In this work, we propose a cost-efficient two-stage pipeline that reduces reliance on LLMs for data labeling. Our approach first leverages readily available cross-task examples to prompt an LLM and pseudo-label a small set of target task instances. We then introduce a graph-based label propagation method that spreads label information to the remaining target examples without additional LLM queries. The resulting fully pseudo-labeled dataset is used to construct in-task demonstrations for ICL. This pipeline combines the flexibility of cross-task supervision with the scalability of LLM-free propagation. Experiments across five tasks demonstrate that our method achieves strong performance while lowering labeling costs.
☆ Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient
Extracting features from the speech is the most critical process in speech signal processing. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used features in the majority of the speaker and speech recognition applications, as the filtering in this feature is similar to the filtering taking place in the human ear. But the main drawback of this feature is that it provides only the frequency information of the signal but does not provide the information about at what time which frequency is present. The wavelet transform, with its flexible time-frequency window, provides time and frequency information of the signal and is an appropriate tool for the analysis of non-stationary signals like speech. On the other hand, because of its uniform frequency scaling, a typical wavelet transform may be less effective in analysing speech signals, have poorer frequency resolution in low frequencies, and be less in line with human auditory perception. Hence, it is necessary to develop a feature that incorporates the merits of both MFCC and wavelet transform. A great deal of studies are trying to combine both these features. The present Wavelet Transform based Mel-scaled feature extraction methods require more computation when a wavelet transform is applied on top of Mel-scale filtering, since it adds extra processing steps. Here we are proposing a method to extract Mel scale features in time domain combining the concept of wavelet transform, thus reducing the computational burden of time-frequency conversion and the complexity of wavelet extraction. Combining our proposed Time domain Mel frequency Wavelet Coefficient(TMFWC) technique with the reservoir computing methodology has significantly improved the efficiency of audio signal processing.
☆ Local Performance vs. Out-of-Distribution Generalization: An Empirical Analysis of Personalized Federated Learning in Heterogeneous Data Environments
In the context of Federated Learning with heterogeneous data environments, local models tend to converge to their own local model optima during local training steps, deviating from the overall data distributions. Aggregation of these local updates, e.g., with FedAvg, often does not align with the global model optimum (client drift), resulting in an update that is suboptimal for most clients. Personalized Federated Learning approaches address this challenge by exclusively focusing on the average local performances of clients' models on their own data distribution. Generalization to out-of-distribution samples, which is a substantial benefit of FedAvg and represents a significant component of robustness, appears to be inadequately incorporated into the assessment and evaluation processes. This study involves a thorough evaluation of Federated Learning approaches, encompassing both their local performance and their generalization capabilities. Therefore, we examine different stages within a single communication round to enable a more nuanced understanding of the considered metrics. Furthermore, we propose and incorporate a modified approach of FedAvg, designated as Federated Learning with Individualized Updates (FLIU), extending the algorithm by a straightforward individualization step with an adaptive personalization factor. We evaluate and compare the approaches empirically using MNIST and CIFAR-10 under various distributional conditions, including benchmark IID and pathological non-IID, as well as additional novel test environments with Dirichlet distribution specifically developed to stress the algorithms on complex data heterogeneity.
☆ Design and Optimization of Cloud Native Homomorphic Encryption Workflows for Privacy-Preserving ML Inference
As machine learning (ML) models become increasingly deployed through cloud infrastructures, the confidentiality of user data during inference poses a significant security challenge. Homomorphic Encryption (HE) has emerged as a compelling cryptographic technique that enables computation on encrypted data, allowing predictions to be generated without decrypting sensitive inputs. However, the integration of HE within large scale cloud native pipelines remains constrained by high computational overhead, orchestration complexity, and model compatibility issues. This paper presents a systematic framework for the design and optimization of cloud native homomorphic encryption workflows that support privacy-preserving ML inference. The proposed architecture integrates containerized HE modules with Kubernetes-based orchestration, enabling elastic scaling and parallel encrypted computation across distributed environments. Furthermore, optimization strategies including ciphertext packing, polynomial modulus adjustment, and operator fusion are employed to minimize latency and resource consumption while preserving cryptographic integrity. Experimental results demonstrate that the proposed system achieves up to 3.2times inference acceleration and 40% reduction in memory utilization compared to conventional HE pipelines. These findings illustrate a practical pathway for deploying secure ML-as-a-Service (MLaaS) systems that guarantee data confidentiality under zero-trust cloud conditions.
comment: 6 pages 2 figures, 2 tABLES
☆ Online neural fusion of distortionless differential beamformers for robust speech enhancement
Fixed beamforming is widely used in practice since it does not depend on the estimation of noise statistics and provides relatively stable performance. However, a single beamformer cannot adapt to varying acoustic conditions, which limits its interference suppression capability. To address this, adaptive convex combination (ACC) algorithms have been introduced, where the outputs of multiple fixed beamformers are linearly combined to improve robustness. Nevertheless, ACC often fails in highly non-stationary scenarios, such as rapidly moving interference, since its adaptive updates cannot reliably track rapid changes. To overcome this limitation, we propose a frame-online neural fusion framework for multiple distortionless differential beamformers, which estimates the combination weights through a neural network. Compared with conventional ACC, the proposed method adapts more effectively to dynamic acoustic environments, achieving stronger interference suppression while maintaining the distortionless constraint.
☆ Diffusion Models for Wireless Transceivers: From Pilot-Efficient Channel Estimation to AI-Native 6G Receivers
With the development of artificial intelligence (AI) techniques, implementing AI-based techniques to improve wireless transceivers becomes an emerging research topic. Within this context, AI-based channel characterization and estimation become the focus since these methods have not been solved by traditional methods very well and have become the bottleneck of transceiver efficiency in large-scale orthogonal frequency division multiplexing (OFDM) systems. Specifically, by formulating channel estimation as a generative AI problem, generative AI methods such as diffusion models (DMs) can efficiently deal with rough initial estimations and have great potential to cooperate with traditional signal processing methods. This paper focuses on the transceiver design of OFDM systems based on DMs, provides an illustration of the potential of DMs in wireless transceivers, and points out the related research directions brought by DMs. We also provide a proof-of-concept case study of further adapting DMs for better wireless receiver performance.
comment: Submitted for potential publication in IEEE Wireless Communications
☆ A word association network methodology for evaluating implicit biases in LLMs compared to humans
As Large language models (LLMs) become increasingly integrated into our lives, their inherent social biases remain a pressing concern. Detecting and evaluating these biases can be challenging because they are often implicit rather than explicit in nature, so developing evaluation methods that assess the implicit knowledge representations of LLMs is essential. We present a novel word association network methodology for evaluating implicit biases in LLMs based on simulating semantic priming within LLM-generated word association networks. Our prompt-based approach taps into the implicit relational structures encoded in LLMs, providing both quantitative and qualitative assessments of bias. Unlike most prompt-based evaluation methods, our method enables direct comparisons between various LLMs and humans, providing a valuable point of reference and offering new insights into the alignment of LLMs with human cognition. To demonstrate the utility of our methodology, we apply it to both humans and several widely used LLMs to investigate social biases related to gender, religion, ethnicity, sexual orientation, and political party. Our results reveal both convergences and divergences between LLM and human biases, providing new perspectives on the potential risks of using LLMs. Our methodology contributes to a systematic, scalable, and generalizable framework for evaluating and comparing biases across multiple LLMs and humans, advancing the goal of transparent and socially responsible language technologies.
comment: 24 pages, 13 figures, 3 tables
☆ Sample-efficient and Scalable Exploration in Continuous-Time RL
Reinforcement learning algorithms are typically designed for discrete-time dynamics, even though the underlying real-world control systems are often continuous in time. In this paper, we study the problem of continuous-time reinforcement learning, where the unknown system dynamics are represented using nonlinear ordinary differential equations (ODEs). We leverage probabilistic models, such as Gaussian processes and Bayesian neural networks, to learn an uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily maximizes a weighted sum of the extrinsic reward and model epistemic uncertainty. This yields a scalable and sample-efficient approach to continuous-time model-based RL. We show that COMBRL achieves sublinear regret in the reward-driven setting, and in the unsupervised RL setting (i.e., without extrinsic rewards), we provide a sample complexity bound. In our experiments, we evaluate COMBRL in both standard and unsupervised RL settings and demonstrate that it scales better, is more sample-efficient than prior methods, and outperforms baselines across several deep RL tasks.
comment: 26 pages, 6 figures, 6 tables
☆ Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems
Hallucination remains one of the key obstacles to the reliable deployment of large language models (LLMs), particularly in real-world applications. Among various mitigation strategies, Retrieval-Augmented Generation (RAG) and reasoning enhancement have emerged as two of the most effective and widely adopted approaches, marking a shift from merely suppressing hallucinations to balancing creativity and reliability. However, their synergistic potential and underlying mechanisms for hallucination mitigation have not yet been systematically examined. This survey adopts an application-oriented perspective of capability enhancement to analyze how RAG, reasoning enhancement, and their integration in Agentic Systems mitigate hallucinations. We propose a taxonomy distinguishing knowledge-based and logic-based hallucinations, systematically examine how RAG and reasoning address each, and present a unified framework supported by real-world applications, evaluations, and benchmarks.
comment: 25 pages, 7 figures, 3 tables
☆ Iterative Critique-Refine Framework for Enhancing LLM Personalization
Personalized text generation requires models not only to produce coherent text but also to align with a target user's style, tone, and topical focus. Existing retrieval-augmented approaches such as LaMP and PGraphRAG enrich profiles with user and neighbor histories, but they stop at generation and often yield outputs that drift in tone, topic, or style. We present PerFine, a unified, training-free critique-refine framework that enhances personalization through iterative, profile-grounded feedback. In each iteration, an LLM generator produces a draft conditioned on the retrieved profile, and a critic LLM - also conditioned on the same profile - provides structured feedback on tone, vocabulary, sentence structure, and topicality. The generator then revises, while a novel knockout strategy retains the stronger draft across iterations. We further study additional inference-time strategies such as Best-of-N and Topic Extraction to balance quality and efficiency. Across Yelp, Goodreads, and Amazon datasets, PerFine consistently improves personalization over PGraphRAG, with GEval gains of +7-13%, steady improvements over 3-5 refinement iterations, and scalability with increasing critic size. These results highlight that post-hoc, profile-aware feedback offers a powerful paradigm for personalized LLM generation that is both training-free and model-agnostic.
☆ Adaptive Surrogate Gradients for Sequential Reinforcement Learning in Spiking Neural Networks
Neuromorphic computing systems are set to revolutionize energy-constrained robotics by achieving orders-of-magnitude efficiency gains, while enabling native temporal processing. Spiking Neural Networks (SNNs) represent a promising algorithmic approach for these systems, yet their application to complex control tasks faces two critical challenges: (1) the non-differentiable nature of spiking neurons necessitates surrogate gradients with unclear optimization properties, and (2) the stateful dynamics of SNNs require training on sequences, which in reinforcement learning (RL) is hindered by limited sequence lengths during early training, preventing the network from bridging its warm-up period. We address these challenges by systematically analyzing surrogate gradient slope settings, showing that shallower slopes increase gradient magnitude in deeper layers but reduce alignment with true gradients. In supervised learning, we find no clear preference for fixed or scheduled slopes. The effect is much more pronounced in RL settings, where shallower slopes or scheduled slopes lead to a 2.1x improvement in both training and final deployed performance. Next, we propose a novel training approach that leverages a privileged guiding policy to bootstrap the learning process, while still exploiting online environment interactions with the spiking policy. Combining our method with an adaptive slope schedule for a real-world drone position control task, we achieve an average return of 400 points, substantially outperforming prior techniques, including Behavioral Cloning and TD3BC, which achieve at most --200 points under the same conditions. This work advances both the theoretical understanding of surrogate gradient learning in SNNs and practical training methodologies for neuromorphic controllers demonstrated in real-world robotic systems.
☆ Affordance Representation and Recognition for Autonomous Agents
The autonomy of software agents is fundamentally dependent on their ability to construct an actionable internal world model from the structured data that defines their digital environment, such as the Document Object Model (DOM) of web pages and the semantic descriptions of web services. However, constructing this world model from raw structured data presents two critical challenges: the verbosity of raw HTML makes it computationally intractable for direct use by foundation models, while the static nature of hardcoded API integrations prevents agents from adapting to evolving services. This paper introduces a pattern language for world modeling from structured data, presenting two complementary architectural patterns. The DOM Transduction Pattern addresses the challenge of web page complexity by distilling} a verbose, raw DOM into a compact, task-relevant representation or world model optimized for an agent's reasoning core. Concurrently, the Hypermedia Affordances Recognition Pattern enables the agent to dynamically enrich its world model by parsing standardized semantic descriptions to discover and integrate the capabilities of unknown web services at runtime. Together, these patterns provide a robust framework for engineering agents that can efficiently construct and maintain an accurate world model, enabling scalable, adaptive, and interoperable automation across the web and its extended resources.
☆ Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices LREC 2026
While new benchmarks for large language models (LLMs) are being developed continuously to catch up with the growing capabilities of new models and AI in general, using and evaluating LLMs in non-English languages remains a little-charted landscape. We give a concise overview of recent developments in LLM benchmarking, and then propose a new taxonomy for the categorization of benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a set of best practices and quality standards that could lead to a more coordinated development of benchmarks for European languages. Among other recommendations, we advocate for a higher language and culture sensitivity of evaluation methods.
comment: 12 pages, 1 figure. Submitted to the LREC 2026 conference
☆ Rethinking Visual Intelligence: Insights from Video Pretraining
Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
comment: Updated version from preprint arXiv:2506.07280 (Gen2Gen) focused on visual intelligence. This work can be considered as v2
☆ Law in Silico: Simulating Legal Society with LLM-Based Agents
Since real-world legal experiments are often costly or infeasible, simulating legal societies with Artificial Intelligence (AI) systems provides an effective alternative for verifying and developing legal theory, as well as supporting legal administration. Large Language Models (LLMs), with their world knowledge and role-playing capabilities, are strong candidates to serve as the foundation for legal society simulation. However, the application of LLMs to simulate legal systems remains underexplored. In this work, we introduce Law in Silico, an LLM-based agent framework for simulating legal scenarios with individual decision-making and institutional mechanisms of legislation, adjudication, and enforcement. Our experiments, which compare simulated crime rates with real-world data, demonstrate that LLM-based agents can largely reproduce macro-level crime trends and provide insights that align with real-world observations. At the same time, micro-level simulations reveal that a well-functioning, transparent, and adaptive legal system offers better protection of the rights of vulnerable individuals.
☆ Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content NeurIPS 2025
Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. Our dual-agent framework uses a quantitative agent for citation verification and six-dimensional scoring (e.g., Structure, Islamic Consistency, Citations) and a qualitative agent for five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality). GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong performance, models still fall short in reliably producing accurate Islamic content and citations -- a paramount requirement in faith-sensitive writing. GPT-4o had the highest mean quantitative score (3.90/5), while Ansari AI led qualitative pairwise wins (116/200). Fanar, though trailing, introduces innovations for Islamic and Arabic contexts. This study underscores the need for community-driven benchmarks centering Muslim perspectives, offering an early step toward more reliable AI in Islamic knowledge and other high-stakes domains such as medicine, law, and journalism.
comment: Accepted at 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: 5th Muslims in Machine Learning (MusIML) Workshop
☆ Human-Level Reasoning: A Comparative Study of Large Language Models on Logical and Abstract Reasoning
Evaluating reasoning ability in Large Language Models (LLMs) is important for advancing artificial intelligence, as it transcends mere linguistic task performance. It involves understanding whether these models truly understand information, perform inferences, and are able to draw conclusions in a logical and valid way. This study compare logical and abstract reasoning skills of several LLMs - including GPT, Claude, DeepSeek, Gemini, Grok, Llama, Mistral, Perplexity, and Sabi\'a - using a set of eight custom-designed reasoning questions. The LLM results are benchmarked against human performance on the same tasks, revealing significant differences and indicating areas where LLMs struggle with deduction.
comment: 12 pages
☆ MiniOneRec: An Open-Source Framework for Scaling Generative Recommendation
The recent success of large language models (LLMs) has renewed interest in whether recommender systems can achieve similar scaling benefits. Conventional recommenders, dominated by massive embedding tables, tend to plateau as embedding dimensions grow. In contrast, the emerging generative paradigm replaces embeddings with compact Semantic ID (SID) sequences produced by autoregressive Transformers. Yet most industrial deployments remain proprietary, leaving two fundamental questions open: (1) Do the expected scaling laws hold on public benchmarks? (2) What is the minimal post-training recipe that enables competitive performance? We present MiniOneRec, to the best of our knowledge, the first fully open-source generative recommendation framework, which provides an end-to-end workflow spanning SID construction, supervised fine-tuning, and recommendation-oriented reinforcement learning. We generate SIDs via a Residual Quantized VAE and post-train Qwen backbones ranging from 0.5B to 7B parameters on the Amazon Review dataset. Our experiments reveal a consistent downward trend in both training and evaluation losses with increasing model size, validating the parameter efficiency of the generative approach. To further enhance performance, we propose a lightweight yet effective post-training pipeline that (1) enforces full-process SID alignment and (2) applies reinforcement learning with constrained decoding and hybrid rewards. Together, these techniques yield significant improvements in both ranking accuracy and candidate diversity.
comment: Technical Report
☆ OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents.
comment: work in progress
☆ Metadata-Driven Retrieval-Augmented Generation for Financial Question Answering
Retrieval-Augmented Generation (RAG) struggles on long, structured financial filings where relevant evidence is sparse and cross-referenced. This paper presents a systematic investigation of advanced metadata-driven Retrieval-Augmented Generation (RAG) techniques, proposing and evaluating a novel, multi-stage RAG architecture that leverages LLM-generated metadata. We introduce a sophisticated indexing pipeline to create contextually rich document chunks and benchmark a spectrum of enhancements, including pre-retrieval filtering, post-retrieval reranking, and enriched embeddings, benchmarked on the FinanceBench dataset. Our results reveal that while a powerful reranker is essential for precision, the most significant performance gains come from embedding chunk metadata directly with text ("contextual chunks"). Our proposed optimal architecture combines LLM-driven pre-retrieval optimizations with these contextual embeddings to achieve superior performance. Additionally, we present a custom metadata reranker that offers a compelling, cost-effective alternative to commercial solutions, highlighting a practical trade-off between peak performance and operational efficiency. This study provides a blueprint for building robust, metadata-aware RAG systems for financial document analysis.
comment: Preprint version submitted to the International Journal of Accounting Information Systems; currently under major revision. 20 pages, 1 figure, 1 table
☆ APTBench: Benchmarking Agentic Potential of Base LLMs During Pre-Training
With the rapid development of LLM-based agents, there is a growing trend to incorporate agent-specific data into the pre-training stage of LLMs, aiming to better align LLMs with real-world autonomous task execution. However, current pre-training benchmarks primarily focus on isolated and static skills, e.g., common knowledge or mathematical/code reasoning, and fail to reflect model's agentic capabilities. On the other hand, agent benchmarks are typically designed for post-trained models, requiring multi-turn task execution abilities that base models struggle to support. Thus, there is a compelling need for a benchmark that can evaluate agentic potentials during pre-training and guide the model training more effectively. To address this gap, we propose APTBench, a framework that converts real-world agent tasks and successful trajectories into multiple-choice or text completion questions tailored for base models. It focuses on core agentic abilities, e.g., planning and action, and covers key agent scenarios, software engineering and deep research. Compared to existing general-purpose benchmarks, APTBench offers a more predictive signal of a model's downstream performance as an agent, while remaining significantly more lightweight and cost-effective than full-scale, end-to-end agent evaluations after post-training.
comment: 46 pages
☆ Improving LLM Reasoning via Dependency-Aware Query Decomposition and Logic-Parallel Content Expansion
The integration of Large Language Models (LLMs) into real-time Web applications, such as AI-powered search and conversational agents, presents a fundamental Web infrastructure challenge: reconciling the demand for high-quality, complex reasoning with the stringent low-latency and high-throughput requirements of interactive services. Current LLM reasoning, hindered by computationally inefficient sequential generation and rigid reasoning strategies, creates a critical bottleneck for the Web services. Existing approaches typically optimize the LLM reasoning for either efficiency or quality but struggle to achieve both, and thus fail to meet the dual requirements of modern Web platforms. To overcome these limitations, we propose Orion, a novel and efficient reasoning framework that enables dependency-aware query decomposition and logic-parallel content expansion. Concretely, Orion decomposes a single query reasoning process into two synergistic phases: (1) \textit{key point generation}, which distills logically structured key points through retrieval-augmented few-shot prompting, and (2) \textit{content parallel expansion}, which concurrently elaborates on these points based on a dependency graph to ensure logical consistency. Furthermore, Orion introduces a pipeline scheduling mechanism that exploits the complementary computational characteristics of the two phases (generation imposes pressure on GPU computing and expansion stresses on GPU memory) across multiple queries, enabling cross-query parallelism and dramatically improving reasoning performance (\ie, efficiency and quality). Experiments on diverse benchmarks show that Orion not only delivers up to 4.33x higher token generation speed and 3.42x lower answer latency over the baselines but also improves reasoning quality by up to 18.75% through explicitly modeling inter-point dependencies.
☆ Policy Cards: Machine-Readable Runtime Governance for Autonomous AI Agents
Policy Cards are introduced as a machine-readable, deployment-layer standard for expressing operational, regulatory, and ethical constraints for AI agents. The Policy Card sits with the agent and enables it to follow required constraints at runtime. It tells the agent what it must and must not do. As such, it becomes an integral part of the deployed agent. Policy Cards extend existing transparency artifacts such as Model, Data, and System Cards by defining a normative layer that encodes allow/deny rules, obligations, evidentiary requirements, and crosswalk mappings to assurance frameworks including NIST AI RMF, ISO/IEC 42001, and the EU AI Act. Each Policy Card can be validated automatically, version-controlled, and linked to runtime enforcement or continuous-audit pipelines. The framework enables verifiable compliance for autonomous agents, forming a foundation for distributed assurance in multi-agent ecosystems. Policy Cards provide a practical mechanism for integrating high-level governance with hands-on engineering practice and enabling accountable autonomy at scale.
comment: First published on 19/10/2025. Canonical archived record and DOI: 10.5281/zenodo.17391796
☆ An N-of-1 Artificial Intelligence Ecosystem for Precision Medicine
Artificial intelligence in medicine is built to serve the average patient. By minimizing error across large datasets, most systems deliver strong aggregate accuracy yet falter at the margins: patients with rare variants, multimorbidity, or underrepresented demographics. This average patient fallacy erodes both equity and trust. We propose a different design: a multi-agent ecosystem for N-of-1 decision support. In this environment, agents clustered by organ systems, patient populations, and analytic modalities draw on a shared library of models and evidence synthesis tools. Their results converge in a coordination layer that weighs reliability, uncertainty, and data density before presenting the clinician with a decision-support packet: risk estimates bounded by confidence ranges, outlier flags, and linked evidence. Validation shifts from population averages to individual reliability, measured by error in low-density regions, calibration in the small, and risk--coverage trade-offs. Anticipated challenges include computational demands, automation bias, and regulatory fit, addressed through caching strategies, consensus checks, and adaptive trial frameworks. By moving from monolithic models to orchestrated intelligence, this approach seeks to align medical AI with the first principle of medicine: care that is transparent, equitable, and centered on the individual.
comment: This study has been supported by grants from the National Institutes of Health: The National Institute on Aging R01AG074372 and The National Institute of Allergy and Infectious Diseases R01AI165535
☆ Perception Learning: A Formal Separation of Sensory Representation Learning from Decision Learning
We introduce Perception Learning (PeL), a paradigm that optimizes an agent's sensory interface $f_\phi:\mathcal{X}\to\mathcal{Z}$ using task-agnostic signals, decoupled from downstream decision learning $g_\theta:\mathcal{Z}\to\mathcal{Y}$. PeL directly targets label-free perceptual properties, such as stability to nuisances, informativeness without collapse, and controlled geometry, assessed via objective representation-invariant metrics. We formalize the separation of perception and decision, define perceptual properties independent of objectives or reparameterizations, and prove that PeL updates preserving sufficient invariants are orthogonal to Bayes task-risk gradients. Additionally, we provide a suite of task-agnostic evaluation metrics to certify perceptual quality.
☆ LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability EMNLP
Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce \textbf{LongWeave}, which balances real-world and verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and constraints based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase.
comment: EMNLP Findings 2025
☆ A Unified Geometric Space Bridging AI Models and the Human Brain
For decades, neuroscientists and computer scientists have pursued a shared ambition: to understand intelligence and build it. Modern artificial neural networks now rival humans in language, perception, and reasoning, yet it is still largely unknown whether these artificial systems organize information as the brain does. Existing brain-AI alignment studies have shown the striking correspondence between the two systems, but such comparisons remain bound to specific inputs and tasks, offering no common ground for comparing how AI models with different kinds of modalities-vision, language, or multimodal-are intrinsically organized. Here we introduce a groundbreaking concept of Brain-like Space: a unified geometric space in which every AI model can be precisely situated and compared by mapping its intrinsic spatial attention topological organization onto canonical human functional brain networks, regardless of input modality, task, or sensory domain. Our extensive analysis of 151 Transformer-based models spanning state-of-the-art large vision models, large language models, and large multimodal models uncovers a continuous arc-shaped geometry within this space, reflecting a gradual increase of brain-likeness; different models exhibit distinct distribution patterns within this geometry associated with different degrees of brain-likeness, shaped not merely by their modality but by whether the pretraining paradigm emphasizes global semantic abstraction and whether the positional encoding scheme facilitates deep fusion across different modalities. Moreover, the degree of brain-likeness for a model and its downstream task performance are not "identical twins". The Brain-like Space provides the first unified framework for situating, quantifying, and comparing intelligence across domains, revealing the deep organizational principles that bridge machines and the brain.
☆ VDSAgents: A PCS-Guided Multi-Agent System for Veridical Data Science Automation
Large language models (LLMs) become increasingly integrated into data science workflows for automated system design. However, these LLM-driven data science systems rely solely on the internal reasoning of LLMs, lacking guidance from scientific and theoretical principles. This limits their trustworthiness and robustness, especially when dealing with noisy and complex real-world datasets. This paper provides VDSAgents, a multi-agent system grounded in the Predictability-Computability-Stability (PCS) principles proposed in the Veridical Data Science (VDS) framework. Guided by PCS principles, the system implements a modular workflow for data cleaning, feature engineering, modeling, and evaluation. Each phase is handled by an elegant agent, incorporating perturbation analysis, unit testing, and model validation to ensure both functionality and scientific auditability. We evaluate VDSAgents on nine datasets with diverse characteristics, comparing it with state-of-the-art end-to-end data science systems, such as AutoKaggle and DataInterpreter, using DeepSeek-V3 and GPT-4o as backends. VDSAgents consistently outperforms the results of AutoKaggle and DataInterpreter, which validates the feasibility of embedding PCS principles into LLM-driven data science automation.
comment: 29 pages, 6 figures. Yunxuan Jiang and Silan Hu contributed equally. Code available at https://github.com/fengzer/VDSAgents
☆ Generative Large Language Models (gLLMs) in Content Analysis: A Practical Guide for Communication Research
Generative Large Language Models (gLLMs), such as ChatGPT, are increasingly being used in communication research for content analysis. Studies show that gLLMs can outperform both crowd workers and trained coders, such as research assistants, on various coding tasks relevant to communication science, often at a fraction of the time and cost. Additionally, gLLMs can decode implicit meanings and contextual information, be instructed using natural language, deployed with only basic programming skills, and require little to no annotated data beyond a validation dataset - constituting a paradigm shift in automated content analysis. Despite their potential, the integration of gLLMs into the methodological toolkit of communication research remains underdeveloped. In gLLM-assisted quantitative content analysis, researchers must address at least seven critical challenges that impact result quality: (1) codebook development, (2) prompt engineering, (3) model selection, (4) parameter tuning, (5) iterative refinement, (6) validation of the model's reliability, and optionally, (7) performance enhancement. This paper synthesizes emerging research on gLLM-assisted quantitative content analysis and proposes a comprehensive best-practice guide to navigate these challenges. Our goal is to make gLLM-based content analysis more accessible to a broader range of communication researchers and ensure adherence to established disciplinary quality standards of validity, reliability, reproducibility, and research ethics.
☆ Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants
Large Language Models (LLMs) are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. Using this method, we extend an existing dataset in which QAs are parallelly aligned across multiple language varieties, making it, to our knowledge, the first of its kind. We conduct extensive experiments with both open and closed models. Our findings show that (i) models underperform on Arabic dialects, revealing persistent gaps in culturally grounded and dialect-specific knowledge; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) CoT improves judged correctness while yielding mixed n-gram-based metrics. The developed dataset will be publicly released to support further research on culturally and linguistically inclusive evaluation.
comment: Cultural Knowledge, Everyday Knowledge, Open-Ended Question, Chain-of-Thought, Large Language Models, Native, Multilingual, Language Diversity
☆ Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning
Remote sensing applications increasingly rely on deep learning for scene classification. However, their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor domains. While recent vision-language models like CLIP have shown promise by learning transferable representations at scale by aligning visual and textual modalities, their direct application to remote sensing remains suboptimal due to significant domain gaps and the need for task-specific semantic adaptation. To address this critical challenge, we systematically explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We evaluate several representative methods, including Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints. These approaches reflect complementary design philosophies: from static context optimization to conditional prompts for enhanced generalization, multi-modal prompts for joint vision-language adaptation, and semantically regularized prompts for stable learning without forgetting. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Through extensive experiments on multiple benchmark remote sensing datasets, including cross-dataset generalization tests, we demonstrate that prompt learning consistently outperforms both baselines in few-shot scenarios. Notably, Prompting with Self-Regulating Constraints achieves the most robust cross-domain performance. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery, providing a strong foundation for future research in this field.
☆ Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning
Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
comment: Preprint, 25 pages, 9 figures. Code: https://github.com/WooooDyy/Critique-RL
Transformers can do Bayesian Clustering
Bayesian clustering accounts for uncertainty but is computationally demanding at scale. Furthermore, real-world datasets often contain missing values, and simple imputation ignores the associated uncertainty, resulting in suboptimal results. We present Cluster-PFN, a Transformer-based model that extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. Trained entirely on synthetic datasets generated from a finite Gaussian Mixture Model (GMM) prior, Cluster-PFN learns to estimate the posterior distribution over both the number of clusters and the cluster assignments. Our method estimates the number of clusters more accurately than handcrafted model selection procedures such as AIC, BIC and Variational Inference (VI), and achieves clustering quality competitive with VI while being orders of magnitude faster. Cluster-PFN can be trained on complex priors that include missing data, outperforming imputation-based baselines on real-world genomic datasets, at high missingness. These results show that the Cluster-PFN can provide scalable and flexible Bayesian clustering.
☆ Retrieval and Argumentation Enhanced Multi-Agent LLMs for Judgmental Forecasting
Judgmental forecasting is the task of making predictions about future events based on human judgment. This task can be seen as a form of claim verification, where the claim corresponds to a future event and the task is to assess the plausibility of that event. In this paper, we propose a novel multi-agent framework for claim verification, whereby different agents may disagree on claim veracity and bring specific evidence for and against the claims, represented as quantitative bipolar argumentation frameworks (QBAFs). We then instantiate the framework for supporting claim verification, with a variety of agents realised with Large Language Models (LLMs): (1) ArgLLM agents, an existing approach for claim verification that generates and evaluates QBAFs; (2) RbAM agents, whereby LLM-empowered Relation-based Argument Mining (RbAM) from external sources is used to generate QBAFs; (3) RAG-ArgLLM agents, extending ArgLLM agents with a form of Retrieval-Augmented Generation (RAG) of arguments from external sources. Finally, we conduct experiments with two standard judgmental forecasting datasets, with instances of our framework with two or three agents, empowered by six different base LLMs. We observe that combining evidence from agents can improve forecasting accuracy, especially in the case of three agents, while providing an explainable combination of evidence for claim verification.
☆ Verifying Large Language Models' Reasoning Paths via Correlation Matrix Rank
Despite the strong reasoning ability of large language models~(LLMs), they are prone to errors and hallucinations. As a result, how to check their outputs effectively and efficiently has become a critical problem in their applications. Existing checking methods heavily rely on external resources, such as trained verifiers (e.g., process/outcome reward models) or elaborate prompts, which lead to high computational overhead and are only applicable to specific domains. In this paper, we investigate whether the internal behaviors of LLMs have already implied the credibility of their reasoning paths. Specifically, we find that the rank of the correlation matrix between the input problem and the output reasoning path is a robust indicator of reasoning correctness. Different from other correctness indicators for LLMs, the calculation of the correlation matrix only relies on the LLM itself, which avoids the hassle of training a separate model or designing complicated prompts. Based on it, we design a simple, plug-and-play Self-Indicator method to reweight candidate reasoning paths, which achieves significant performance improvements than other voting and verification methods with very few computational overhead. Our experiments across multiple LLMs of varying scales and model families have further shown the effectiveness of Self-Indicator. It achieves over 75% accuracy in distinguishing correct reasoning paths from incorrect ones, and, in turn, improves the accuracies on three reasoning benchmarks by more than 8%.
☆ Investigating Intra-Abstraction Policies For Non-exact Abstraction Algorithms
One weakness of Monte Carlo Tree Search (MCTS) is its sample efficiency which can be addressed by building and using state and/or action abstractions in parallel to the tree search such that information can be shared among nodes of the same layer. The primary usage of abstractions for MCTS is to enhance the Upper Confidence Bound (UCB) value during the tree policy by aggregating visits and returns of an abstract node. However, this direct usage of abstractions does not take the case into account where multiple actions with the same parent might be in the same abstract node, as these would then all have the same UCB value, thus requiring a tiebreak rule. In state-of-the-art abstraction algorithms such as pruned On the Go Abstractions (pruned OGA), this case has not been noticed, and a random tiebreak rule was implicitly chosen. In this paper, we propose and empirically evaluate several alternative intra-abstraction policies, several of which outperform the random policy across a majority of environments and parameter settings.
☆ ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7% on seven comprehensive benchmarks spanning various tasks and up to 6.0% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough to developing more autonomous and capable VLMs.
☆ MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools
Large Language Models (LLMs) increasingly rely on external tools to perform complex, realistic tasks, yet their ability to utilize the rapidly expanding Model Contextual Protocol (MCP) ecosystem remains limited. Existing MCP research covers few servers, depends on costly manual curation, and lacks training support, hindering progress toward real-world deployment. To overcome these limitations, we introduce MCP-Flow, an automated web-agent-driven pipeline for large-scale server discovery, data synthesis, and model training. MCP-Flow collects and filters data from 1166 servers and 11536 tools, producing 68733 high-quality instruction-function call pairs and 6439 trajectories, far exceeding prior work in scale and diversity. Extensive experiments demonstrate MCP-Flow's effectiveness in driving superior MCP tool selection, function-call generation, and enhanced agentic task performance. MCP-Flow thus provides a scalable foundation for advancing LLM agents' proficiency in real-world MCP environments. MCP-Flow is publicly available at \href{https://github.com/wwh0411/MCP-Flow}{https://github.com/wwh0411/MCP-Flow}.
☆ Training-free Source Attribution of AI-generated Images via Resynthesis
Synthetic image source attribution is a challenging task, especially in data scarcity conditions requiring few-shot or zero-shot classification capabilities. We present a new training-free one-shot attribution method based on image resynthesis. A prompt describing the image under analysis is generated, then it is used to resynthesize the image with all the candidate sources. The image is attributed to the model which produced the resynthesis closest to the original image in a proper feature space. We also introduce a new dataset for synthetic image attribution consisting of face images from commercial and open-source text-to-image generators. The dataset provides a challenging attribution framework, useful for developing new attribution models and testing their capabilities on different generative architectures. The dataset structure allows to test approaches based on resynthesis and to compare them to few-shot methods. Results from state-of-the-art few-shot approaches and other baselines show that the proposed resynthesis method outperforms existing techniques when only a few samples are available for training or fine-tuning. The experiments also demonstrate that the new dataset is a challenging one and represents a valuable benchmark for developing and evaluating future few-shot and zero-shot methods.
comment: 14 pages, 4 figures, 1 table, accepted at "The 17th IEEE INTERNATIONAL WORKSHOP ON INFORMATION FORENSICS AND SECURITY (WIFS2025)", Perth, Australia
Survey and Tutorial of Reinforcement Learning Methods in Process Systems Engineering
Sequential decision making under uncertainty is central to many Process Systems Engineering (PSE) challenges, where traditional methods often face limitations related to controlling and optimizing complex and stochastic systems. Reinforcement Learning (RL) offers a data-driven approach to derive control policies for such challenges. This paper presents a survey and tutorial on RL methods, tailored for the PSE community. We deliver a tutorial on RL, covering fundamental concepts and key algorithmic families including value-based, policy-based and actor-critic methods. Subsequently, we survey existing applications of these RL techniques across various PSE domains, such as in fed-batch and continuous process control, process optimization, and supply chains. We conclude with PSE focused discussion of specialized techniques and emerging directions. By synthesizing the current state of RL algorithm development and implications for PSE this work identifies successes, challenges, trends, and outlines avenues for future research at the interface of these fields.
☆ DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation NeurIPS 2025
Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
comment: Accepted to NeurIPS 2025
☆ Trajectory Design for UAV-Based Low-Altitude Wireless Networks in Unknown Environments: A Digital Twin-Assisted TD3 Approach
Unmanned aerial vehicles (UAVs) are emerging as key enablers for low-altitude wireless network (LAWN), particularly when terrestrial networks are unavailable. In such scenarios, the environmental topology is typically unknown; hence, designing efficient and safe UAV trajectories is essential yet challenging. To address this, we propose a digital twin (DT)-assisted training and deployment framework. In this framework, the UAV transmits integrated sensing and communication signals to provide communication services to ground users, while simultaneously collecting echoes that are uploaded to the DT server to progressively construct virtual environments (VEs). These VEs accelerate model training and are continuously updated with real-time UAV sensing data during deployment, supporting decision-making and enhancing flight safety. Based on this framework, we further develop a trajectory design scheme that integrates simulated annealing for efficient user scheduling with the twin-delayed deep deterministic policy gradient algorithm for continuous trajectory design, aiming to minimize mission completion time while ensuring obstacle avoidance. Simulation results demonstrate that the proposed approach achieves faster convergence, higher flight safety, and shorter mission completion time compared with baseline methods, providing a robust and efficient solution for LAWN deployment in unknown environments.
comment: 13 pages, 11 figures
☆ Enabling Near-realtime Remote Sensing via Satellite-Ground Collaboration of Large Vision-Language Models
Large vision-language models (LVLMs) have recently demonstrated great potential in remote sensing (RS) tasks (e.g., disaster monitoring) conducted by low Earth orbit (LEO) satellites. However, their deployment in real-world LEO satellite systems remains largely unexplored, hindered by limited onboard computing resources and brief satellite-ground contacts. We propose Grace, a satellite-ground collaborative system designed for near-realtime LVLM inference in RS tasks. Accordingly, we deploy compact LVLM on satellites for realtime inference, but larger ones on ground stations (GSs) to guarantee end-to-end performance. Grace is comprised of two main phases that are asynchronous satellite-GS Retrieval-Augmented Generation (RAG), and a task dispatch algorithm. Firstly, we still the knowledge archive of GS RAG to satellite archive with tailored adaptive update algorithm during limited satellite-ground data exchange period. Secondly, propose a confidence-based test algorithm that either processes the task onboard the satellite or offloads it to the GS. Extensive experiments based on real-world satellite orbital data show that Grace reduces the average latency by 76-95% compared to state-of-the-art methods, without compromising inference accuracy.
comment: 15 pages, 11 figures
☆ MAGNET: A Multi-Graph Attentional Network for Code Clone Detection
Code clone detection is a fundamental task in software engineering that underpins refactoring, debugging, plagiarism detection, and vulnerability analysis. Existing methods often rely on singular representations such as abstract syntax trees (ASTs), control flow graphs (CFGs), and data flow graphs (DFGs), which capture only partial aspects of code semantics. Hybrid approaches have emerged, but their fusion strategies are typically handcrafted and ineffective. In this study, we propose MAGNET, a multi-graph attentional framework that jointly leverages AST, CFG, and DFG representations to capture syntactic and semantic features of source code. MAGNET integrates residual graph neural networks with node-level self-attention to learn both local and long-range dependencies, introduces a gated cross-attention mechanism for fine-grained inter-graph interactions, and employs Set2Set pooling to fuse multi-graph embeddings into unified program-level representations. Extensive experiments on BigCloneBench and Google Code Jam demonstrate that MAGNET achieves state-of-the-art performance with an overall F1 score of 96.5\% and 99.2\% on the two datasets, respectively. Ablation studies confirm the critical contributions of multi-graph fusion and each attentional component. Our code is available at https://github.com/ZixianReid/Multigraph_match
☆ PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling
Reward models (RMs) are central to reinforcement learning from human feedback (RLHF), providing the critical supervision signals that align large language models (LLMs) with human preferences. While generative reward models (GRMs) offer greater interpretability than traditional scalar RMs, current training paradigms remain limited. Pair-wise methods rely on binary good-versus-bad labels, which cause mismatches for point-wise inference and necessitate complex pairing strategies for effective application in RLHF. On the other hand, point-wise methods require more elaborate absolute labeling with rubric-driven criteria, resulting in poor adaptability and high annotation costs. In this work, we propose the Preference-Aware Task-Adaptive Reward Model (PaTaRM), a unified framework that integrates a preference-aware reward (PAR) mechanism with dynamic rubric adaptation. PaTaRM leverages relative preference information from pairwise data to construct robust point-wise training signals, eliminating the need for explicit point-wise labels. Simultaneously, it employs a task-adaptive rubric system that flexibly generates evaluation criteria for both global task consistency and instance-specific fine-grained reasoning. This design enables efficient, generalizable, and interpretable reward modeling for RLHF. Extensive experiments show that PaTaRM achieves an average relative improvement of 4.7% on RewardBench and RMBench across Qwen3-8B and Qwen3-14B models. Furthermore, PaTaRM boosts downstream RLHF performance, with an average improvement of 13.6% across IFEval and InFoBench benchmarks, confirming its effectiveness and robustness. Our code is available at https://github.com/JaneEyre0530/PaTaRM.
☆ Closing Gaps: An Imputation Analysis of ICU Vital Signs
As more Intensive Care Unit (ICU) data becomes available, the interest in developing clinical prediction models to improve healthcare protocols increases. However, the lack of data quality still hinders clinical prediction using Machine Learning (ML). Many vital sign measurements, such as heart rate, contain sizeable missing segments, leaving gaps in the data that could negatively impact prediction performance. Previous works have introduced numerous time-series imputation techniques. Nevertheless, more comprehensive work is needed to compare a representative set of methods for imputing ICU vital signs and determine the best practice. In reality, ad-hoc imputation techniques that could decrease prediction accuracy, like zero imputation, are still used. In this work, we compare established imputation techniques to guide researchers in improving the performance of clinical prediction models by selecting the most accurate imputation technique. We introduce an extensible and reusable benchmark with currently 15 imputation and 4 amputation methods, created for benchmarking on major ICU datasets. We hope to provide a comparative basis and facilitate further ML development to bring more models into clinical practice.
comment: Preprint
☆ MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations
Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. Its prevalence in social media and popular culture poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection extends beyond text and requires integrating cues from audio and vision. We present MuSaG, the first German multimodal sarcasm detection dataset, consisting of 33 minutes of manually selected and human-annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, annotated separately by humans, enabling evaluation in unimodal and multimodal settings. We benchmark nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, and compare their performance to human annotations. Our results show that while humans rely heavily on audio in conversational settings, models perform best on text. This highlights a gap in current multimodal models and motivates the use of MuSaG for developing models better suited to realistic scenarios. We release MuSaG publicly to support future research on multimodal sarcasm detection and human-model alignment.
☆ SymMaP: Improving Computational Efficiency in Linear Solvers through Symbolic Preconditioning
Matrix preconditioning is a critical technique to accelerate the solution of linear systems, where performance heavily depends on the selection of preconditioning parameters. Traditional parameter selection approaches often define fixed constants for specific scenarios. However, they rely on domain expertise and fail to consider the instance-wise features for individual problems, limiting their performance. In contrast, machine learning (ML) approaches, though promising, are hindered by high inference costs and limited interpretability. To combine the strengths of both approaches, we propose a symbolic discovery framework-namely, Symbolic Matrix Preconditioning (SymMaP)-to learn efficient symbolic expressions for preconditioning parameters. Specifically, we employ a neural network to search the high-dimensional discrete space for expressions that can accurately predict the optimal parameters. The learned expression allows for high inference efficiency and excellent interpretability (expressed in concise symbolic formulas), making it simple and reliable for deployment. Experimental results show that SymMaP consistently outperforms traditional strategies across various benchmarks.
☆ MGA: Memory-Driven GUI Agent for Observation-Centric Interaction WWW2025
The rapid progress of Large Language Models (LLMs) and their multimodal extensions (MLLMs) has enabled agentic systems capable of perceiving and acting across diverse environments. A challenging yet impactful frontier is the development of GUI agents, which must navigate complex desktop and web interfaces while maintaining robustness and generalization. Existing paradigms typically model tasks as long-chain executions, concatenating historical trajectories into the context. While approaches such as Mirage and GTA1 refine planning or introduce multi-branch action selection, they remain constrained by two persistent issues: Dependence on historical trajectories, which amplifies error propagation. And Local exploration bias, where "decision-first, observation-later" mechanisms overlook critical interface cues. We introduce the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of observe first, then decide. MGA models each step as an independent, context-rich environment state represented by a triad: current screenshot, task-agnostic spatial information, and a dynamically updated structured memory. Experiments on OSworld benchmarks, real desktop applications (Chrome, VSCode, VLC), and cross-task transfer demonstrate that MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines. The code is publicly available at: {https://anonymous.4open.science/r/MGA-3571}.
comment: Submitted to WWW2025
☆ UniPlanner: A Unified Motion Planning Framework for Autonomous Vehicle Decision-Making Systems via Multi-Dataset Integration
Motion planning is a critical component of autonomous vehicle decision-making systems, directly determining trajectory safety and driving efficiency. While deep learning approaches have advanced planning capabilities, existing methods remain confined to single-dataset training, limiting their robustness in planning. Through systematic analysis, we discover that vehicular trajectory distributions and history-future correlations demonstrate remarkable consistency across different datasets. Based on these findings, we propose UniPlanner, the first planning framework designed for multi-dataset integration in autonomous vehicle decision-making. UniPlanner achieves unified cross-dataset learning through three synergistic innovations. First, the History-Future Trajectory Dictionary Network (HFTDN) aggregates history-future trajectory pairs from multiple datasets, using historical trajectory similarity to retrieve relevant futures and generate cross-dataset planning guidance. Second, the Gradient-Free Trajectory Mapper (GFTM) learns robust history-future correlations from multiple datasets, transforming historical trajectories into universal planning priors. Its gradient-free design ensures the introduction of valuable priors while preventing shortcut learning, making the planning knowledge safely transferable. Third, the Sparse-to-Dense (S2D) paradigm implements adaptive dropout to selectively suppress planning priors during training for robust learning, while enabling full prior utilization during inference to maximize planning performance.
☆ BLM$_1$: A Boundless Large Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning
Multimodal large language models (MLLMs) have advanced vision-language reasoning and are increasingly deployed in embodied agents. However, significant limitations remain: MLLMs generalize poorly across digital-physical spaces and embodiments; vision-language-action models (VLAs) produce low-level actions yet lack robust high-level embodied reasoning; and most embodied large language models (ELLMs) are constrained to digital-space with poor generalization to the physical world. Thus, unified models that operate seamlessly across digital and physical spaces while generalizing across embodiments and tasks remain absent. We introduce the \textbf{Boundless Large Model (BLM$_1$)}, a multimodal spatial foundation model that preserves instruction following and reasoning, incorporates embodied knowledge, and supports robust cross-embodiment control. BLM$_1$ integrates three key capabilities -- \textit{cross-space transfer, cross-task learning, and cross-embodiment generalization} -- via a two-stage training paradigm. Stage I injects embodied knowledge into the MLLM through curated digital corpora while maintaining language competence. Stage II trains a policy module through an intent-bridging interface that extracts high-level semantics from the MLLM to guide control, without fine-tuning the MLLM backbone. This process is supported by a self-collected cross-embodiment demonstration suite spanning four robot embodiments and six progressively challenging tasks. Evaluations across digital and physical benchmarks show that a single BLM$_1$ instance outperforms four model families -- MLLMs, ELLMs, VLAs, and GMLMs -- achieving $\sim\!\textbf{6%}$ gains in digital tasks and $\sim\!\textbf{3%}$ in physical tasks.
Self-supervised Synthetic Pretraining for Inference of Stellar Mass Embedded in Dense Gas NeurIPS 2025
Stellar mass is a fundamental quantity that determines the properties and evolution of stars. However, estimating stellar masses in star-forming regions is challenging because young stars are obscured by dense gas and the regions are highly inhomogeneous, making spherical dynamical estimates unreliable. Supervised machine learning could link such complex structures to stellar mass, but it requires large, high-quality labeled datasets from high-resolution magneto-hydrodynamical (MHD) simulations, which are computationally expensive. We address this by pretraining a vision transformer on one million synthetic fractal images using the self-supervised framework DINOv2, and then applying the frozen model to limited high-resolution MHD simulations. Our results demonstrate that synthetic pretraining improves frozen-feature regression stellar mass predictions, with the pretrained model performing slightly better than a supervised model trained on the same limited simulations. Principal component analysis of the extracted features further reveals semantically meaningful structures, suggesting that the model enables unsupervised segmentation of star-forming regions without the need for labeled data or fine-tuning.
comment: 6 pages, 3 figures, 1 table, accepted for NeurIPS 2025 ML4PS workshop
☆ Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning IROS 2025
This technical report presents our solution for the RoboSense Challenge at IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components. First, a Mixture-of-Prompts router classifies questions and dispatches them to task-specific expert prompts, eliminating interference across diverse question types. Second, task-specific prompts embed explicit coordinate systems, spatial reasoning rules, role-playing, Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples tailored to each task. Third, a visual assembly module composes multi-view images with object crops, magenta markers, and adaptive historical frames based on question requirements. Fourth, we configure model inference parameters (temperature, top-p, message roles) per task to optimize output quality. Implemented on Qwen2.5-VL-72B, our approach achieves 70.87% average accuracy on Phase-1 (clean data) and 72.85% on Phase-2 (corrupted data), demonstrating that structured prompting and spatial grounding substantially enhance VLM performance on safety-critical autonomous driving tasks. Code and prompt are available at https://github.com/wuaodi/UCAS-CSU-phase2.
comment: RoboSense Challenge with IROS 2025
☆ BMGQ: A Bottom-up Method for Generating Complex Multi-hop Reasoning Questions from Semi-structured Data
Building training-ready multi-hop question answering (QA) datasets that truly stress a model's retrieval and reasoning abilities remains highly challenging recently. While there have been a few recent evaluation datasets that capture the characteristics of hard-to-search but easy-to-verify problems -- requiring the integration of ambiguous, indirect, and cross-domain cues -- these data resources remain scarce and are mostly designed for evaluation, making them unsuitable for supervised fine-tuning (SFT) or reinforcement learning (RL). Meanwhile, manually curating non-trivially retrievable questions -- where answers cannot be found through a single direct query but instead require multi-hop reasoning over oblique and loosely connected evidence -- incurs prohibitive human costs and fails to scale, creating a critical data bottleneck for training high-capability retrieval-and-reasoning agents. To address this, we present an automated framework for generating high-difficulty, training-ready multi-hop questions from semi-structured knowledge sources. The system (i) grows diverse, logically labeled evidence clusters through Natural Language Inference (NLI)-based relation typing and diversity-aware expansion; (ii) applies reverse question construction to compose oblique cues so that isolated signals are underinformative but their combination uniquely identifies the target entity; and (iii) enforces quality with a two-step evaluation pipeline that combines multi-model consensus filtering with structured constraint decomposition and evidence-based matching. The result is a scalable process that yields complex, retrieval-resistant yet verifiable questions suitable for SFT/RL training as well as challenging evaluation, substantially reducing human curation effort while preserving the difficulty profile of strong evaluation benchmarks.
☆ Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean ACL
We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models -- two multilingual and two Korean-specialized -- show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.
comment: submitted to ACL ARR Rolling Review
☆ From Observability Data to Diagnosis: An Evolving Multi-agent System for Incident Management in Cloud Systems
Incident management (IM) is central to the reliability of large-scale cloud systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world cloud systems.
☆ Beyond Line-Level Filtering for the Pretraining Corpora of LLMs ACL
While traditional line-level filtering techniques, such as line-level deduplication and trailing-punctuation filters, are commonly used, these basic methods can sometimes discard valuable content, negatively affecting downstream performance. In this paper, we introduce two methods-pattern-aware line-level deduplication (PLD) and pattern-aware trailing punctuation filtering (PTF)-by enhancing the conventional filtering techniques. Our approach not only considers line-level signals but also takes into account their sequential distribution across documents, enabling us to retain structurally important content that might otherwise be removed. We evaluate these proposed methods by training small language models (1 B parameters) in both English and Korean. The results demonstrate that our methods consistently improve performance on multiple-choice benchmarks and significantly enhance generative question-answering accuracy on both SQuAD v1 and KorQuAD v1.
comment: submitted to ACL ARR Rolling Review
☆ VC4VG: Optimizing Video Captions for Text-to-Video Generation EMNLP 2025
Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models.We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements.Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/qyr0403/VC4VG to support further research.
comment: Accepted by EMNLP 2025
☆ Compositional Image Synthesis with Inference-Time Scaling
Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where a object-centric vision-language model (VLM) judge reranks multiple candidates to select the most prompt-aligned outcome iteratively. By unifying explicit layout-grounding with self-refine-based inference-time scaling, our framework achieves stronger scene alignment with prompts compared to recent text-to-image models. The code are available at https://github.com/gcl-inha/ReFocus.
comment: projcet page: https://github.com/gcl-inha/ReFocus
☆ LagMemo: Language 3D Gaussian Splatting Memory for Multi-modal Open-vocabulary Multi-goal Visual Navigation
Navigating to a designated goal using visual information is a fundamental capability for intelligent robots. Most classical visual navigation methods are restricted to single-goal, single-modality, and closed set goal settings. To address the practical demands of multi-modal, open-vocabulary goal queries and multi-goal visual navigation, we propose LagMemo, a navigation system that leverages a language 3D Gaussian Splatting memory. During exploration, LagMemo constructs a unified 3D language memory. With incoming task goals, the system queries the memory, predicts candidate goal locations, and integrates a local perception-based verification mechanism to dynamically match and validate goals during navigation. For fair and rigorous evaluation, we curate GOAT-Core, a high-quality core split distilled from GOAT-Bench tailored to multi-modal open-vocabulary multi-goal visual navigation. Experimental results show that LagMemo's memory module enables effective multi-modal open-vocabulary goal localization, and that LagMemo outperforms state-of-the-art methods in multi-goal visual navigation. Project page: https://weekgoodday.github.io/lagmemo
☆ HistoLens: An Interactive XAI Toolkit for Verifying and Mitigating Flaws in Vision-Language Models for Histopathology
For doctors to truly trust artificial intelligence, it can't be a black box. They need to understand its reasoning, almost as if they were consulting a colleague. We created HistoLens1 to be that transparent, collaborative partner. It allows a pathologist to simply ask a question in plain English about a tissue slide--just as they would ask a trainee. Our system intelligently translates this question into a precise query for its AI engine, which then provides a clear, structured report. But it doesn't stop there. If a doctor ever asks, "Why?", HistoLens can instantly provide a 'visual proof' for any finding--a heatmap that points to the exact cells and regions the AI used for its analysis. We've also ensured the AI focuses only on the patient's tissue, just like a trained pathologist would, by teaching it to ignore distracting background noise. The result is a workflow where the pathologist remains the expert in charge, using a trustworthy AI assistant to verify their insights and make faster, more confident diagnoses.
☆ Taming the Tail: NoI Topology Synthesis for Mixed DL Workloads on Chiplet-Based Accelerators
Heterogeneous chiplet-based systems improve scaling by disag-gregating CPUs/GPUs and emerging technologies (HBM/DRAM).However this on-package disaggregation introduces a latency inNetwork-on-Interposer(NoI). We observe that in modern large-modelinference, parameters and activations routinely move backand forth from HBM/DRAM, injecting large, bursty flows into theinterposer. These memory-driven transfers inflate tail latency andviolate Service Level Agreements (SLAs) across k-ary n-cube base-line NoI topologies. To address this gap we introduce an InterferenceScore (IS) that quantifies worst-case slowdown under contention.We then formulate NoI synthesis as a multi-objective optimization(MOO) problem. We develop PARL (Partition-Aware ReinforcementLearner), a topology generator that balances throughput, latency,and power. PARL-generated topologies reduce contention at the memory cut, meet SLAs, and cut worst-case slowdown to 1.2 times while maintaining competitive mean throughput relative to link-rich meshes. Overall, this reframes NoI design for heterogeneouschiplet accelerators with workload-aware objectives.
☆ Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation NeurIPS 2025
We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual-role alignment mechanism where the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. These results highlight model-guided dual-role alignment as a powerful and scalable paradigm for conditional video-to-audio generation. Code is available at: https://github.com/pantheon5100/mgaudio
comment: accepted by NeurIPS 2025
☆ Learning Parameterized Skills from Demonstrations
We present DEPS, an end-to-end algorithm for discovering parameterized skills from expert demonstrations. Our method learns parameterized skill policies jointly with a meta-policy that selects the appropriate discrete skill and continuous parameters at each timestep. Using a combination of temporal variational inference and information-theoretic regularization methods, we address the challenge of degeneracy common in latent variable models, ensuring that the learned skills are temporally extended, semantically meaningful, and adaptable. We empirically show that learning parameterized skills from multitask expert demonstrations significantly improves generalization to unseen tasks. Our method outperforms multitask as well as skill learning baselines on both LIBERO and MetaWorld benchmarks. We also demonstrate that DEPS discovers interpretable parameterized skills, such as an object grasping skill whose continuous arguments define the grasp location.
comment: Neurips 2025
☆ Modeling Electric Vehicle Car-Following Behavior: Classical vs Machine Learning Approach
The increasing adoption of electric vehicles (EVs) necessitates an understanding of their driving behavior to enhance traffic safety and develop smart driving systems. This study compares classical and machine learning models for EV car following behavior. Classical models include the Intelligent Driver Model (IDM), Optimum Velocity Model (OVM), Optimal Velocity Relative Velocity (OVRV), and a simplified CACC model, while the machine learning approach employs a Random Forest Regressor. Using a real world dataset of an EV following an internal combustion engine (ICE) vehicle under varied driving conditions, we calibrated classical model parameters by minimizing the RMSE between predictions and real data. The Random Forest model predicts acceleration using spacing, speed, and gap type as inputs. Results demonstrate the Random Forest's superior accuracy, achieving RMSEs of 0.0046 (medium gap), 0.0016 (long gap), and 0.0025 (extra long gap). Among physics based models, CACC performed best, with an RMSE of 2.67 for long gaps. These findings highlight the machine learning model's performance across all scenarios. Such models are valuable for simulating EV behavior and analyzing mixed autonomy traffic dynamics in EV integrated environments.
☆ Covert Surveillance in Smart Devices: A SCOUR Framework Analysis of Youth Privacy Implications
This paper investigates how smart devices covertly capture private conversations and discusses in more in-depth the implications of this for youth privacy. Using a structured review guided by the PRISMA methodology, the analysis focuses on privacy concerns, data capture methods, data storage and sharing practices, and proposed technical mitigations. To structure and synthesize findings, we introduce the SCOUR framework, encompassing Surveillance mechanisms, Consent and awareness, Operational data flow, Usage and exploitation, and Regulatory and technical safeguards. Findings reveal that smart devices have been covertly capturing personal data, especially with smart toys and voice-activated smart gadgets built for youth. These issues are worsened by unclear data collection practices and insufficient transparency in smart device applications. Balancing privacy and utility in smart devices is crucial, as youth are becoming more aware of privacy breaches and value their personal data more. Strategies to improve regulatory and technical safeguards are also provided. The review identifies research gaps and suggests future directions. The limitations of this literature review are also explained. The findings have significant implications for policy development and the transparency of data collection for smart devices.
comment: To appear in the IEEE UEMCON 2025 proceedings
☆ FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic NeurIPS 2025
Low-bit floating-point (FP) formats, such as FP8, provide significant acceleration and memory savings in model training thanks to native hardware support on modern GPUs and NPUs. However, we analyze that FP8 quantization offers speedup primarily for large-dimensional matrix multiplications, while inherent quantization overheads diminish speedup when applied to low-rank adaptation (LoRA), which uses small-dimensional matrices for efficient fine-tuning of large language models (LLMs). To address this limitation, we propose FALQON, a novel framework that eliminates the quantization overhead from separate LoRA computational paths by directly merging LoRA adapters into an FP8-quantized backbone during fine-tuning. Furthermore, we reformulate the forward and backward computations for merged adapters to significantly reduce quantization overhead, and introduce a row-wise proxy update mechanism that efficiently integrates substantial updates into the quantized backbone. Experimental evaluations demonstrate that FALQON achieves approximately a 3$\times$ training speedup over existing quantized LoRA methods with a similar level of accuracy, providing a practical solution for efficient large-scale model fine-tuning. Moreover, FALQON's end-to-end FP8 workflow removes the need for post-training quantization, facilitating efficient deployment. Code is available at https://github.com/iamkanghyunchoi/falqon.
comment: NeurIPS 2025
☆ PULSE: Privileged Knowledge Transfer from Electrodermal Activity to Low-Cost Sensors for Stress Monitoring ML4H 2025
Electrodermal activity (EDA), the primary signal for stress detection, requires costly hardware often unavailable in real-world wearables. In this paper, we propose PULSE, a framework that utilizes EDA exclusively during self-supervised pretraining, while enabling inference without EDA but with more readily available modalities such as ECG, BVP, ACC, and TEMP. Our approach separates encoder outputs into shared and private embeddings. We align shared embeddings across modalities and fuse them into a modality-invariant representation. The private embeddings carry modality-specific information to support the reconstruction objective. Pretraining is followed by knowledge transfer where a frozen EDA teacher transfers sympathetic-arousal representations into student encoders. On WESAD, our method achieves strong stress-detection performance, showing that representations of privileged EDA can be transferred to low-cost sensors to improve accuracy while reducing hardware cost.
comment: Accepted as a finders paper at ML4H 2025
☆ SynAD: Enhancing Real-World End-to-End Autonomous Driving Models through Synthetic Data Integration
Recent advancements in deep learning and the availability of high-quality real-world driving datasets have propelled end-to-end autonomous driving. Despite this progress, relying solely on real-world data limits the variety of driving scenarios for training. Synthetic scenario generation has emerged as a promising solution to enrich the diversity of training data; however, its application within E2E AD models remains largely unexplored. This is primarily due to the absence of a designated ego vehicle and the associated sensor inputs, such as camera or LiDAR, typically provided in real-world scenarios. To address this gap, we introduce SynAD, the first framework designed to enhance real-world E2E AD models using synthetic data. Our method designates the agent with the most comprehensive driving information as the ego vehicle in a multi-agent synthetic scenario. We further project path-level scenarios onto maps and employ a newly developed Map-to-BEV Network to derive bird's-eye-view features without relying on sensor inputs. Finally, we devise a training strategy that effectively integrates these map-based synthetic data with real driving data. Experimental results demonstrate that SynAD effectively integrates all components and notably enhances safety performance. By bridging synthetic scenario generation and E2E AD, SynAD paves the way for more comprehensive and robust autonomous driving models.
☆ Learning from History: A Retrieval-Augmented Framework for Spatiotemporal Prediction
Accurate and long-term spatiotemporal prediction for complex physical systems remains a fundamental challenge in scientific computing. While deep learning models, as powerful parametric approximators, have shown remarkable success, they suffer from a critical limitation: the accumulation of errors during long-term autoregressive rollouts often leads to physically implausible artifacts. This deficiency arises from their purely parametric nature, which struggles to capture the full constraints of a system's intrinsic dynamics. To address this, we introduce a novel \textbf{Retrieval-Augmented Prediction (RAP)} framework, a hybrid paradigm that synergizes the predictive power of deep networks with the grounded truth of historical data. The core philosophy of RAP is to leverage historical evolutionary exemplars as a non-parametric estimate of the system's local dynamics. For any given state, RAP efficiently retrieves the most similar historical analog from a large-scale database. The true future evolution of this analog then serves as a \textbf{reference target}. Critically, this target is not a hard constraint in the loss function but rather a powerful conditional input to a specialized dual-stream architecture. It provides strong \textbf{dynamic guidance}, steering the model's predictions towards physically viable trajectories. In extensive benchmarks across meteorology, turbulence, and fire simulation, RAP not only surpasses state-of-the-art methods but also significantly outperforms a strong \textbf{analog-only forecasting baseline}. More importantly, RAP generates predictions that are more physically realistic by effectively suppressing error divergence in long-term rollouts.
☆ Causal-Aware Generative Adversarial Networks with Reinforcement Learning
The utility of tabular data for tasks ranging from model training to large-scale data analysis is often constrained by privacy concerns or regulatory hurdles. While existing data generation methods, particularly those based on Generative Adversarial Networks (GANs), have shown promise, they frequently struggle with capturing complex causal relationship, maintaining data utility, and providing provable privacy guarantees suitable for enterprise deployment. We introduce CA-GAN, a novel generative framework specifically engineered to address these challenges for real-world tabular datasets. CA-GAN utilizes a two-step approach: causal graph extraction to learn a robust, comprehensive causal relationship in the data's manifold, followed by a custom Conditional WGAN-GP (Wasserstein GAN with Gradient Penalty) that operates exclusively as per the structure of nodes in the causal graph. More importantly, the generator is trained with a new Reinforcement Learning-based objective that aligns the causal graphs constructed from real and fake data, ensuring the causal awareness in both training and sampling phases. We demonstrate CA-GAN superiority over six SOTA methods across 14 tabular datasets. Our evaluations, focused on core data engineering metrics: causal preservation, utility preservation, and privacy preservation. Our method offers a practical, high-performance solution for data engineers seeking to create high-quality, privacy-compliant synthetic datasets to benchmark database systems, accelerate software development, and facilitate secure data-driven research.
☆ Geometric Algorithms for Neural Combinatorial Optimization with Constraints
Self-Supervised Learning (SSL) for Combinatorial Optimization (CO) is an emerging paradigm for solving combinatorial problems using neural networks. In this paper, we address a central challenge of SSL for CO: solving problems with discrete constraints. We design an end-to-end differentiable framework that enables us to solve discrete constrained optimization problems with neural networks. Concretely, we leverage algorithmic techniques from the literature on convex geometry and Carath\'eodory's theorem to decompose neural network outputs into convex combinations of polytope corners that correspond to feasible sets. This decomposition-based approach enables self-supervised training but also ensures efficient quality-preserving rounding of the neural net output into feasible solutions. Extensive experiments in cardinality-constrained optimization show that our approach can consistently outperform neural baselines. We further provide worked-out examples of how our method can be applied beyond cardinality-constrained problems to a diverse set of combinatorial optimization tasks, including finding independent sets in graphs, and solving matroid-constrained problems.
☆ ResNet: Enabling Deep Convolutional Neural Networks through Residual Learning
Convolutional Neural Networks (CNNs) has revolutionized computer vision, but training very deep networks has been challenging due to the vanishing gradient problem. This paper explores Residual Networks (ResNet), introduced by He et al. (2015), which overcomes this limitation by using skip connections. ResNet enables the training of networks with hundreds of layers by allowing gradients to flow directly through shortcut connections that bypass intermediate layers. In our implementation on the CIFAR-10 dataset, ResNet-18 achieves 89.9% accuracy compared to 84.1% for a traditional deep CNN of similar depth, while also converging faster and training more stably.
comment: 3 pages, 5 figures, 1 table
☆ LLMLogAnalyzer: A Clustering-Based Log Analysis Chatbot using Large Language Models
System logs are a cornerstone of cybersecurity, supporting proactive breach prevention and post-incident investigations. However, analyzing vast amounts of diverse log data remains significantly challenging, as high costs, lack of in-house expertise, and time constraints make even basic analysis difficult for many organizations. This study introduces LLMLogAnalyzer, a clustering-based log analysis chatbot that leverages Large Language Models (LLMs) and Machine Learning (ML) algorithms to simplify and streamline log analysis processes. This innovative approach addresses key LLM limitations, including context window constraints and poor structured text handling capabilities, enabling more effective summarization, pattern extraction, and anomaly detection tasks. LLMLogAnalyzer is evaluated across four distinct domain logs and various tasks. Results demonstrate significant performance improvements over state-of-the-art LLM-based chatbots, including ChatGPT, ChatPDF, and NotebookLM, with consistent gains ranging from 39% to 68% across different tasks. The system also exhibits strong robustness, achieving a 93% reduction in interquartile range (IQR) when using ROUGE-1 scores, indicating significantly lower result variability. The framework's effectiveness stems from its modular architecture comprising a router, log recognizer, log parser, and search tools. This design enhances LLM capabilities for structured text analysis while improving accuracy and robustness, making it a valuable resource for both cybersecurity experts and non-technical users.
comment: 33 pages, 10 figures
☆ Improved Accuracy of Robot Localization Using 3-D LiDAR in a Hippocampus-Inspired Model
Boundary Vector Cells (BVCs) are a class of neurons in the brains of vertebrates that encode environmental boundaries at specific distances and allocentric directions, playing a central role in forming place fields in the hippocampus. Most computational BVC models are restricted to two-dimensional (2D) environments, making them prone to spatial ambiguities in the presence of horizontal symmetries in the environment. To address this limitation, we incorporate vertical angular sensitivity into the BVC framework, thereby enabling robust boundary detection in three dimensions, and leading to significantly more accurate spatial localization in a biologically-inspired robot model. The proposed model processes LiDAR data to capture vertical contours, thereby disambiguating locations that would be indistinguishable under a purely 2D representation. Experimental results show that in environments with minimal vertical variation, the proposed 3D model matches the performance of a 2D baseline; yet, as 3D complexity increases, it yields substantially more distinct place fields and markedly reduces spatial aliasing. These findings show that adding a vertical dimension to BVC-based localization can significantly enhance navigation and mapping in real-world 3D spaces while retaining performance parity in simpler, near-planar scenarios.
comment: 8 pages, 9 figures, Presented at the 2025 International Joint Conference on Neural Networks, Rome, July 2025
☆ OneCast: Structured Decomposition and Modular Generation for Cross-Domain Time Series Forecasting
Cross-domain time series forecasting is a valuable task in various web applications. Despite its rapid advancement, achieving effective generalization across heterogeneous time series data remains a significant challenge. Existing methods have made progress by extending single-domain models, yet often fall short when facing domain-specific trend shifts and inconsistent periodic patterns. We argue that a key limitation lies in treating temporal series as undifferentiated sequence, without explicitly decoupling their inherent structural components. To address this, we propose OneCast, a structured and modular forecasting framework that decomposes time series into seasonal and trend components, each modeled through tailored generative pathways. Specifically, the seasonal component is captured by a lightweight projection module that reconstructs periodic patterns via interpretable basis functions. In parallel, the trend component is encoded into discrete tokens at segment level via a semantic-aware tokenizer, and subsequently inferred through a masked discrete diffusion mechanism. The outputs from both branches are combined to produce a final forecast that captures seasonal patterns while tracking domain-specific trends. Extensive experiments across eight domains demonstrate that OneCast mostly outperforms state-of-the-art baselines.
☆ Spatio-temporal Multivariate Time Series Forecast with Chosen Variables
Spatio-Temporal Multivariate time series Forecast (STMF) uses the time series of $n$ spatially distributed variables in a period of recent past to forecast their values in a period of near future. It has important applications in spatio-temporal sensing forecast such as road traffic prediction and air pollution prediction. Recent papers have addressed a practical problem of missing variables in the model input, which arises in the sensing applications where the number $m$ of sensors is far less than the number $n$ of locations to be monitored, due to budget constraints. We observe that the state of the art assumes that the $m$ variables (i.e., locations with sensors) in the model input are pre-determined and the important problem of how to choose the $m$ variables in the input has never been studied. This paper fills the gap by studying a new problem of STMF with chosen variables, which optimally selects $m$-out-of-$n$ variables for the model input in order to maximize the forecast accuracy. We propose a unified framework that jointly performs variable selection and model optimization for both forecast accuracy and model efficiency. It consists of three novel technical components: (1) masked variable-parameter pruning, which progressively prunes less informative variables and attention parameters through quantile-based masking; (2) prioritized variable-parameter replay, which replays low-loss past samples to preserve learned knowledge for model stability; (3) dynamic extrapolation mechanism, which propagates information from variables selected for the input to all other variables via learnable spatial embeddings and adjacency information. Experiments on five real-world datasets show that our work significantly outperforms the state-of-the-art baselines in both accuracy and efficiency, demonstrating the effectiveness of joint variable selection and model optimization.
comment: In submission
☆ NeuroPathNet: Dynamic Path Trajectory Learning for Brain Functional Connectivity Analysis
Understanding the evolution of brain functional networks over time is of great significance for the analysis of cognitive mechanisms and the diagnosis of neurological diseases. Existing methods often have difficulty in capturing the temporal evolution characteristics of connections between specific functional communities. To this end, this paper proposes a new path-level trajectory modeling framework (NeuroPathNet) to characterize the dynamic behavior of connection pathways between brain functional partitions. Based on medically supported static partitioning schemes (such as Yeo and Smith ICA), we extract the time series of connection strengths between each pair of functional partitions and model them using a temporal neural network. We validate the model performance on three public functional Magnetic Resonance Imaging (fMRI) datasets, and the results show that it outperforms existing mainstream methods in multiple indicators. This study can promote the development of dynamic graph learning methods for brain network analysis, and provide possible clinical applications for the diagnosis of neurological diseases.
☆ SpecKD: Speculative Decoding for Effective Knowledge Distillation of LLMs
Knowledge Distillation (KD) has become a cornerstone technique for compressing Large Language Models (LLMs) into smaller, more efficient student models. However, conventional KD approaches typically apply the distillation loss uniformly across all tokens, regardless of the teacher's confidence. This indiscriminate mimicry can introduce noise, as the student is forced to learn from the teacher's uncertain or high-entropy predictions, which may ultimately harm student performance-especially when the teacher is much larger and more powerful. To address this, we propose Speculative Knowledge Distillation (SpecKD), a novel, plug-and-play framework that introduces a dynamic, token-level gating mechanism inspired by the "propose-and-verify" paradigm of speculative decoding. At each step, the student's token proposal is verified against the teacher's distribution; the distillation loss is selectively applied only to "accepted" tokens, while "rejected" tokens are masked out. Extensive experiments on diverse text generation tasks show that SpecKD consistently and significantly outperforms strong KD baselines, leading to more stable training and more capable student models, and achieving state-of-the-art results.
☆ Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward
Mitigating hallucinations in Large Language Models (LLMs) is critical for their reliable deployment. Existing methods typically fine-tune LLMs to abstain from answering questions beyond their knowledge scope. However, these methods often rely on coarse-grained signals to guide LLMs to abstain, such as overall confidence or uncertainty scores on multiple sampled answers, which may result in an imprecise awareness of the model's own knowledge boundaries. To this end, we propose a novel reinforcement learning framework built on $\textbf{\underline{Fi}ne-grained \underline{S}emantic \underline{Co}nfidence \underline{Re}ward (\Ours)}$, which guides LLMs to abstain via sample-specific confidence. Specifically, our method operates by sampling multiple candidate answers and conducting semantic clustering, then training the LLM to retain answers within high-confidence clusters and discard those within low-confidence ones, thereby promoting accurate post-hoc abstention. Additionally, we propose a new metric for evaluating the reliability of abstention fine-tuning tasks more comprehensively. Our method significantly enhances reliability in both in-domain and out-of-distribution benchmarks.
comment: 23pages, 4figures
☆ Lifecycle-Aware code generation: Leveraging Software Engineering Phases in LLMs
Recent progress in large language models (LLMs) has advanced automatic code generation, yet most approaches rely on direct, single-step translation from problem descriptions to code, disregarding structured software engineering practices. We introduce a lifecycle-aware framework that systematically incorporates intermediate artifacts such as requirements analysis, state machine modeling, and pseudocode into both the training and inference stages. This design aligns code generation with standard software development phases and enables more structured reasoning. Experiments show that lifecycle-level fine-tuning improves code correctness by up to 75% over the same model before fine-tuning, with performance gains compounding across intermediate stages. Multi-step inference consistently surpasses single-step generation, demonstrating the effectiveness of intermediate scaffolding. Notably, open-source LLMs, once fine-tuned under our framework, match or slightly outperform models pretrained on code. When applied to DeepSeek-Coder-1.3B, our framework yields relative CodeBLEU improvements of 34.3%, 20.0%, 11.2%, and 22.3% over ChatGPT-3.5, ChatGPT-4o-mini, DeepSeek-R1, and LLaMA-8B, respectively. Our pipeline also proves robust with up to 80\% less training data, confirming its resilience. Ablation studies further reveal that each intermediate artifact contributes distinctly to final code quality, with state machine modeling yielding the most substantial impact. Our source code and detailed experimental data are available at https://anonymous.4open.science/r/Lifecycle-Aware-3CCB.
☆ Discovering Heuristics with Large Language Models (LLMs) for Mixed-Integer Programs: Single-Machine Scheduling
Our study contributes to the scheduling and combinatorial optimization literature with new heuristics discovered by leveraging the power of Large Language Models (LLMs). We focus on the single-machine total tardiness (SMTT) problem, which aims to minimize total tardiness by sequencing n jobs on a single processor without preemption, given processing times and due dates. We develop and benchmark two novel LLM-discovered heuristics, the EDD Challenger (EDDC) and MDD Challenger (MDDC), inspired by the well-known Earliest Due Date (EDD) and Modified Due Date (MDD) rules. In contrast to prior studies that employed simpler rule-based heuristics, we evaluate our LLM-discovered algorithms using rigorous criteria, including optimality gaps and solution time derived from a mixed-integer programming (MIP) formulation of SMTT. We compare their performance against state-of-the-art heuristics and exact methods across various job sizes (20, 100, 200, and 500 jobs). For instances with more than 100 jobs, exact methods such as MIP and dynamic programming become computationally intractable. Up to 500 jobs, EDDC improves upon the classic EDD rule and another widely used algorithm in the literature. MDDC consistently outperforms traditional heuristics and remains competitive with exact approaches, particularly on larger and more complex instances. This study shows that human-LLM collaboration can produce scalable, high-performing heuristics for NP-hard constrained combinatorial optimization, even under limited resources when effectively configured.
☆ Training-Free Safe Text Embedding Guidance for Text-to-Image Diffusion Models NeurIPS 2025
Text-to-image models have recently made significant advances in generating realistic and semantically coherent images, driven by advanced diffusion models and large-scale web-crawled datasets. However, these datasets often contain inappropriate or biased content, raising concerns about the generation of harmful outputs when provided with malicious text prompts. We propose Safe Text embedding Guidance (STG), a training-free approach to improve the safety of diffusion models by guiding the text embeddings during sampling. STG adjusts the text embeddings based on a safety function evaluated on the expected final denoised image, allowing the model to generate safer outputs without additional training. Theoretically, we show that STG aligns the underlying model distribution with safety constraints, thereby achieving safer outputs while minimally affecting generation quality. Experiments on various safety scenarios, including nudity, violence, and artist-style removal, show that STG consistently outperforms both training-based and training-free baselines in removing unsafe content while preserving the core semantic intent of input prompts. Our code is available at https://github.com/aailab-kaist/STG.
comment: Accepted at NeurIPS 2025
☆ Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks NeurIPS 2025
Foundation models have enabled rapid progress across many specialized domains by leveraging large-scale pre-training on unlabeled data, demonstrating strong generalization to a variety of downstream tasks. While such models have gained significant attention in fields like Earth Observation, their application to Mars science remains limited. A key enabler of progress in other domains has been the availability of standardized benchmarks that support systematic evaluation. In contrast, Mars science lacks such benchmarks and standardized evaluation frameworks, which have limited progress toward developing foundation models for Martian tasks. To address this gap, we introduce Mars-Bench, the first benchmark designed to systematically evaluate models across a broad range of Mars-related tasks using both orbital and surface imagery. Mars-Bench comprises 20 datasets spanning classification, segmentation, and object detection, focused on key geologic features such as craters, cones, boulders, and frost. We provide standardized, ready-to-use datasets and baseline evaluations using models pre-trained on natural images, Earth satellite data, and state-of-the-art vision-language models. Results from all analyses suggest that Mars-specific foundation models may offer advantages over general-domain counterparts, motivating further exploration of domain-adapted pre-training. Mars-Bench aims to establish a standardized foundation for developing and comparing machine learning models for Mars science. Our data, models, and code are available at: https://mars-bench.github.io/.
comment: Accepted at NeurIPS 2025
☆ Learning Individual Movement Shifts After Urban Disruptions with Social Infrastructure Reliance
Shifts in individual movement patterns following disruptive events can reveal changing demands for community resources. However, predicting such shifts before disruptive events remains challenging for several reasons. First, measures are lacking for individuals' heterogeneous social infrastructure resilience (SIR), which directly influences their movement patterns, and commonly used features are often limited or unavailable at scale, e.g., sociodemographic characteristics. Second, the complex interactions between individual movement patterns and spatial contexts have not been sufficiently captured. Third, individual-level movement may be spatially sparse and not well-suited to traditional decision-making methods for movement predictions. This study incorporates individuals' SIR into a conditioned deep learning model to capture the complex relationships between individual movement patterns and local spatial context using large-scale, sparse individual-level data. Our experiments demonstrate that incorporating individuals' SIR and spatial context can enhance the model's ability to predict post-event individual movement patterns. The conditioned model can capture the divergent shifts in movement patterns among individuals who exhibit similar pre-event patterns but differ in SIR.
☆ STNet: Spectral Transformation Network for Solving Operator Eigenvalue Problem
Operator eigenvalue problems play a critical role in various scientific fields and engineering applications, yet numerical methods are hindered by the curse of dimensionality. Recent deep learning methods provide an efficient approach to address this challenge by iteratively updating neural networks. These methods' performance relies heavily on the spectral distribution of the given operator: larger gaps between the operator's eigenvalues will improve precision, thus tailored spectral transformations that leverage the spectral distribution can enhance their performance. Based on this observation, we propose the Spectral Transformation Network (STNet). During each iteration, STNet uses approximate eigenvalues and eigenfunctions to perform spectral transformations on the original operator, turning it into an equivalent but easier problem. Specifically, we employ deflation projection to exclude the subspace corresponding to already solved eigenfunctions, thereby reducing the search space and avoiding converging to existing eigenfunctions. Additionally, our filter transform magnifies eigenvalues in the desired region and suppresses those outside, further improving performance. Extensive experiments demonstrate that STNet consistently outperforms existing learning-based methods, achieving state-of-the-art performance in accuracy.
☆ HyperGraphX: Graph Transductive Learning with Hyperdimensional Computing and Message Passing
We present a novel algorithm, \hdgc, that marries graph convolution with binding and bundling operations in hyperdimensional computing for transductive graph learning. For prediction accuracy \hdgc outperforms major and popular graph neural network implementations as well as state-of-the-art hyperdimensional computing implementations for a collection of homophilic graphs and heterophilic graphs. Compared with the most accurate learning methodologies we have tested, on the same target GPU platform, \hdgc is on average 9561.0 and 144.5 times faster than \gcnii, a graph neural network implementation and HDGL, a hyperdimensional computing implementation, respectively. As the majority of the learning operates on binary vectors, we expect outstanding energy performance of \hdgc on neuromorphic and emerging process-in-memory devices.
☆ Diffusion Adaptive Text Embedding for Text-to-Image Diffusion Models NeurIPS 2025
Text-to-image diffusion models rely on text embeddings from a pre-trained text encoder, but these embeddings remain fixed across all diffusion timesteps, limiting their adaptability to the generative process. We propose Diffusion Adaptive Text Embedding (DATE), which dynamically updates text embeddings at each diffusion timestep based on intermediate perturbed data. We formulate an optimization problem and derive an update rule that refines the text embeddings at each sampling step to improve alignment and preference between the mean predicted image and the text. This allows DATE to dynamically adapts the text conditions to the reverse-diffused images throughout diffusion sampling without requiring additional model training. Through theoretical analysis and empirical results, we show that DATE maintains the generative capability of the model while providing superior text-image alignment over fixed text embeddings across various tasks, including multi-concept generation and text-guided image editing. Our code is available at https://github.com/aailab-kaist/DATE.
comment: Accepted at NeurIPS 2025
☆ An efficient probabilistic hardware architecture for diffusion-like models
The proliferation of probabilistic AI has promoted proposals for specialized stochastic computers. Despite promising efficiency gains, these proposals have failed to gain traction because they rely on fundamentally limited modeling techniques and exotic, unscalable hardware. In this work, we address these shortcomings by proposing an all-transistor probabilistic computer that implements powerful denoising models at the hardware level. A system-level analysis indicates that devices based on our architecture could achieve performance parity with GPUs on a simple image benchmark using approximately 10,000 times less energy.
comment: 9 pages, 6 figures
☆ The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity
Traditional LLM alignment methods are vulnerable to heterogeneity in human preferences. Fitting a na\"ive probabilistic model to pairwise comparison data (say over prompt-completion pairs) yields an inconsistent estimate of the population-average utility -a canonical measure of social welfare. We propose a new method, dubbed the sign estimator, that provides a simple, provably consistent, and efficient estimator by replacing cross-entropy with binary classification loss in the aggregation step. This simple modification recovers consistent ordinal alignment under mild assumptions and achieves the first polynomial finite-sample error bounds in this setting. In realistic simulations of LLM alignment using digital twins, the sign estimator substantially reduces preference distortion over a panel of simulated personas, cutting (angular) estimation error by nearly 35% and decreasing disagreement with true population preferences from 12% to 8% compared to standard RLHF. Our method also compares favorably to panel data heuristics that explicitly model user heterogeneity and require tracking individual-level preference data-all while maintaining the implementation simplicity of existing LLM alignment pipelines.
☆ SafeVision: Efficient Image Guardrail with Robust Policy Adherence and Explainability
With the rapid proliferation of digital media, the need for efficient and transparent safeguards against unsafe content is more critical than ever. Traditional image guardrail models, constrained by predefined categories, often misclassify content due to their pure feature-based learning without semantic reasoning. Moreover, these models struggle to adapt to emerging threats, requiring costly retraining for new threats. To address these limitations, we introduce SafeVision, a novel image guardrail that integrates human-like reasoning to enhance adaptability and transparency. Our approach incorporates an effective data collection and generation framework, a policy-following training pipeline, and a customized loss function. We also propose a diverse QA generation and training strategy to enhance learning effectiveness. SafeVision dynamically aligns with evolving safety policies at inference time, eliminating the need for retraining while ensuring precise risk assessments and explanations. Recognizing the limitations of existing unsafe image benchmarks, which either lack granularity or cover limited risks, we introduce VisionHarm, a high-quality dataset comprising two subsets: VisionHarm Third-party (VisionHarm-T) and VisionHarm Comprehensive(VisionHarm-C), spanning diverse harmful categories. Through extensive experiments, we show that SafeVision achieves state-of-the-art performance on different benchmarks. SafeVision outperforms GPT-4o by 8.6% on VisionHarm-T and by 15.5% on VisionHarm-C, while being over 16x faster. SafeVision sets a comprehensive, policy-following, and explainable image guardrail with dynamic adaptation to emerging threats.
comment: 42 pages, 9 figures
☆ Neural USD: An object-centric framework for iterative editing and control
Amazing progress has been made in controllable generative modeling, especially over the last few years. However, some challenges remain. One of them is precise and iterative object editing. In many of the current methods, trying to edit the generated image (for example, changing the color of a particular object in the scene or changing the background while keeping other elements unchanged) by changing the conditioning signals often leads to unintended global changes in the scene. In this work, we take the first steps to address the above challenges. Taking inspiration from the Universal Scene Descriptor (USD) standard developed in the computer graphics community, we introduce the "Neural Universal Scene Descriptor" or Neural USD. In this framework, we represent scenes and objects in a structured, hierarchical manner. This accommodates diverse signals, minimizes model-specific constraints, and enables per-object control over appearance, geometry, and pose. We further apply a fine-tuning approach which ensures that the above control signals are disentangled from one another. We evaluate several design considerations for our framework, demonstrating how Neural USD enables iterative and incremental workflows. More information at: https://escontrela.me/neural_usd .
comment: 22 pages, 16 figures, 1 table
☆ Uncovering the Potential Risks in Unlearning: Danger of English-only Unlearning in Multilingual LLMs
There have been a couple of studies showing that attempting to erase multilingual knowledge using only English data is insufficient for multilingual LLMs. However, their analyses remain highly performance-oriented. In this paper, we switch the point of view to evaluation, and address an additional blind spot which reveals itself when the multilingual LLM is fully finetuned with parallel multilingual dataset before unlearning. Here, language confusion occurs whereby a model responds in language different from that of the input prompt. Language confusion is a problematic phenomenon in unlearning, causing the standard reference-based metrics to fail. We tackle this phenomenon in three steps: (1) introduce N-gram-based Language-Mix (N-Mix) score to quantitatively show the language confusion is pervasive and consistent in multilingual LLMs, (2) demonstrate that reference-based metrics result in false negatives when N-Mix score is high, and(3) suggest the need of new type of unlearning evaluation that can directly assess the content of the generated sentences. We call this type of metrics as semantic-based metric.
☆ ChessQA: Evaluating Large Language Models for Chess Understanding
Chess provides an ideal testbed for evaluating the reasoning, modeling, and abstraction capabilities of large language models (LLMs), as it has well-defined structure and objective ground truth while admitting a wide spectrum of skill levels. However, existing evaluations of LLM ability in chess are ad hoc and narrow in scope, making it difficult to accurately measure LLM chess understanding and how it varies with scale, post-training methodologies, or architecture choices. We present ChessQA, a comprehensive benchmark that assesses LLM chess understanding across five task categories (Structural, Motifs, Short Tactics, Position Judgment, and Semantic), which approximately correspond to the ascending abstractions that players master as they accumulate chess knowledge, from understanding basic rules and learning tactical motifs to correctly calculating tactics, evaluating positions, and semantically describing high-level concepts. In this way, ChessQA captures a more comprehensive picture of chess ability and understanding, going significantly beyond the simple move quality evaluations done previously, and offers a controlled, consistent setting for diagnosis and comparison. Furthermore, ChessQA is inherently dynamic, with prompts, answer keys, and construction scripts that can evolve as models improve. Evaluating a range of contemporary LLMs, we find persistent weaknesses across all five categories and provide results and error analyses by category. We will release the code, periodically refreshed datasets, and a public leaderboard to support further research.
comment: 33 pages,8 figures
☆ Efficient License Plate Recognition via Pseudo-Labeled Supervision with Grounding DINO and YOLOv8 SP
Developing a highly accurate automatic license plate recognition system (ALPR) is challenging due to environmental factors such as lighting, rain, and dust. Additional difficulties include high vehicle speeds, varying camera angles, and low-quality or low-resolution images. ALPR is vital in traffic control, parking, vehicle tracking, toll collection, and law enforcement applications. This paper proposes a deep learning strategy using YOLOv8 for license plate detection and recognition tasks. This method seeks to enhance the performance of the model using datasets from Ontario, Quebec, California, and New York State. It achieved an impressive recall rate of 94% on the dataset from the Center for Pattern Recognition and Machine Intelligence (CENPARMI) and 91% on the UFPR-ALPR dataset. In addition, our method follows a semi-supervised learning framework, combining a small set of manually labeled data with pseudo-labels generated by Grounding DINO to train our detection model. Grounding DINO, a powerful vision-language model, automatically annotates many images with bounding boxes for license plates, thereby minimizing the reliance on labor-intensive manual labeling. By integrating human-verified and model-generated annotations, we can scale our dataset efficiently while maintaining label quality, which significantly enhances the training process and overall model performance. Furthermore, it reports character error rates for both datasets, providing additional insight into system performance.
comment: 6 pages, 8 figures. Presented at 2025 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), August 31 - September 3, 2025, Istanbul, Turkey
☆ StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems
Automatically configuring storage systems is hard: parameter spaces are large and conditions vary across workloads, deployments, and versions. Heuristic and ML tuners are often system specific, require manual glue, and degrade under changes. Recent LLM-based approaches help but usually treat tuning as a single-shot, system-specific task, which limits cross-system reuse, constrains exploration, and weakens validation. We present StorageXTuner, an LLM agent-driven auto-tuning framework for heterogeneous storage engines. StorageXTuner separates concerns across four agents - Executor (sandboxed benchmarking), Extractor (performance digest), Searcher (insight-guided configuration exploration), and Reflector (insight generation and management). The design couples an insight-driven tree search with layered memory that promotes empirically validated insights and employs lightweight checkers to guard against unsafe actions. We implement a prototype and evaluate it on RocksDB, LevelDB, CacheLib, and MySQL InnoDB with YCSB, MixGraph, and TPC-H/C. Relative to out-of-the-box settings and to ELMo-Tune, StorageXTuner reaches up to 575% and 111% higher throughput, reduces p99 latency by as much as 88% and 56%, and converges with fewer trials.
comment: ArXiv version; Affiliations: Arizona State University (Lin, Zhang, Thakkar, Sun, Cao) and Iowa State University (Zheng)
☆ Towards Human-AI Synergy in Requirements Engineering: A Framework and Preliminary Study
The future of Requirements Engineering (RE) is increasingly driven by artificial intelligence (AI), reshaping how we elicit, analyze, and validate requirements. Traditional RE is based on labor-intensive manual processes prone to errors and complexity. AI-powered approaches, specifically large language models (LLMs), natural language processing (NLP), and generative AI, offer transformative solutions and reduce inefficiencies. However, the use of AI in RE also brings challenges like algorithmic bias, lack of explainability, and ethical concerns related to automation. To address these issues, this study introduces the Human-AI RE Synergy Model (HARE-SM), a conceptual framework that integrates AI-driven analysis with human oversight to improve requirements elicitation, analysis, and validation. The model emphasizes ethical AI use through transparency, explainability, and bias mitigation. We outline a multi-phase research methodology focused on preparing RE datasets, fine-tuning AI models, and designing collaborative human-AI workflows. This preliminary study presents the conceptual framework and early-stage prototype implementation, establishing a research agenda and practical design direction for applying intelligent data science techniques to semi-structured and unstructured RE data in collaborative environments.
comment: Accepted at the 2025 Sixth International Conference on Intelligent Data Science Technologies and Applications (IDSTA 2025),8 pages, 4 figures. Published in IEEE
☆ Aligning Large Language Models with Procedural Rules: An Autoregressive State-Tracking Prompting for In-Game Trading
Large Language Models (LLMs) enable dynamic game interactions but fail to follow essential procedural flows in rule-governed trading systems, eroding player trust. This work resolves the core tension between the creative flexibility of LLMs and the procedural demands of in-game trading (browse-offer-review-confirm). To this end, Autoregressive State-Tracking Prompting (ASTP) is introduced, a methodology centered on a strategically orchestrated prompt that compels an LLM to make its state-tracking process explicit and verifiable. Instead of relying on implicit contextual understanding, ASTP tasks the LLM with identifying and reporting a predefined state label from the previous turn. To ensure transactional integrity, this is complemented by a state-specific placeholder post-processing method for accurate price calculations. Evaluation across 300 trading dialogues demonstrates >99% state compliance and 99.3% calculation precision. Notably, ASTP with placeholder post-processing on smaller models (Gemini-2.5-Flash) matches larger models' (Gemini-2.5-Pro) performance while reducing response time from 21.2s to 2.4s, establishing a practical foundation that satisfies both real-time requirements and resource constraints of commercial games.
comment: 8 pages main content, 18 pages supplementary material, 4 figures
☆ Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers
Mechanistic interpretability aims to reverse-engineer large language models (LLMs) into human-understandable computational circuits. However, the complexity of pretrained models often obscures the minimal mechanisms required for specific reasoning tasks. In this work, we train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification (IOI) task -- a benchmark for studying coreference -- like reasoning in transformers. Surprisingly, a single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking MLPs and normalization layers. Through residual stream decomposition, spectral analysis, and embedding interventions, we find that the two heads specialize into additive and contrastive subcircuits that jointly implement IOI resolution. Furthermore, we show that a two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions. These results demonstrate that task-specific training induces highly interpretable, minimal circuits, offering a controlled testbed for probing the computational foundations of transformer reasoning.
comment: 9 pages, 10 figures
☆ Taming the Real-world Complexities in CPT E/M Coding with Large Language Models EMNLP 2025
Evaluation and Management (E/M) coding, under the Current Procedural Terminology (CPT) taxonomy, documents medical services provided to patients by physicians. Used primarily for billing purposes, it is in physicians' best interest to provide accurate CPT E/M codes. %While important, it is an auxiliary task that adds to physicians' documentation burden. Automating this coding task will help alleviate physicians' documentation burden, improve billing efficiency, and ultimately enable better patient care. However, a number of real-world complexities have made E/M encoding automation a challenging task. In this paper, we elaborate some of the key complexities and present ProFees, our LLM-based framework that tackles them, followed by a systematic evaluation. On an expert-curated real-world dataset, ProFees achieves an increase in coding accuracy of more than 36\% over a commercial CPT E/M coding system and almost 5\% over our strongest single-prompt baseline, demonstrating its effectiveness in addressing the real-world complexities.
comment: EMNLP 2025 Industry Track
☆ Cyclic Counterfactuals under Shift-Scale Interventions NeurIPS 2025
Most counterfactual inference frameworks traditionally assume acyclic structural causal models (SCMs), i.e. directed acyclic graphs (DAGs). However, many real-world systems (e.g. biological systems) contain feedback loops or cyclic dependencies that violate acyclicity. In this work, we study counterfactual inference in cyclic SCMs under shift-scale interventions, i.e., soft, policy-style changes that rescale and/or shift a variable's mechanism.
comment: Accepted at NeurIPS 2025
☆ Epileptic Seizure Detection and Prediction from EEG Data: A Machine Learning Approach with Clinical Validation
In recent years, machine learning has become an increasingly powerful tool for supporting seizure detection and monitoring in epilepsy care. Traditional approaches focus on identifying seizures only after they begin, which limits the opportunity for early intervention and proactive treatment. In this study, we propose a novel approach that integrates both real-time seizure detection and prediction, aiming to capture subtle temporal patterns in EEG data that may indicate an upcoming seizure. Our approach was evaluated using the CHB-MIT Scalp EEG Database, which includes 969 hours of recordings and 173 seizures collected from 23 pediatric and young adult patients with drug-resistant epilepsy. To support seizure detection, we implemented a range of supervised machine learning algorithms, including K-Nearest Neighbors, Logistic Regression, Random Forest, and Support Vector Machine. The Logistic Regression achieved 90.9% detection accuracy with 89.6% recall, demonstrating balanced performance suitable for clinical screening. Random Forest and Support Vector Machine models achieved higher accuracy (94.0%) but with 0% recall, failing to detect any seizures, illustrating that accuracy alone is insufficient for evaluating medical ML models with class imbalance. For seizure prediction, we employed Long Short-Term Memory (LSTM) networks, which use deep learning to model temporal dependencies in EEG data. The LSTM model achieved 89.26% prediction accuracy. These results highlight the potential of developing accessible, real-time monitoring tools that not only detect seizures as traditionally done, but also predict them before they occur. This ability to predict seizures marks a significant shift from reactive seizure management to a more proactive approach, allowing patients to anticipate seizures and take precautionary measures to reduce the risk of injury or other complications.
comment: 9 pages, 3 figures
☆ FaRAccel: FPGA-Accelerated Defense Architecture for Efficient Bit-Flip Attack Resilience in Transformer Models
Forget and Rewire (FaR) methodology has demonstrated strong resilience against Bit-Flip Attacks (BFAs) on Transformer-based models by obfuscating critical parameters through dynamic rewiring of linear layers. However, the application of FaR introduces non-negligible performance and memory overheads, primarily due to the runtime modification of activation pathways and the lack of hardware-level optimization. To overcome these limitations, we propose FaRAccel, a novel hardware accelerator architecture implemented on FPGA, specifically designed to offload and optimize FaR operations. FaRAccel integrates reconfigurable logic for dynamic activation rerouting, and lightweight storage of rewiring configurations, enabling low-latency inference with minimal energy overhead. We evaluate FaRAccel across a suite of Transformer models and demonstrate substantial reductions in FaR inference latency and improvement in energy efficiency, while maintaining the robustness gains of the original FaR methodology. To the best of our knowledge, this is the first hardware-accelerated defense against BFAs in Transformers, effectively bridging the gap between algorithmic resilience and efficient deployment on real-world AI platforms.
comment: Accepted By ICCD 2025
☆ LRT-Diffusion: Calibrated Risk-Aware Guidance for Diffusion Policies
Diffusion policies are competitive for offline reinforcement learning (RL) but are typically guided at sampling time by heuristics that lack a statistical notion of risk. We introduce LRT-Diffusion, a risk-aware sampling rule that treats each denoising step as a sequential hypothesis test between the unconditional prior and the state-conditional policy head. Concretely, we accumulate a log-likelihood ratio and gate the conditional mean with a logistic controller whose threshold tau is calibrated once under H0 to meet a user-specified Type-I level alpha. This turns guidance from a fixed push into an evidence-driven adjustment with a user-interpretable risk budget. Importantly, we deliberately leave training vanilla (two heads with standard epsilon-prediction) under the structure of DDPM. LRT guidance composes naturally with Q-gradients: critic-gradient updates can be taken at the unconditional mean, at the LRT-gated mean, or a blend, exposing a continuum from exploitation to conservatism. We standardize states and actions consistently at train and test time and report a state-conditional out-of-distribution (OOD) metric alongside return. On D4RL MuJoCo tasks, LRT-Diffusion improves the return-OOD trade-off over strong Q-guided baselines in our implementation while honoring the desired alpha. Theoretically, we establish level-alpha calibration, concise stability bounds, and a return comparison showing when LRT surpasses Q-guidance-especially when off-support errors dominate. Overall, LRT-Diffusion is a drop-in, inference-time method that adds principled, calibrated risk control to diffusion policies for offline RL.
☆ FT-ARM: Fine-Tuned Agentic Reflection Multimodal Language Model for Pressure Ulcer Severity Classification with Reasoning
Pressure ulcers (PUs) are a serious and prevalent healthcare concern. Accurate classification of PU severity (Stages I-IV) is essential for proper treatment but remains challenging due to subtle visual distinctions and subjective interpretation, leading to variability among clinicians. Prior AI-based approaches using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) achieved promising accuracy but offered limited interpretability. We present FT-ARM (Fine-Tuned Agentic Reflection Multimodal model), a fine-tuned multimodal large language model (MLLM) with an agentic self-reflection mechanism for pressure ulcer severity classification. Inspired by clinician-style diagnostic reassessment, FT-ARM iteratively refines its predictions by reasoning over visual features and encoded clinical knowledge from text, enhancing both accuracy and consistency. On the publicly available Pressure Injury Image Dataset (PIID), FT-ARM, fine-tuned from LLaMA 3.2 90B, achieved 85% accuracy in classifying PU stages I-IV, surpassing prior CNN-based models by +4%. Unlike earlier CNN/ViT studies that relied solely on offline evaluations, FT-ARM is designed and tested for live inference, reflecting real-time deployment conditions. Furthermore, it produces clinically grounded natural-language explanations, improving interpretability and trust. By integrating fine-tuning and reflective reasoning across multimodal inputs, FT-ARM advances the reliability, transparency, and clinical applicability of automated wound assessment systems, addressing the critical need for consistent and explainable PU staging to support improved patient care.
☆ Hammering the Diagnosis: Rowhammer-Induced Stealthy Trojan Attacks on ViT-Based Medical Imaging
Vision Transformers (ViTs) have emerged as powerful architectures in medical image analysis, excelling in tasks such as disease detection, segmentation, and classification. However, their reliance on large, attention-driven models makes them vulnerable to hardware-level attacks. In this paper, we propose a novel threat model referred to as Med-Hammer that combines the Rowhammer hardware fault injection with neural Trojan attacks to compromise the integrity of ViT-based medical imaging systems. Specifically, we demonstrate how malicious bit flips induced via Rowhammer can trigger implanted neural Trojans, leading to targeted misclassification or suppression of critical diagnoses (e.g., tumors or lesions) in medical scans. Through extensive experiments on benchmark medical imaging datasets such as ISIC, Brain Tumor, and MedMNIST, we show that such attacks can remain stealthy while achieving high attack success rates about 82.51% and 92.56% in MobileViT and SwinTransformer, respectively. We further investigate how architectural properties, such as model sparsity, attention weight distribution, and the number of features of the layer, impact attack effectiveness. Our findings highlight a critical and underexplored intersection between hardware-level faults and deep learning security in healthcare applications, underscoring the urgent need for robust defenses spanning both model architectures and underlying hardware platforms.
comment: Accepted, ICCD 2025
☆ Sequences of Logits Reveal the Low Rank Structure of Language Models
A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models at a model-agnostic level: as sequential probabilistic models. We first empirically demonstrate that a wide range of modern language models exhibit low-rank structure: in particular, matrices built from the model's logits for varying sets of prompts and responses have low approximate rank. We then show that this low-rank structure can be leveraged for generation -- in particular, we can generate a response to a target prompt using a linear combination of the model's outputs on unrelated, or even nonsensical prompts. On the theoretical front, we observe that studying the approximate rank of language models in the sense discussed above yields a simple universal abstraction whose theoretical predictions parallel our experiments. We then analyze the representation power of the abstraction and give provable learning guarantees.
☆ SCOUT: A Lightweight Framework for Scenario Coverage Assessment in Autonomous Driving
Assessing scenario coverage is crucial for evaluating the robustness of autonomous agents, yet existing methods rely on expensive human annotations or computationally intensive Large Vision-Language Models (LVLMs). These approaches are impractical for large-scale deployment due to cost and efficiency constraints. To address these shortcomings, we propose SCOUT (Scenario Coverage Oversight and Understanding Tool), a lightweight surrogate model designed to predict scenario coverage labels directly from an agent's latent sensor representations. SCOUT is trained through a distillation process, learning to approximate LVLM-generated coverage labels while eliminating the need for continuous LVLM inference or human annotation. By leveraging precomputed perception features, SCOUT avoids redundant computations and enables fast, scalable scenario coverage estimation. We evaluate our method across a large dataset of real-life autonomous navigation scenarios, demonstrating that it maintains high accuracy while significantly reducing computational cost. Our results show that SCOUT provides an effective and practical alternative for large-scale coverage analysis. While its performance depends on the quality of LVLM-generated training labels, SCOUT represents a major step toward efficient scenario coverage oversight in autonomous systems.
☆ Finding Culture-Sensitive Neurons in Vision-Language Models
Despite their impressive performance, vision-language models (VLMs) still struggle on culturally situated inputs. To understand how VLMs process culturally grounded information, we study the presence of culture-sensitive neurons, i.e. neurons whose activations show preferential sensitivity to inputs associated with particular cultural contexts. We examine whether such neurons are important for culturally diverse visual question answering and where they are located. Using the CVQA benchmark, we identify neurons of culture selectivity and perform causal tests by deactivating the neurons flagged by different identification methods. Experiments on three VLMs across 25 cultural groups demonstrate the existence of neurons whose ablation disproportionately harms performance on questions about the corresponding cultures, while having minimal effects on others. Moreover, we propose a new margin-based selector - Contrastive Activation Selection (CAS), and show that it outperforms existing probability- and entropy-based methods in identifying culture-sensitive neurons. Finally, our layer-wise analyses reveals that such neurons tend to cluster in certain decoder layers. Overall, our findings shed new light on the internal organization of multimodal representations.
comment: 22 pages, 13 figures
☆ KAN-GCN: Combining Kolmogorov-Arnold Network with Graph Convolution Network for an Accurate Ice Sheet Emulator NeurIPS 2025
We introduce KAN-GCN, a fast and accurate emulator for ice sheet modeling that places a Kolmogorov-Arnold Network (KAN) as a feature-wise calibrator before graph convolution networks (GCNs). The KAN front end applies learnable one-dimensional warps and a linear mixing step, improving feature conditioning and nonlinear encoding without increasing message-passing depth. We employ this architecture to improve the performance of emulators for numerical ice sheet models. Our emulator is trained and tested using 36 melting-rate simulations with 3 mesh-size settings for Pine Island Glacier, Antarctica. Across 2- to 5-layer architectures, KAN-GCN matches or exceeds the accuracy of pure GCN and MLP-GCN baselines. Despite a small parameter overhead, KAN-GCN improves inference throughput on coarser meshes by replacing one edge-wise message-passing layer with a node-wise transform; only the finest mesh shows a modest cost. Overall, KAN-first designs offer a favorable accuracy vs. efficiency trade-off for large transient scenario sweeps.
comment: Accept for NeurIPS 2025 Workshop: New Perspectives in Graph Machine Learning
☆ Trust Dynamics in Strategic Coopetition: Computational Foundations for Requirements Engineering in Multi-Agent Systems
Requirements engineering increasingly occurs in multi-stakeholder environments where organizations simultaneously cooperate and compete, creating coopetitive relationships in which trust evolves dynamically based on observed behavior over repeated interactions. While conceptual modeling languages like i* represent trust relationships qualitatively, they lack computational mechanisms for analyzing how trust changes with behavioral evidence. Conversely, computational trust models from multi-agent systems provide algorithmic updating but lack grounding in requirements engineering contexts and conceptual models. This technical report bridges this gap by developing a computational trust model that extends game-theoretic foundations for strategic coopetition with dynamic trust evolution. We introduce trust as a two-layer system with immediate trust responding to current behavior and reputation tracking violation history. Trust evolves through asymmetric updating where cooperation builds trust gradually while violations erode it sharply, creating hysteresis effects and trust ceilings that constrain relationship recovery. We develop a structured translation framework enabling requirements engineers to instantiate computational trust models from i* dependency networks and organizational contexts. Comprehensive experimental validation across 78,125 parameter configurations establishes robust emergence of negativity bias, hysteresis effects, and cumulative damage amplification. Empirical validation using the Renault-Nissan Alliance case study (1999-2025) achieves 49 out of 60 validation points (81.7%), successfully reproducing documented trust evolution across five distinct relationship phases including crisis and recovery periods. This technical report builds upon its foundational companion work in arXiv:2510.18802.
comment: 62 pages, 20 figures, This technical report is the second in a research program and should be read in conjunction with its foundational companion work arXiv:2510.18802. It builds on the frameworks established in that prior work and also adapts and extends material on trustworthiness first presented in the doctoral dissertation 'Modeling Strategic Coopetition' (Pant, 2021, University of Toronto)
☆ Understanding Multi-View Transformers ICCV 2025
Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. However, contrary to previous optimization-based pipelines, the inner mechanisms of multi-view transformers are unclear. Their black-box nature makes further improvements beyond data scaling challenging and complicates usage in safety- and reliability-critical applications. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers' layers. In this manner, we investigate a variant of the DUSt3R model, shedding light on the development of its latent state across blocks, the role of the individual layers, and suggest how it differs from methods with stronger inductive biases of explicit global pose. Finally, we show that the investigated variant of DUSt3R estimates correspondences that are refined with reconstructed geometry. The code used for the analysis is available at https://github.com/JulienGaubil/und3rstand .
comment: Presented at the ICCV 2025 E2E3D Workshop
☆ Fair Indivisible Payoffs through Shapley Value
We consider the problem of payoff division in indivisible coalitional games, where the value of the grand coalition is a natural number. This number represents a certain quantity of indivisible objects, such as parliamentary seats, kidney exchanges, or top features contributing to the outcome of a machine learning model. The goal of this paper is to propose a fair method for dividing these objects among players. To achieve this, we define the indivisible Shapley value and study its properties. We demonstrate our proposed technique using three case studies, in particular, we use it to identify key regions of an image in the context of an image classification task.
☆ Efficiency Without Cognitive Change: Evidence from Human Interaction with Narrow AI Systems
The growing integration of artificial intelligence (AI) into human cognition raises a fundamental question: does AI merely improve efficiency, or does it alter how we think? This study experimentally tested whether short-term exposure to narrow AI tools enhances core cognitive abilities or simply optimizes task performance. Thirty young adults completed standardized neuropsychological assessments embedded in a seven-week protocol with a four-week online intervention involving problem-solving and verbal comprehension tasks, either with or without AI support (ChatGPT). While AI-assisted participants completed several tasks faster and more accurately, no significant pre-post differences emerged in standardized measures of problem solving or verbal comprehension. These results demonstrate efficiency gains without cognitive change, suggesting that current narrow AI systems serve as cognitive scaffolds extending performance without transforming underlying mental capacities. The findings highlight the need for ethical and educational frameworks that promote critical and autonomous thinking in an increasingly AI-augmented cognitive ecology.
comment: 30 pages, 8 figures. Preprint submitted for peer review (not yet accepted or published)
☆ Scheduling Your LLM Reinforcement Learning with Reasoning Trees
Using Reinforcement Learning with Verifiable Rewards (RLVR) to optimize Large Language Models (LLMs) can be conceptualized as progressively editing a query's `Reasoning Tree'. This process involves exploring nodes (tokens) and dynamically modifying the model's policy at each node. When combined with data scheduling, this process yields further gains in data efficiency and accuracy. However, existing RLVR data scheduling methods typically rely on path-based metrics to rank queries, overlooking the reasoning tree structures of these queries. In this paper, we introduce a novel metric, namely Reasoning Score (r-score), which measures the query's learning difficulty based on the structure of its reasoning tree. Based on the r-score, we propose the Reasoning Tree Schedule (Re-Schedule), a scheduling algorithm that constructs a curriculum progressing from structurally simple (high r-score) to complex (low r-score) queries. Experiments on six math-reasoning benchmarks show that Re-Schedule significantly improves average accuracy, achieving gains of up to 3.2%. These strong results validate our approach and demonstrate that a structural understanding of the reasoning tree provides a more powerful and principled foundation for RLVR data scheduling.
☆ The Narrative Continuity Test: A Conceptual Framework for Evaluating Identity Persistence in AI Systems
Artificial intelligence systems based on large language models (LLMs) can now generate coherent text, music, and images, yet they operate without a persistent state: each inference reconstructs context from scratch. This paper introduces the Narrative Continuity Test (NCT) -- a conceptual framework for evaluating identity persistence and diachronic coherence in AI systems. Unlike capability benchmarks that assess task performance, the NCT examines whether an LLM remains the same interlocutor across time and interaction gaps. The framework defines five necessary axes -- Situated Memory, Goal Persistence, Autonomous Self-Correction, Stylistic & Semantic Stability, and Persona/Role Continuity -- and explains why current architectures systematically fail to support them. Case analyses (Character.AI, Grok, Replit, Air Canada) show predictable continuity failures under stateless inference. The NCT reframes AI evaluation from performance to persistence, outlining conceptual requirements for future benchmarks and architectural designs that could sustain long-term identity and goal coherence in generative models.
comment: 35 pages, 127 references
☆ The Generation Phases of Flow Matching: a Denoising Perspective
Flow matching has achieved remarkable success, yet the factors influencing the quality of its generation process remain poorly understood. In this work, we adopt a denoising perspective and design a framework to empirically probe the generation process. Laying down the formal connections between flow matching models and denoisers, we provide a common ground to compare their performances on generation and denoising. This enables the design of principled and controlled perturbations to influence sample generation: noise and drift. This leads to new insights on the distinct dynamical phases of the generative process, enabling us to precisely characterize at which stage of the generative process denoisers succeed or fail and why this matters.
☆ Do Chatbots Walk the Talk of Responsible AI?
This study examines whether leading AI chatbot companies implement the responsible AI principles they publicly advocate. The authors used a mixed-methods approach analyzing four major chatbots (ChatGPT, Gemini, DeepSeek, and Grok) across company websites, technical documentation, and direct chatbot evaluations. We found significant gaps between corporate rhetoric and practice.
☆ Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-Flash-Omni introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.
comment: 18 pages, 5 figures
☆ SafeEditor: Unified MLLM for Efficient Post-hoc T2I Safety Editing
With the rapid advancement of text-to-image (T2I) models, ensuring their safety has become increasingly critical. Existing safety approaches can be categorized into training-time and inference-time methods. While inference-time methods are widely adopted due to their cost-effectiveness, they often suffer from limitations such as over-refusal and imbalance between safety and utility. To address these challenges, we propose a multi-round safety editing framework that functions as a model-agnostic, plug-and-play module, enabling efficient safety alignment for any text-to-image model. Central to this framework is MR-SafeEdit, a multi-round image-text interleaved dataset specifically constructed for safety editing in text-to-image generation. We introduce a post-hoc safety editing paradigm that mirrors the human cognitive process of identifying and refining unsafe content. To instantiate this paradigm, we develop SafeEditor, a unified MLLM capable of multi-round safety editing on generated images. Experimental results show that SafeEditor surpasses prior safety approaches by reducing over-refusal while achieving a more favorable safety-utility balance.
♻ ☆ Memory Mosaics at scale NeurIPS 2025
Memory Mosaics [Zhang et al., 2025], networks of associative memories, have demonstrated appealing compositional and in-context learning capabilities on medium-scale networks (GPT-2 scale) and synthetic small datasets. This work shows that these favorable properties remain when we scale memory mosaics to large language model sizes (llama-8B scale) and real-world datasets. To this end, we scale memory mosaics to 10B size, we train them on one trillion tokens, we introduce a couple architectural modifications ("Memory Mosaics v2"), we assess their capabilities across three evaluation dimensions: training-knowledge storage, new-knowledge storage, and in-context learning. Throughout the evaluation, memory mosaics v2 match transformers on the learning of training knowledge (first dimension) and significantly outperforms transformers on carrying out new tasks at inference time (second and third dimensions). These improvements cannot be easily replicated by simply increasing the training data for transformers. A memory mosaics v2 trained on one trillion tokens still perform better on these tasks than a transformer trained on eight trillion tokens.
comment: Oral @ NeurIPS 2025
♻ ☆ Retrieval-Augmented Generation-based Relation Extraction
Information Extraction (IE) is a transformative process that converts unstructured text data into a structured format by employing entity and relation extraction (RE) methodologies. The identification of the relation between a pair of entities plays a crucial role within this framework. Despite the existence of various techniques for relation extraction, their efficacy heavily relies on access to labeled data and substantial computational resources. In addressing these challenges, Large Language Models (LLMs) emerge as promising solutions; however, they might return hallucinating responses due to their own training data. To overcome these limitations, Retrieved-Augmented Generation-based Relation Extraction (RAG4RE) in this work is proposed, offering a pathway to enhance the performance of relation extraction tasks. This work evaluated the effectiveness of our RAG4RE approach utilizing different LLMs. Through the utilization of established benchmarks, such as TACRED, TACREV, Re-TACRED, and SemEval RE datasets, our aim is to comprehensively evaluate the efficacy of our RAG4RE approach. In particularly, we leverage prominent LLMs including Flan T5, Llama2, and Mistral in our investigation. The results of our study demonstrate that our RAG4RE approach surpasses performance of traditional RE approaches based solely on LLMs, particularly evident in the TACRED dataset and its variations. Furthermore, our approach exhibits remarkable performance compared to previous RE methodologies across both TACRED and TACREV datasets, underscoring its efficacy and potential for advancing RE tasks in natural language processing.
comment: published at the Semantic Web journal. The last version is available: https://doi.org/10.1177/22104968251385519
♻ ☆ ADMN: A Layer-Wise Adaptive Multimodal Network for Dynamic Input Noise and Compute Resources
Multimodal deep learning systems are deployed in dynamic scenarios due to the robustness afforded by multiple sensing modalities. Nevertheless, they struggle with varying compute resource availability (due to multi-tenancy, device heterogeneity, etc.) and fluctuating quality of inputs (from sensor feed corruption, environmental noise, etc.). Statically provisioned multimodal systems cannot adapt when compute resources change over time, while existing dynamic networks struggle with strict compute budgets. Additionally, both systems often neglect the impact of variations in modality quality. Consequently, modalities suffering substantial corruption may needlessly consume resources better allocated towards other modalities. We propose ADMN, a layer-wise Adaptive Depth Multimodal Network capable of tackling both challenges: it adjusts the total number of active layers across all modalities to meet strict compute resource constraints and continually reallocates layers across input modalities according to their modality quality. Our evaluations showcase ADMN can match the accuracy of state-of-the-art networks while reducing up to 75% of their floating-point operations.
comment: Accepted to Neurips 2025
♻ ☆ Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons
As Large Language Models (LLMs) expand across domains, LLM judges have become essential for systems evaluation. Current benchmarks typically compare system outputs against baselines. This baseline-mediated approach, though convenient, yields lower reliability than direct comparison between systems. We propose Arena-Lite which integrates tournament structure on top of head-to-head comparison. The application of a tournament structure and direct comparison eliminates the need for baseline outputs, reduces the number of required comparisons, and allows higher reliability in system rankings. We conducted two experiments: (1) controlled stochastic modeling and (2) empirical validation with a real LLM judge. Those experiments collectively demonstrate that Arena-Lite consistently achieves higher reliability with fewer comparisons, even with smaller datasets or weaker judges. We release an easy-to-use web demonstration and code to foster adoption of Arena-Lite, streamlining model selection across research and industry communities. Arena-Lite demo and code are available on \href{https://huggingface.co/spaces/NCSOFT/ArenaLite}{https://huggingface.co/spaces/NCSOFT/ArenaLite}
comment: 8 pages for main body, 19 pages in total
♻ ☆ FieldGen: From Teleoperated Pre-Manipulation Trajectories to Field-Guided Data Generation
Large-scale and diverse datasets are vital for training robust robotic manipulation policies, yet existing data collection methods struggle to balance scale, diversity, and quality. Simulation offers scalability but suffers from sim-to-real gaps, while teleoperation yields high-quality demonstrations with limited diversity and high labor cost. We introduce FieldGen, a field-guided data generation framework that enables scalable, diverse, and high-quality real-world data collection with minimal human supervision. FieldGen decomposes manipulation into two stages: a pre-manipulation phase, allowing trajectory diversity, and a fine manipulation phase requiring expert precision. Human demonstrations capture key contact and pose information, after which an attraction field automatically generates diverse trajectories converging to successful configurations. This decoupled design combines scalable trajectory diversity with precise supervision. Moreover, FieldGen-Reward augments generated data with reward annotations to further enhance policy learning. Experiments demonstrate that policies trained with FieldGen achieve higher success rates and improved stability compared to teleoperation-based baselines, while significantly reducing human effort in long-term real-world data collection. Webpage is available at https://fieldgen.github.io/.
comment: Webpage: https://fieldgen.github.io/
♻ ☆ Reproducible workflow for online AI in digital health
Online artificial intelligence (AI) algorithms are an important component of digital health interventions. These online algorithms are designed to continually learn and improve their performance as streaming data is collected on individuals. Deploying online AI presents a key challenge: balancing adaptability of online AI with reproducibility. Online AI in digital interventions is a rapidly evolving area, driven by advances in algorithms, sensors, software, and devices. Digital health intervention development and deployment is a continuous process, where implementation - including the AI decision-making algorithm - is interspersed with cycles of re-development and optimization. Each deployment informs the next, making iterative deployment a defining characteristic of this field. This iterative nature underscores the importance of reproducibility: data collected across deployments must be accurately stored to have scientific utility, algorithm behavior must be auditable, and results must be comparable over time to facilitate scientific discovery and trustworthy refinement. This paper proposes a reproducible scientific workflow for developing, deploying, and analyzing online AI decision-making algorithms in digital health interventions. Grounded in practical experience from multiple real-world deployments, this workflow addresses key challenges to reproducibility across all phases of the online AI algorithm development life-cycle.
♻ ☆ TableTime: Reformulating Time Series Classification as Training-Free Table Understanding with Large Language Models
Large language models (LLMs) have demonstrated their effectiveness in multivariate time series classification (MTSC). Effective adaptation of LLMs for MTSC necessitates informative data representations. Existing LLM-based methods directly encode embeddings for time series within the latent space of LLMs from scratch to align with semantic space of LLMs. Despite their effectiveness, we reveal that these methods conceal three inherent bottlenecks: (1) they struggle to encode temporal and channel-specific information in a lossless manner, both of which are critical components of multivariate time series; (2) it is much difficult to align the learned representation space with the semantic space of the LLMs; (3) they require task-specific retraining, which is both computationally expensive and labor-intensive. To bridge these gaps, we propose TableTime, which reformulates MTSC as a table understanding task. Specifically, TableTime introduces the following strategies: (1) convert multivariate time series into a tabular form, thus minimizing information loss to the greatest extent; (2) represent tabular time series in text format to achieve natural alignment with the semantic space of LLMs; (3) design a reasoning framework that integrates contextual text information, neighborhood assistance, multi-path inference and problem decomposition to enhance the reasoning ability of LLMs and realize zero-shot classification. Extensive experiments performed on 10 publicly representative datasets from UEA archive verify the superiorities of the TableTime.
♻ ☆ BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents
Confidence in LLMs is a useful indicator of model uncertainty and answer reliability. Existing work mainly focused on single-turn scenarios, while research on confidence in complex multi-turn interactions is limited. In this paper, we investigate whether LLM-based search agents have the ability to communicate their own confidence through verbalized confidence scores after long sequences of actions, a significantly more challenging task compared to outputting confidence in a single interaction. Experimenting on open-source agentic models, we first find that models exhibit much higher task accuracy at high confidence while having near-zero accuracy when confidence is low. Based on this observation, we propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality, encourage the model to try again until reaching a satisfactory confidence level. Results show that our proposed methods significantly reduce token consumption while demonstrating competitive performance compared to baseline fixed budget TTS methods.
comment: 25 pages
♻ ☆ CustomIR: Unsupervised Fine-Tuning of Dense Embeddings for Known Document Corpora
Dense embedding models have become critical for modern information retrieval, particularly in RAG pipelines, but their performance often degrades when applied to specialized corpora outside their pre-training distribution. To address thi we introduce CustomIR, a framework for unsupervised adaptation of pre-trained language embedding models to domain-specific corpora using synthetically generated query-document pairs. CustomIR leverages large language models (LLMs) to create diverse queries grounded in a known target corpus, paired with LLM-verified hard negatives, eliminating the need for costly human annotation. Experiments on enterprise email and messaging datasets show that CustomIR consistently improves retrieval effectiveness with small models gaining up to 2.3 points in Recall@10. This performance increase allows these small models to rival the performance of much larger alternatives, allowing for cheaper RAG deployments. These results highlight that targeted synthetic fine-tuning offers a scalable and cost-efficient strategy for increasing domain-specific performance.
♻ ☆ Robust Uncertainty Quantification for Self-Evolving Large Language Models via Continual Domain Pretraining
Continual Learning (CL) is essential for enabling self-evolving large language models (LLMs) to adapt and remain effective amid rapid knowledge growth. Yet, despite its importance, little attention has been given to establishing statistical reliability guarantees for LLMs under CL, particularly in the setting of continual domain pretraining (CDP). Conformal Prediction (CP) has shown promise in offering correctness guarantees for LLMs, but it faces major challenges in CDP: testing data often stems from unknown or shifting domain distributions, under which CP may no longer provide valid guarantees. Moreover, when high coverage is required, CP can yield excessively large prediction sets for unanswerable queries, reducing informativeness. To address these challenges, we introduce an adaptive rejection and non-exchangeable CP framework. Our method first estimates the distribution of questions across domains in the test set using transformer-based clustering, then reweights or resamples the calibration data accordingly. Building on this, adaptive rejection CP allows the LLM to selectively abstain from answering when its confidence or competence shifts significantly. Extensive experiments demonstrate that our framework enhances both the effectiveness and reliability of CP under CDP scenarios. Our code is available at: https://anonymous.4open.science/r/CPCL-8C12/
♻ ☆ GraSS: Scalable Data Attribution with Gradient Sparsification and Sparse Projection NeurIPS 2025
Gradient-based data attribution methods, such as influence functions, are critical for understanding the impact of individual training samples without requiring repeated model retraining. However, their scalability is often limited by the high computational and memory costs associated with per-sample gradient computation. In this work, we propose GraSS, a novel gradient compression algorithm and its variants FactGraSS for linear layers specifically, that explicitly leverage the inherent sparsity of per-sample gradients to achieve sub-linear space and time complexity. Extensive experiments demonstrate the effectiveness of our approach, achieving substantial speedups while preserving data influence fidelity. In particular, FactGraSS achieves up to 165% faster throughput on billion-scale models compared to the previous state-of-the-art baselines. Our code is publicly available at https://github.com/TRAIS-Lab/GraSS.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ $β$-DQN: Improving Deep Q-Learning By Evolving the Behavior
While many sophisticated exploration methods have been proposed, their lack of generality and high computational cost often lead researchers to favor simpler methods like $\epsilon$-greedy. Motivated by this, we introduce $\beta$-DQN, a simple and efficient exploration method that augments the standard DQN with a behavior function $\beta$. This function estimates the probability that each action has been taken at each state. By leveraging $\beta$, we generate a population of diverse policies that balance exploration between state-action coverage and overestimation bias correction. An adaptive meta-controller is designed to select an effective policy for each episode, enabling flexible and explainable exploration. $\beta$-DQN is straightforward to implement and adds minimal computational overhead to the standard DQN. Experiments on both simple and challenging exploration domains show that $\beta$-DQN outperforms existing baseline methods across a wide range of tasks, providing an effective solution for improving exploration in deep reinforcement learning.
comment: aamas 2025
♻ ☆ TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, thus limiting the herd of available draft models and often necessitating the training of a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose the algorithm TokenTiming for universal speculative decoding. It operates by re-encoding the draft token sequence to get a new target token sequence, and then uses DTW to build a mapping to transfer the probability distributions for speculative sampling. Benefiting from this, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining and modification. We conduct comprehensive experiments on various tasks, demonstrating 1.57x speedup. This work enables a universal approach for draft model selection, making SD a more versatile and practical tool for LLM acceleration.
♻ ☆ Understanding AI Trustworthiness: A Scoping Review of AIES & FAccT Articles
Background: Trustworthy AI serves as a foundational pillar for two major AI ethics conferences: AIES and FAccT. However, current research often adopts techno-centric approaches, focusing primarily on technical attributes such as reliability, robustness, and fairness, while overlooking the sociotechnical dimensions critical to understanding AI trustworthiness in real-world contexts. Objectives: This scoping review aims to examine how the AIES and FAccT communities conceptualize, measure, and validate AI trustworthiness, identifying major gaps and opportunities for advancing a holistic understanding of trustworthy AI systems. Methods: We conduct a scoping review of AIES and FAccT conference proceedings to date, systematically analyzing how trustworthiness is defined, operationalized, and applied across different research domains. Our analysis focuses on conceptualization approaches, measurement methods, verification and validation techniques, application areas, and underlying values. Results: While significant progress has been made in defining technical attributes such as transparency, accountability, and robustness, our findings reveal critical gaps. Current research often predominantly emphasizes technical precision at the expense of social and ethical considerations. The sociotechnical nature of AI systems remains less explored and trustworthiness emerges as a contested concept shaped by those with the power to define it. Conclusions: An interdisciplinary approach combining technical rigor with social, cultural, and institutional considerations is essential for advancing trustworthy AI. We propose actionable measures for the AI ethics community to adopt holistic frameworks that genuinely address the complex interplay between AI systems and society, ultimately promoting responsible technological development that benefits all stakeholders.
comment: Submitted to Journal of Artificial Intelligence Research (JAIR)
♻ ☆ Group-in-Group Policy Optimization for LLM Agent Training NeurIPS 2025
Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to multi-turn LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) At the episode-level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step-level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on challenging agent benchmarks, including ALFWorld and WebShop, as well as tool-integrated reasoning on search-augmented QA tasks, using Qwen2.5-1.5B/3B/7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals, achieves performance gains of > 12% on ALFWorld and > 9% on WebShop over GRPO, and obtains superior performance on QA tasks (42.1% on 3B and 47.2% on 7B): all while maintaining the same GPU memory overhead, identical LLM rollout, and incurring little to no additional time cost.
comment: NeurIPS 2025
♻ ☆ Mirror Descent and Novel Exponentiated Gradient Algorithms Using Trace-Form Entropies and Deformed Logarithms
This paper introduces a broad class of Mirror Descent (MD) and Generalized Exponentiated Gradient (GEG) algorithms derived from trace-form entropies defined via deformed logarithms. Leveraging these generalized entropies yields MD \& GEG algorithms with improved convergence behavior, robustness to vanishing and exploding gradients, and inherent adaptability to non-Euclidean geometries through mirror maps. We establish deep connections between these methods and Amari's natural gradient, revealing a unified geometric foundation for additive, multiplicative, and natural gradient updates. Focusing on the Tsallis, Kaniadakis, Sharma--Taneja--Mittal, and Kaniadakis--Lissia--Scarfone entropy families, we show that each entropy induces a distinct Riemannian metric on the parameter space, leading to GEG algorithms that preserve the natural statistical geometry. The tunable parameters of deformed logarithms enable adaptive geometric selection, providing enhanced robustness and convergence over classical Euclidean optimization. Overall, our framework unifies key first-order MD optimization methods under a single information-geometric perspective based on generalized Bregman divergences, where the choice of entropy determines the underlying metric and dual geometric structure.
comment: 22 pages, 9 figures
♻ ☆ Multimodal Dreaming: A Global Workspace Approach to World Model-Based Reinforcement Learning
Humans leverage rich internal models of the world to reason about the future, imagine counterfactuals, and adapt flexibly to new situations. In Reinforcement Learning (RL), world models aim to capture how the environment evolves in response to the agent's actions, facilitating planning and generalization. However, typical world models directly operate on the environment variables (e.g. pixels, physical attributes), which can make their training slow and cumbersome; instead, it may be advantageous to rely on high-level latent dimensions that capture relevant multimodal variables. Global Workspace (GW) Theory offers a cognitive framework for multimodal integration and information broadcasting in the brain, and recent studies have begun to introduce efficient deep learning implementations of GW. Here, we evaluate the capabilities of an RL system combining GW with a world model. We compare our GW-Dreamer with various versions of the standard PPO and the original Dreamer algorithms. We show that performing the dreaming process (i.e., mental simulation) inside the GW latent space allows for training with fewer environment steps. As an additional emergent property, the resulting model (but not its comparison baselines) displays strong robustness to the absence of one of its observation modalities (images or simulation attributes). We conclude that the combination of GW with World Models holds great potential for improving decision-making in RL agents.
comment: Under review
♻ ☆ OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions
In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task designed to produce synchronized verbal and non-verbal listener feedback online, based on the speaker's multimodal inputs. OMCRG captures natural dyadic interactions and introduces new challenges in aligning generated audio with listeners' facial responses. To tackle these challenges, we incorporate text as an intermediate modality to connect audio and facial responses. We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. OmniResponse leverages a pretrained LLM enhanced with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module that outputs speech synchronized with facial responses. To advance OMCRG research, we offer ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors. Comprehensive evaluations on ResponseNet demonstrate that OmniResponse outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality. Our dataset, code, and models are publicly available.
comment: 25 pages, 9 figures
♻ ☆ BNMusic: Blending Environmental Noises into Personalized Music NeurIPS 2025
While being disturbed by environmental noises, the acoustic masking technique is a conventional way to reduce the annoyance in audio engineering that seeks to cover up the noises with other dominant yet less intrusive sounds. However, misalignment between the dominant sound and the noise-such as mismatched downbeats-often requires an excessive volume increase to achieve effective masking. Motivated by recent advances in cross-modal generation, in this work, we introduce an alternative method to acoustic masking, aiming to reduce the noticeability of environmental noises by blending them into personalized music generated based on user-provided text prompts. Following the paradigm of music generation using mel-spectrogram representations, we propose a Blending Noises into Personalized Music (BNMusic) framework with two key stages. The first stage synthesizes a complete piece of music in a mel-spectrogram representation that encapsulates the musical essence of the noise. In the second stage, we adaptively amplify the generated music segment to further reduce noise perception and enhance the blending effectiveness, while preserving auditory quality. Our experiments with comprehensive evaluations on MusicBench, EPIC-SOUNDS, and ESC-50 demonstrate the effectiveness of our framework, highlighting the ability to blend environmental noise with rhythmically aligned, adaptively amplified, and enjoyable music segments, minimizing the noticeability of the noise, thereby improving overall acoustic experiences. Project page: https://d-fas.github.io/BNMusic_page/.
comment: This paper has been accepted by NeurIPS 2025
♻ ☆ The Formalism-Implementation Gap in Reinforcement Learning Research
The last decade has seen an upswing in interest and adoption of reinforcement learning (RL) techniques, in large part due to its demonstrated capabilities at performing certain tasks at "super-human levels". This has incentivized the community to prioritize research that demonstrates RL agent performance, often at the expense of research aimed at understanding their learning dynamics. Performance-focused research runs the risk of overfitting on academic benchmarks -- thereby rendering them less useful -- which can make it difficult to transfer proposed techniques to novel problems. Further, it implicitly diminishes work that does not push the performance-frontier, but aims at improving our understanding of these techniques. This paper argues two points: (i) RL research should stop focusing solely on demonstrating agent capabilities, and focus more on advancing the science and understanding of reinforcement learning; and (ii) we need to be more precise on how our benchmarks map to the underlying mathematical formalisms. We use the popular Arcade Learning Environment (ALE; Bellemare et al., 2013) as an example of a benchmark that, despite being increasingly considered "saturated", can be effectively used for developing this understanding, and facilitating the deployment of RL techniques in impactful real-world problems.
♻ ☆ Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine
Recent studies operationalize self-improvement through coding agents that edit their own codebases. They grow a tree of self-modifications through expansion strategies that favor higher software engineering benchmark performance, assuming that this implies more promising subsequent self-modifications. However, we identify a mismatch between the agent's self-improvement potential (metaproductivity) and its coding benchmark performance, namely the Metaproductivity-Performance Mismatch. Inspired by Huxley's concept of clade, we propose a metric ($\mathrm{CMP}$) that aggregates the benchmark performances of the descendants of an agent as an indicator of its potential for self-improvement. We show that, in our self-improving coding agent development setting, access to the true $\mathrm{CMP}$ is sufficient to simulate how the G\"odel Machine would behave under certain assumptions. We introduce the Huxley-G\"odel Machine (HGM), which, by estimating $\mathrm{CMP}$ and using it as guidance, searches the tree of self-modifications. On SWE-bench Verified and Polyglot, HGM outperforms prior self-improving coding agent development methods while using less wall-clock time. Last but not least, HGM demonstrates strong transfer to other coding datasets and large language models. The agent optimized by HGM on SWE-bench Verified with GPT-5-mini and evaluated on SWE-bench Lite with GPT-5 achieves human-level performance, matching the best officially checked results of human-engineered coding agents. Our code is available at https://github.com/metauto-ai/HGM.
♻ ☆ The human-machine paradox: how collaboration creates or destroys value, and why augmentation is key to resolving it
When deploying artificial skills, managers widely assume that combining them with the human factor is a safe harbor, mitigating the risks of full automation in high-complexity tasks. This paper formally challenges the economic validity of this widespread assumption, arguing that the true bottom-line economic utility of a human-machine skill policy is dangerously misunderstood and highly contingent on situational and design factors. To investigate this gap, we develop an in-silico framework based on Monte Carlo simulations grounded in empirical pragmatism to quantify the economic impact of human and machine skills in the execution of tasks presenting varying levels of complexity. Our results show that a human-machine strategy can yield the highest economic utility in complex scenarios, but only if genuine augmentation is achieved. In contrast, when failing to realize this synergy, the human-machine approach can perform worse than either the machine-exclusive or the human-exclusive policy, actively destroying value under the pressure of costs that are not compensated by sufficient performance gains. The takeaway for decision-makers is unambiguous: when the context is complex and critical, simply allocating human and machine skills to a task may be insufficient, and far from being a silver-bullet solution or a low-risk compromise. Rather, it is a critical opportunity to boost competitiveness that demands a strong organizational commitment to enabling augmentation. Also, our findings show that improving the cost-effectiveness of machine skills over time, while useful, does not replace the fundamental need to focus on achieving augmentation.
comment: Research report (v4, 90+ pages) is now repackaged into a draft paper (v5, 18 pages); results unchanged; extended considerations; improved layout and readability
♻ ☆ Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
This paper presents a novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary models is achievable. Our study provides the first comprehensive analysis of these models' capabilities and limits for this task.
comment: Under review. Both the dataset and code will be published
♻ ☆ Evaluating the Use of Large Language Models as Synthetic Social Agents in Social Science Research
Large Language Models (LLMs) are being increasingly used as synthetic agents in social science, in applications ranging from augmenting survey responses to powering multi-agent simulations. This paper outlines cautions that should be taken when interpreting LLM outputs and proposes a pragmatic reframing for the social sciences in which LLMs are used as high-capacity pattern matchers for quasi-predictive interpolation under explicit scope conditions and not as substitutes for probabilistic inference. Practical guardrails such as independent draws, preregistered human baselines, reliability-aware validation, and subgroup calibration, are introduced so that researchers may engage in useful prototyping and forecasting while avoiding category errors.
♻ ☆ Cross-Scenario Unified Modeling of User Interests at Billion Scale
User interests on content platforms are inherently diverse, manifesting through complex behavioral patterns across heterogeneous scenarios such as search, feed browsing, and content discovery. Traditional recommendation systems typically prioritize business metric optimization within isolated specific scenarios, neglecting cross-scenario behavioral signals and struggling to integrate advanced techniques like LLMs at billion-scale deployments, which finally limits their ability to capture holistic user interests across platform touchpoints. We propose RED-Rec, an LLM-enhanced hierarchical Recommender Engine for Diversified scenarios, tailored for industry-level content recommendation systems. RED-Rec unifies user interest representations across multiple behavioral contexts by aggregating and synthesizing actions from varied scenarios, resulting in comprehensive item and user modeling. At its core, a two-tower LLM-powered framework enables nuanced, multifaceted representations with deployment efficiency, and a scenario-aware dense mixing and querying policy effectively fuses diverse behavioral signals to capture cross-scenario user intent patterns and express fine-grained, context-specific intents during serving. We validate RED-Rec through online A/B testing on hundreds of millions of users in RedNote through online A/B testing, showing substantial performance gains in both content recommendation and advertisement targeting tasks. We further introduce a million-scale sequential recommendation dataset, RED-MMU, for comprehensive offline training and evaluation. Our work advances unified user modeling, unlocking deeper personalization and fostering more meaningful user engagement in large-scale UGC platforms.
comment: https://github.com/ariesssxu/RedSeqRec
♻ ☆ Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm
IIn this paper we propose and investigate a new class of Generalized Exponentiated Gradient (GEG) algorithms using Mirror Descent (MD) updates, and applying the Bregman divergence with a two--parameter deformation of the logarithm as a link function. This link function (referred here to as the Euler logarithm) is associated with a relatively wide class of trace--form entropies. In order to derive novel GEG/MD updates, we estimate a deformed exponential function, which closely approximates the inverse of the Euler two--parameter deformed logarithm. The characteristic shape and properties of the Euler logarithm and its inverse--deformed exponential functions, are tuned by two hyperparameters. By learning these hyperparameters, we can adapt to the distribution of training data and adjust them to achieve desired properties of gradient descent algorithms. In the literature, there exist nowadays more than fifty mathematically well-established entropic functionals and associated deformed logarithms, so it is impossible to investigate all of them in one research paper. Therefore, we focus here on a class of trace-form entropies and the associated deformed two--parameters logarithms.
comment: 10 pages, preprint of Journal paper
♻ ☆ Diffusion Models Meet Contextual Bandits
Efficient online decision-making in contextual bandits is challenging, as methods without informative priors often suffer from computational or statistical inefficiencies. In this work, we leverage pre-trained diffusion models as expressive priors to capture complex action dependencies and develop a practical algorithm that efficiently approximates posteriors under such priors, enabling both fast updates and sampling. Empirical results demonstrate the effectiveness and versatility of our approach across diverse contextual bandit settings.
comment: Neurips 2025
♻ ☆ Provable Scaling Laws for the Test-Time Compute of Large Language Models NeurIPS 2025
We propose two simple, principled and practical algorithms that enjoy provable scaling laws for the test-time compute of large language models (LLMs). The first one is a two-stage knockout-style algorithm: given an input problem, it first generates multiple candidate solutions, and then aggregate them via a knockout tournament for the final output. Assuming that the LLM can generate a correct solution with non-zero probability and do better than a random guess in comparing a pair of correct and incorrect solutions, we prove theoretically that the failure probability of this algorithm decays to zero exponentially or by a power law (depending on the specific way of scaling) as its test-time compute grows. The second one is a two-stage league-style algorithm, where each candidate is evaluated by its average win rate against multiple opponents, rather than eliminated upon loss to a single opponent. Under analogous but more robust assumptions, we prove that its failure probability also decays to zero exponentially with more test-time compute. Both algorithms require a black-box LLM and nothing else (e.g., no verifier or reward model) for a minimalistic implementation, which makes them appealing for practical applications and easy to adapt for different tasks. Through extensive experiments with diverse models and datasets, we validate the proposed theories and demonstrate the outstanding scaling properties of both algorithms.
comment: NeurIPS 2025 camera-ready version
♻ ☆ Querying Inconsistent Prioritized Data with ORBITS: Algorithms, Implementation, and Experiments KR 2022
We investigate practical algorithms for inconsistency-tolerant query answering over prioritized knowledge bases, which consist of a logical theory, a set of facts, and a priority relation between conflicting facts. We consider three well-known semantics (AR, IAR and brave) based upon two notions of optimal repairs (Pareto and completion). Deciding whether a query answer holds under these semantics is (co)NP-complete in data complexity for a large class of logical theories, and SAT-based procedures have been devised for repair-based semantics when there is no priority relation, or the relation has a special structure. The present paper introduces the first SAT encodings for Pareto- and completion-optimal repairs w.r.t. general priority relations and proposes several ways of employing existing and new encodings to compute answers under (optimal) repair-based semantics, by exploiting different reasoning modes of SAT solvers. The comprehensive experimental evaluation of our implementation compares both (i) the impact of adopting semantics based on different kinds of repairs, and (ii) the relative performances of alternative procedures for the same semantics.
comment: This is an extended version of a paper appearing at the 19th International Conference on Principles of Knowledge Representation and Reasoning (KR 2022). 122 pages. This version gives an optimized version of the encodings for non-binary conflicts (appendix B.3)
♻ ☆ Mining Large Independent Sets on Massive Graphs
The Maximum Independent Set problem is fundamental for extracting conflict-free structure from large graphs, with applications in scheduling, recommendation, and network analysis. However, existing heuristics can stagnate when search schedules are fixed and information from past solutions is underused, leading to wasted effort in low-quality regions of the search space. We present ARCIS, an efficient algorithm for mining large independent sets on massive graphs. ARCIS couples two main components. The first is an adaptive restart policy that refreshes exploration when progress slows. The second is Consensus-Guided Vertex Fixing, which restricts the search to the non-consensus region of the graph by fixing vertices consistently observed within a round. The consensus is maintained as a running intersection within each round, and because it is recomputed at every restart, the fixing is reversible. Vertices that later lose support are automatically unfixed and their neighborhoods re-enter the working graph, which corrects occasional mistakes while preserving progress. Experiments on 222 graphs from four benchmark suites show that ARCIS attains the best or tied-best solution quality in most instances while delivering competitive runtime and low variability. Ablation studies isolate the impact of each component, indicating that ARCIS is a practical and robust method for large-scale graph mining.
comment: 17 pages, 3 figures
♻ ☆ Offline Learning and Forgetting for Reasoning with Large Language Models
Leveraging inference-time search in large language models has proven effective in further enhancing a trained model's capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it on unpaired successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. A key challenge we identify is that naive fine-tuning can degrade the model's search capability; we show this can be mitigated with a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown arithmetic puzzles show that, replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180$\times$. On top of this, our learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.
comment: Published in Transactions on Machine Learning Research (TMLR), 2025. Code: https://github.com/twni2016/llm-reasoning-uft
♻ ☆ Freeze and Conquer: Reusable Ansatz for Solving the Traveling Salesman Problem
In this paper we present a variational algorithm for the Traveling Salesman Problem (TSP) that combines (i) a compact encoding of permutations, which reduces the qubit requirement too, (ii) an optimize-freeze-reuse strategy: where the circuit topology (``Ansatz'') is first optimized on a training instance by Simulated Annealing (SA), then ``frozen'' and re-used on novel instances, limited to a rapid re-optimization of only the circuit parameters. This pipeline eliminates costly structural research in testing, making the procedure immediately implementable on NISQ hardware. On a set of $40$ randomly generated symmetric instances that span $4 - 7$ cities, the resulting Ansatz achieves an average optimal trip sampling probability of $100\%$ for 4 city cases, $90\%$ for 5 city cases and $80\%$ for 6 city cases. With 7 cities the success rate drops markedly to an average of $\sim 20\%$, revealing the onset of scalability limitations of the proposed method. The results show robust generalization ability for moderate problem sizes and indicate how freezing the Ansatz can dramatically reduce time-to-solution without degrading solution quality. The paper also discusses scalability limitations, the impact of ``warm-start'' initialization of parameters, and prospects for extension to more complex problems, such as Vehicle Routing and Job-Shop Scheduling.
♻ ☆ ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression NeurIPS 2025
The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
comment: Accepted at NeurIPS 2025 (oral)
♻ ☆ FRBNet: Revisiting Low-Light Vision through Frequency-Domain Radial Basis Network
Low-light vision remains a fundamental challenge in computer vision due to severe illumination degradation, which significantly affects the performance of downstream tasks such as detection and segmentation. While recent state-of-the-art methods have improved performance through invariant feature learning modules, they still fall short due to incomplete modeling of low-light conditions. Therefore, we revisit low-light image formation and extend the classical Lambertian model to better characterize low-light conditions. By shifting our analysis to the frequency domain, we theoretically prove that the frequency-domain channel ratio can be leveraged to extract illumination-invariant features via a structured filtering process. We then propose a novel and end-to-end trainable module named \textbf{F}requency-domain \textbf{R}adial \textbf{B}asis \textbf{Net}work (\textbf{FRBNet}), which integrates the frequency-domain channel ratio operation with a learnable frequency domain filter for the overall illumination-invariant feature enhancement. As a plug-and-play module, FRBNet can be integrated into existing networks for low-light downstream tasks without modifying loss functions. Extensive experiments across various downstream tasks demonstrate that FRBNet achieves superior performance, including +2.2 mAP for dark object detection and +2.9 mIoU for nighttime segmentation. Code is available at: https://github.com/Sing-Forevet/FRBNet.
♻ ☆ Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning
We introduce a simple, yet novel entropy-based framework to drive token efficiency in large language models during reasoning tasks. Our approach uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, achieving 25-50% computational savings while maintaining task accuracy. Crucially, we demonstrate that entropy-based confidence calibration represents an emergent property of advanced post-training optimization present in modern reasoning models but notably absent in standard instruction-tuned and pre-trained models (Llama 3.3 70B). We show that the entropy threshold to stop reasoning varies from model to model but can be calculated easily in one shot using only a few examples from existing reasoning datasets. Our results indicate that advanced reasoning models often know that they've gotten a correct answer early on, and that this emergent confidence awareness can be exploited to save tokens and reduce latency. The framework demonstrates consistent performance across reasoning-optimized model families with 25-50% computational cost reduction while preserving accuracy, revealing that confidence mechanisms represent a distinguishing characteristic of modern post-trained reasoning systems versus their predecessors.
♻ ☆ LittleBit: Ultra Low-Bit Quantization via Latent Factorization NeurIPS 2025
Deploying large language models (LLMs) often faces challenges from substantial memory and computational costs. Quantization offers a solution, yet performance degradation in the sub-1-bit regime remains particularly difficult. This paper introduces LittleBit, a novel method for extreme LLM compression. It targets levels like 0.1 bits per weight (BPW), achieving nearly 31$\times$ memory reduction, e.g., Llama2-13B to under 0.9 GB. LittleBit represents weights in a low-rank form using latent matrix factorization, subsequently binarizing these factors. To counteract information loss from this extreme precision, it integrates a multi-scale compensation mechanism. This includes row, column, and an additional latent dimension that learns per-rank importance. Two key contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and integrated Residual Compensation to mitigate errors. Extensive experiments confirm LittleBit's superiority in sub-1-bit quantization: e.g., its 0.1 BPW performance on Llama2-7B surpasses the leading method's 0.7 BPW. LittleBit establishes a new, viable size-performance trade-off--unlocking a potential 11.6$\times$ speedup over FP16 at the kernel level--and makes powerful LLMs practical for resource-constrained environments.
comment: Accepted to NeurIPS 2025. Banseok Lee and Dongkyu Kim contributed equally
♻ ☆ Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality
Benchmarks are essential for unified evaluation and reproducibility. The rapid rise of Artificial Intelligence for Software Engineering (AI4SE) has produced numerous benchmarks for tasks such as code generation and bug repair. However, this proliferation has led to major challenges: (1) fragmented knowledge across tasks, (2) difficulty in selecting contextually relevant benchmarks, (3) lack of standardization in benchmark creation, and (4) flaws that limit utility. Addressing these requires a dual approach: systematically mapping existing benchmarks for informed selection and defining unified guidelines for robust, adaptable benchmark development. We conduct a review of 247 studies, identifying 273 AI4SE benchmarks since 2014. We categorize them, analyze limitations, and expose gaps in current practices. Building on these insights, we introduce BenchScout, an extensible semantic search tool for locating suitable benchmarks. BenchScout employs automated clustering with contextual embeddings of benchmark-related studies, followed by dimensionality reduction. In a user study with 22 participants, BenchScout achieved usability, effectiveness, and intuitiveness scores of 4.5, 4.0, and 4.1 out of 5. To improve benchmarking standards, we propose BenchFrame, a unified framework for enhancing benchmark quality. Applying BenchFrame to HumanEval yielded HumanEvalNext, featuring corrected errors, improved language conversion, higher test coverage, and greater difficulty. Evaluating 10 state-of-the-art code models on HumanEval, HumanEvalPlus, and HumanEvalNext revealed average pass-at-1 drops of 31.22% and 19.94%, respectively, underscoring the need for continuous benchmark refinement. We further examine BenchFrame's scalability through an agentic pipeline and confirm its generalizability on the MBPP dataset. All review data, user study materials, and enhanced benchmarks are publicly released.
♻ ☆ The Confidence Paradox: Can LLM Know When It's Wrong AACL 2025
Document Visual Question Answering (DocVQA) models often produce overconfident or ethically misaligned responses, especially under uncertainty. Existing models like LayoutLMv3, UDOP, and DONUT focus on accuracy but lack ethical calibration. We propose HonestVQA, a model-agnostic, self-supervised framework that aligns model confidence with correctness using weighted loss and contrastive learning. We introduce two new metrics Honesty Score (H-Score) and Ethical Confidence Index (ECI)-to evaluate ethical alignment. HonestVQA improves accuracy and F1 by up to 4.3% across SpDocVQA, InfographicsVQA, and SROIE datasets, while reducing overconfidence. It also generalizes well across domains, achieving 78.9% accuracy and 76.1% F1-score.
comment: Accepted at the 14th IJCNLP & 4th AACL 2025 (Main)
♻ ☆ Thermometry of simulated Bose--Einstein condensates using machine learning
Precise determination of thermodynamic parameters in ultracold Bose gases remains challenging due to the destructive nature of conventional measurement techniques and inherent experimental uncertainties. We demonstrate a machine learning approach for rapid, non-destructive estimation of the chemical potential and temperature from a single image of an \emph{in situ} imaged density profiles of finite-temperature Bose gases. Our convolutional neural network is trained exclusively on quasi-2D `pancake' condensates in harmonic trap configurations. It achieves parameter extraction within fractions of a second. The model also demonstrates {some} zero-shot generalisation across both trap geometry and thermalisation dynamics, successfully estimating the temperature (although not the chemical potential) for toroidally trapped condensates with errors of only a few nanokelvin despite no prior exposure to such geometries during training, and maintaining predictive accuracy during dynamic thermalisation processes after a relatively brief evolution without explicit training on non-equilibrium states. These results suggest that supervised learning can overcome traditional limitations in ultracold atom thermometry, with extension to broader geometric configurations, temperature ranges, and additional parameters potentially enabling comprehensive real-time analysis of quantum gas experiments. Such capabilities could significantly streamline experimental workflows whilst improving measurement precision across a range of quantum fluid systems.
♻ ☆ On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
In Vision-Language-Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find (1) actions as the most fragile modality, (2) Existing visual-robust VLA do not gain robustness in other modality, and (3) pi0 demonstrates superior robustness with a diffusion-based action head. To build multi-modal robust VLAs, we propose RobustVLA against perturbations in VLA inputs and outputs. For output robustness, we perform offline robust optimization against worst-case action noise that maximizes mismatch in flow matching objective. This can be seen as adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper confidence bound algorithm to automatically identify the most harmful noise. Experiments on LIBERO demonstrate our RobustVLA delivers absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone across all 17 perturbations, achieving 50.6x faster inference than existing visual-robust VLAs, and a 10.4% gain under mixed perturbations. Our RobustVLA is particularly effective on real-world FR5 robot with limited demonstrations, showing absolute gains by 65.6% under perturbations of four modalities.
♻ ☆ Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach
While several high profile video games have served as testbeds for Deep Reinforcement Learning (DRL), this technique has rarely been employed by the game industry for crafting authentic AI behaviors. Previous research focuses on training super-human agents with large models, which is impractical for game studios with limited resources aiming for human-like agents. This paper proposes a sample-efficient DRL method tailored for training and fine-tuning agents in industrial settings such as the video game industry. Our method improves sample efficiency of value-based DRL by leveraging pre-collected data and increasing network plasticity. We evaluate our method training a goalkeeper agent in EA SPORTS FC 25, one of the best-selling football simulations today. Our agent outperforms the game's built-in AI by 10% in ball saving rate. Ablation studies show that our method trains agents 50% faster compared to standard DRL methods. Finally, qualitative evaluation from domain experts indicates that our approach creates more human-like gameplay compared to hand-crafted agents. As a testimony of the impact of the approach, the method is intended to replace the hand-crafted counterpart in next iterations of the series.
♻ ☆ The Logical Expressiveness of Temporal GNNs via Two-Dimensional Product Logics
In recent years, the expressive power of various neural architectures -- including graph neural networks (GNNs), transformers, and recurrent neural networks -- has been characterised using tools from logic and formal language theory. As the capabilities of basic architectures are becoming well understood, increasing attention is turning to models that combine multiple architectural paradigms. Among them particularly important, and challenging to analyse, are temporal extensions of GNNs, which integrate both spatial (graph-structure) and temporal (evolution over time) dimensions. In this paper, we initiate the study of logical characterisation of temporal GNNs by connecting them to two-dimensional product logics. We show that the expressive power of temporal GNNs depends on how graph and temporal components are combined. In particular, temporal GNNs that apply static GNNs recursively over time can capture all properties definable in the product logic of (past) propositional temporal logic PTL and the modal logic K. In contrast, architectures such as graph-and-time TGNNs and global TGNNs can only express restricted fragments of this logic, where the interaction between temporal and spatial operators is syntactically constrained. These provide us with the first results on the logical expressiveness of temporal GNNs.
♻ ☆ MixAT: Combining Continuous and Discrete Adversarial Training for LLMs NeurIPS 2025
Despite recent efforts in Large Language Model (LLM) safety and alignment, current adversarial attacks on frontier LLMs can still consistently force harmful generations. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. At the same time, despite their effectiveness and generalization capabilities, training with continuous perturbations does not always capture the full spectrum of vulnerabilities exploited by discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at https://github.com/insait-institute/MixAT.
comment: Published at 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment NeurIPS 2025
How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through the overlaying of multiple model variants quantized to different bitwidths. Meanwhile, an important question still remains open-ended: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer dynamically changes across decoding steps. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.
comment: NeurIPS 2025
♻ ☆ Is It Certainly a Deepfake? Reliability Analysis in Detection & Generation Ecosystem ICCV 2025
As generative models are advancing in quality and quantity for creating synthetic content, deepfakes begin to cause online mistrust. Deepfake detectors are proposed to counter this effect, however, misuse of detectors claiming fake content as real or vice versa further fuels this misinformation problem. We present the first comprehensive uncertainty analysis of deepfake detectors, systematically investigating how generative artifacts influence prediction confidence. As reflected in detectors' responses, deepfake generators also contribute to this uncertainty as their generative residues vary, so we cross the uncertainty analysis of deepfake detectors and generators. Based on our observations, the uncertainty manifold holds enough consistent information to leverage uncertainty for deepfake source detection. Our approach leverages Bayesian Neural Networks and Monte Carlo dropout to quantify both aleatoric and epistemic uncertainties across diverse detector architectures. We evaluate uncertainty on two datasets with nine generators, with four blind and two biological detectors, compare different uncertainty methods, explore region- and pixel-based uncertainty, and conduct ablation studies. We conduct and analyze binary real/fake, multi-class real/fake, source detection, and leave-one-out experiments between the generator/detector combinations to share their generalization capability, model calibration, uncertainty, and robustness against adversarial attacks. We further introduce uncertainty maps that localize prediction confidence at the pixel level, revealing distinct patterns correlated with generator-specific artifacts. Our analysis provides critical insights for deploying reliable deepfake detection systems and establishes uncertainty quantification as a fundamental requirement for trustworthy synthetic media detection.
comment: Accepted for publication at the ICCV 2025 workshop - STREAM
♻ ☆ Accelerate Scaling of LLM Finetuning via Quantifying the Coverage and Depth of Instruction Set
Scaling the amount of data used for supervied fine-tuning(SFT) does not guarantee the proportional gains in model performance, highlighting a critical need to understand what makes training samples effective. This work identifies two fundamental dataset properties that govern SFT scalability: \textbf{semantic coverage}, or the breadth of task domains, and \textbf{information depth}, or the richness of individual examples. We demonstrate that simple proxies for these properties explain the majority of validation loss variance in our experiments. In this work, we further propose the \textbf{Information Landscape Approximation (ILA)}, a model-agnostic data selection framework that jointly optimizes for these two factors. ILA constructs compact subsets that approximate the informational value of large datasets. Empirical results show that models tuned on ILA-selected data achieve faster and more sustained performance improvements across diverse tasks and model sizes compared to existing methods, a phenomenon we term \textbf{accelerated scaling}.
♻ ☆ Do Language Models Use Their Depth Efficiently? NeurIPS 2025
Modern LLMs are increasingly deep, and depth correlates with performance, albeit with diminishing returns. However, do these models use their depth efficiently? Do they compose more features to create higher-order computations that are impossible in shallow models, or do they merely spread the same kinds of computation out over more layers? To address these questions, we analyze the residual stream of the Llama 3.1, Qwen 3, and OLMo 2 family of models. We find: First, comparing the output of the sublayers to the residual stream reveals that layers in the second half contribute much less than those in the first half, with a clear phase transition between the two halves. Second, skipping layers in the second half has a much smaller effect on future computations and output predictions. Third, for multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults in examples involving many hops. Fourth, we seek to directly address whether deeper models are using their additional layers to perform new kinds of computation. To do this, we train linear maps from the residual stream of a shallow model to a deeper one. We find that layers with the same relative depth map best to each other, suggesting that the larger model simply spreads the same computations out over its many layers. All this evidence suggests that deeper models are not using their depth to learn new kinds of computation, but only using the greater depth to perform more fine-grained adjustments to the residual. This may help explain why increasing scale leads to diminishing returns for stacked Transformer architectures.
comment: Accepted to NeurIPS 2025
♻ ☆ A Generalized Label Shift Perspective for Cross-Domain Gaze Estimation NeurIPS 2025
Aiming to generalize the well-trained gaze estimation model to new target domains, Cross-domain Gaze Estimation (CDGE) is developed for real-world application scenarios. Existing CDGE methods typically extract the domain-invariant features to mitigate domain shift in feature space, which is proved insufficient by Generalized Label Shift (GLS) theory. In this paper, we introduce a novel GLS perspective to CDGE and modelize the cross-domain problem by label and conditional shift problem. A GLS correction framework is presented and a feasible realization is proposed, in which a importance reweighting strategy based on truncated Gaussian distribution is introduced to overcome the continuity challenges in label shift correction. To embed the reweighted source distribution to conditional invariant learning, we further derive a probability-aware estimation of conditional operator discrepancy. Extensive experiments on standard CDGE tasks with different backbone models validate the superior generalization capability across domain and applicability on various models of proposed method.
comment: NeurIPS 2025
♻ ☆ Scalable Exploration via Ensemble++ NeurIPS 2025
Thompson Sampling is a principled method for balancing exploration and exploitation, but its real-world adoption faces computational challenges in large-scale or non-conjugate settings. While ensemble-based approaches offer partial remedies, they typically require prohibitively large ensemble sizes. We propose Ensemble++, a scalable exploration framework using a novel shared-factor ensemble architecture with random linear combinations. For linear bandits, we provide theoretical guarantees showing that Ensemble++ achieves regret comparable to exact Thompson Sampling with only $\Theta(d \log T)$ ensemble sizes--significantly outperforming prior methods. Crucially, this efficiency holds across both compact and finite action sets with either time-invariant or time-varying contexts without configuration changes. We extend this theoretical foundation to nonlinear rewards by replacing fixed features with learnable neural representations while preserving the same incremental update principle, effectively bridging theory and practice for real-world tasks. Comprehensive experiments across linear, quadratic, neural, and GPT-based contextual bandits validate our theoretical findings and demonstrate Ensemble++'s superior regret-computation tradeoff versus state-of-the-art methods.
comment: NeurIPS 2025
♻ ☆ FragFM: Hierarchical Framework for Efficient Molecule Generation via Fragment-Level Discrete Flow Matching
We introduce FragFM, a novel hierarchical framework via fragment-level discrete flow matching for efficient molecular graph generation. FragFM generates molecules at the fragment level, leveraging a coarse-to-fine autoencoder to reconstruct details at the atom level. Together with a stochastic fragment bag strategy to effectively handle an extensive fragment space, our framework enables more efficient and scalable molecular generation. We demonstrate that our fragment-based approach achieves better property control than the atom-based method and additional flexibility through conditioning the fragment bag. We also propose a Natural Product Generation benchmark (NPGen) to evaluate modern molecular graph generative models' ability to generate natural product-like molecules. Since natural products are biologically prevalidated and differ from typical drug-like molecules, our benchmark provides a more challenging yet meaningful evaluation relevant to drug discovery. We conduct a FragFM comparative study against various models on diverse molecular generation benchmarks, including NPGen, demonstrating superior performance. The results highlight the potential of fragment-based generative modeling for large-scale, property-aware molecular design, paving the way for more efficient exploration of chemical space.
comment: 49 pages, 29 figures, under review
♻ ☆ PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning NeurIPS 2025
In recent years, unlearning techniques, which are methods for inducing a model to "forget" previously learned information, have attracted attention as a way to address privacy and copyright concerns in large language models (LLMs) and large multimodal models (LMMs). While several unlearning benchmarks have been established for LLMs, a practical evaluation framework for unlearning in LMMs has been less explored. Specifically, existing unlearning benchmark for LMMs considers only scenarios in which the model is required to unlearn fine-tuned knowledge through a single unlearning operation. In this study, we introduce PULSE protocol for realistic unlearning scenarios for LMMs by introducing two critical perspectives: (i) Pre-trained knowledge Unlearning for analyzing the effect across different knowledge acquisition phases and (ii) Long-term Sustainability Evaluation to address sequential requests. We then evaluate existing unlearning methods along these dimensions. Our results reveal that, although some techniques can successfully unlearn knowledge acquired through fine-tuning, they struggle to eliminate information learned during pre-training. Moreover, methods that effectively unlearn a batch of target data in a single operation exhibit substantial performance degradation when the same data are split and unlearned sequentially.
comment: Accepted at NeurIPS 2025 Workshop: Evaluating the Evolving LLM Lifecycle
♻ ☆ Untargeted Jailbreak Attack
Existing gradient-based jailbreak attacks on Large Language Models (LLMs), such as Greedy Coordinate Gradient (GCG) and COLD-Attack, typically optimize adversarial suffixes to align the LLM output with a predefined target response. However, by restricting the optimization objective as inducing a predefined target, these methods inherently constrain the adversarial search space, which limit their overall attack efficacy. Furthermore, existing methods typically require a large number of optimization iterations to fulfill the large gap between the fixed target and the original model response, resulting in low attack efficiency. To overcome the limitations of targeted jailbreak attacks, we propose the first gradient-based untargeted jailbreak attack (UJA), aiming to elicit an unsafe response without enforcing any predefined patterns. Specifically, we formulate an untargeted attack objective to maximize the unsafety probability of the LLM response, which can be quantified using a judge model. Since the objective is non-differentiable, we further decompose it into two differentiable sub-objectives for optimizing an optimal harmful response and the corresponding adversarial prompt, with a theoretical analysis to validate the decomposition. In contrast to targeted jailbreak attacks, UJA's unrestricted objective significantly expands the search space, enabling a more flexible and efficient exploration of LLM vulnerabilities.Extensive evaluations demonstrate that UJA can achieve over 80% attack success rates against recent safety-aligned LLMs with only 100 optimization iterations, outperforming the state-of-the-art gradient-based attacks such as I-GCG and COLD-Attack by over 20%.
♻ ☆ Co-TAP: Three-Layer Agent Interaction Protocol Technical Report
This paper proposes Co-TAP (T: Triple, A: Agent, P: Protocol), a three-layer agent interaction protocol designed to address the challenges faced by multi-agent systems across the three core dimensions of Interoperability, Interaction and Collaboration, and Knowledge Sharing. We have designed and proposed a layered solution composed of three core protocols: the Human-Agent Interaction Protocol (HAI), the Unified Agent Protocol (UAP), and the Memory-Extraction-Knowledge Protocol (MEK). HAI focuses on the interaction layer, standardizing the flow of information between users, interfaces, and agents by defining a standardized, event-driven communication paradigm. This ensures the real-time performance, reliability, and synergy of interactions. As the core of the infrastructure layer, UAP is designed to break down communication barriers among heterogeneous agents through unified service discovery and protocol conversion mechanisms, thereby enabling seamless interconnection and interoperability of the underlying network. MEK, in turn, operates at the cognitive layer. By establishing a standardized ''Memory (M) - Extraction (E) - Knowledge (K)'' cognitive chain, it empowers agents with the ability to learn from individual experiences and form shareable knowledge, thereby laying the foundation for the realization of true collective intelligence. We believe this protocol framework will provide a solid engineering foundation and theoretical guidance for building the next generation of efficient, scalable, and intelligent multi-agent applications.
♻ ☆ VIRAL: Vision-grounded Integration for Reward design And Learning
The alignment between humans and machines is a critical challenge in artificial intelligence today. Reinforcement learning, which aims to maximize a reward function, is particularly vulnerable to the risks associated with poorly designed reward functions. Recent advancements has shown that Large Language Models (LLMs) for reward generation can outperform human performance in this context. We introduce VIRAL, a pipeline for generating and refining reward functions through the use of multi-modal LLMs. VIRAL autonomously creates and interactively improves reward functions based on a given environment and a goal prompt or annotated image. The refinement process can incorporate human feedback or be guided by a description generated by a video LLM, which explains the agent's policy in video form. We evaluated VIRAL in five Gymnasium environments, demonstrating that it accelerates the learning of new behaviors while ensuring improved alignment with user intent. The source-code and demo video are available at: https://github.com/VIRAL-UCBL1/VIRAL and https://youtu.be/Hqo82CxVT38.
♻ ☆ Context-level Language Modeling by Learning Predictive Context Embeddings
Next-token prediction (NTP) is the cornerstone of modern large language models (LLMs) pretraining, driving their unprecedented capabilities in text generation, reasoning, and instruction following. However, the token-level prediction limits the model's capacity to capture higher-level semantic structures and long-range contextual relationships. To overcome this limitation, we introduce \textbf{ContextLM}, a framework that augments standard pretraining with an inherent \textbf{next-context prediction} objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Crucially, ContextLM achieves this enhancement while remaining fully compatible with the standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity). Extensive experiments on the GPT2 and Pythia model families, scaled up to $1.5$B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance. Our analysis indicates that next-context prediction provides a scalable and efficient pathway to stronger language modeling, yielding better long-range coherence and more effective attention allocation with minimal computational overhead.
comment: 16pages,6 figures
♻ ☆ CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
The exponential growth in demand for GPU computing resources has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization that employs a novel contrastive RL algorithm. CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on A100, it delivers an average speedup of x3.12 with a median speedup of x1.42 against default baselines over across all 250 CUDA kernels of KernelBench, with peak speedups reaching x120. In addition to the default baseline provided by KernelBench, CUDA-L1 demonstrates x2.77 over Torch Compile, x2.88 over Torch Compile with reduce overhead, x2.81 over CUDA Graph implementations, and remarkably x7.72 over cuDNN libraries. Furthermore, the model also demonstrates portability across different GPU architectures. Beyond these benchmark results, CUDA-L1 demonstrates several properties: it 1) discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) uncovers fundamental principles of CUDA optimization, such as the multiplicative nature of optimizations; 3) identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that actually harm performance. The capabilities demonstrate that, RL can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources.
comment: Project Page: https://deepreinforce-ai.github.io/cudal1_blog/
♻ ☆ Mixture-of-Experts Meets In-Context Reinforcement Learning
In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in fully harnessing in-context learning within RL domains: the intrinsic multi-modality of the state-action-reward data and the diverse, heterogeneous nature of decision tasks. To tackle these challenges, we propose T2MIR (Token- and Task-wise MoE for In-context RL), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models. T2MIR substitutes the feedforward layer with two parallel layers: a token-wise MoE that captures distinct semantics of input tokens across multiple modalities, and a task-wise MoE that routes diverse tasks to specialized experts for managing a broad task distribution with alleviated gradient conflicts. To enhance task-wise routing, we introduce a contrastive learning method that maximizes the mutual information between the task and its router representation, enabling more precise capture of task-relevant information. The outputs of two MoE components are concatenated and fed into the next layer. Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines. We bring the potential and promise of MoE to ICRL, offering a simple and scalable architectural enhancement to advance ICRL one step closer toward achievements in language and vision communities. Our code is available at https://github.com/NJU-RL/T2MIR.
comment: 28 pages, 13 figures, 17 tables
♻ ☆ GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning ACM MM 2025
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multi-modal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model outputs. To address these limitations, this work first proposes a Region-Aware Multimodal Chain-of-Thought (RMCoT) dataset, in which the process of producing an answer is preceded by a sequence of intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposal. The dataset is available at https://www.med-vqa.com/GEMeX/.
comment: Accepted at ACM MM 2025 (also known as GEMeX-ThinkVG)
♻ ☆ A High-Dimensional Statistical Method for Optimizing Transfer Quantities in Multi-Source Transfer Learning NeurIPS 2025
Multi-source transfer learning provides an effective solution to data scarcity in real- world supervised learning scenarios by leveraging multiple source tasks. In this field, existing works typically use all available samples from sources in training, which constrains their training efficiency and may lead to suboptimal results. To address this, we propose a theoretical framework that answers the question: what is the optimal quantity of source samples needed from each source task to jointly train the target model? Specifically, we introduce a generalization error measure based on K-L divergence, and minimize it based on high-dimensional statistical analysis to determine the optimal transfer quantity for each source task. Additionally, we develop an architecture-agnostic and data-efficient algorithm OTQMS to implement our theoretical results for target model training in multi- source transfer learning. Experimental studies on diverse architectures and two real-world benchmark datasets show that our proposed algorithm significantly outperforms state-of-the-art approaches in both accuracy and data efficiency. The code and supplementary materials are available in https://github.com/zqy0126/OTQMS.
comment: NeurIPS 2025 Poster
♻ ☆ PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models
Post-training quantization (PTQ) of large language models (LLMs) to extremely low bit-widths remains challenging due to the fundamental trade-off between computational efficiency and model expressiveness. While existing ultra-low-bit PTQ methods rely on binary approximations or complex compensation mechanisms, they suffer from either limited representational capacity or computational overhead that undermines their efficiency gains. We introduce PTQ to Trit-Planes (PTQTP), the first ternary-weight PTQ framework that decomposes weight matrices into structured ternary {-1, 0, 1} trit-planes using 2x1.58-bit representation. PTQTP achieves multiplication-free inference, identical to 1-bit quantization, while maintaining superior expressiveness through its novel structured decomposition. Our approach provides: (1) a theoretically grounded progressive approximation algorithm ensuring global weight consistency; (2) model-agnostic deployment across diverse modern LLMs without architectural modifications; and (3) uniform ternary operations that eliminate the need for mixed-precision or compensation schemes. Comprehensive experiments across LLaMA3.x and Qwen3 model families (0.6B-70B parameters) demonstrate that PTQTP significantly outperforms existing low-bit PTQ methods, achieving 82.4% mathematical reasoning retention versus 0% for competing approaches. PTQTP approaches and sometimes surpasses 1.58-bit quantization-aware training performance while requiring only single-hour quantization compared to 10-14 GPU days for training-based methods. These results establish PTQTP as a practical solution for efficient LLM deployment in resource-constrained environments. The code will be available at https://github.com/HeXiao-55/PTQTP.
comment: under review
♻ ☆ MathBode: Understanding LLM Reasoning with Dynamical Systems
This paper presents MathBode, a dynamic diagnostic for mathematical reasoning in large language models (LLMs). Instead of one-shot accuracy, MathBode treats each parametric problem as a system: we drive a single parameter sinusoidally and fit first-harmonic responses of model outputs and exact solutions. This yields interpretable, frequency-resolved metrics -- gain (amplitude tracking) and phase (lag) -- that form Bode-style fingerprints. Across five closed-form families (linear solve, ratio/saturation, compound interest, 2x2 linear systems, similar triangles), the diagnostic surfaces systematic low-pass behavior and growing phase lag that accuracy alone obscures. We compare several models against a symbolic baseline that calibrates the instrument ($G \approx 1$, $\phi \approx 0$). Results separate frontier from mid-tier models on dynamics, providing a compact, reproducible protocol that complements standard benchmarks with actionable measurements of reasoning fidelity and consistency. We open-source the dataset and code to enable further research and adoption.
♻ ☆ Multimodal 3D Genome Pre-training NeurIPS 2025
Deep learning techniques have driven significant progress in various analytical tasks within 3D genomics in computational biology. However, a holistic understanding of 3D genomics knowledge remains underexplored. Here, we propose MIX-HIC, the first multimodal foundation model of 3D genome that integrates both 3D genome structure and epigenomic tracks, which obtains unified and comprehensive semantics. For accurate heterogeneous semantic fusion, we design the cross-modal interaction and mapping blocks for robust unified representation, yielding the accurate aggregation of 3D genome knowledge. Besides, we introduce the first large-scale dataset comprising over 1 million pairwise samples of Hi-C contact maps and epigenomic tracks for high-quality pre-training, enabling the exploration of functional implications in 3D genomics. Extensive experiments show that MIX-HIC can significantly surpass existing state-of-the-art methods in diverse downstream tasks. This work provides a valuable resource for advancing 3D genomics research.
comment: NeurIPS 2025
♻ ☆ MENTOR: A Reinforcement Learning Framework for Enabling Tool Use in Small Models via Teacher-Optimized Rewards
Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor generalization as it trains models to imitate a static set of teacher trajectories rather than learn a robust methodology. While reinforcement learning (RL) offers an alternative, the standard RL using sparse rewards fails to effectively guide SLMs, causing them to struggle with inefficient exploration and adopt suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to solve the problem of reward sparsity, it uses a teacher's reference trajectory to construct a dense, composite teacher-guided reward that provides fine-grained guidance. Extensive experiments demonstrate that MENTOR significantly improves the cross-domain generalization and strategic competence of SLMs compared to both SFT and standard sparse-reward RL baselines.
♻ ☆ PEARL: Peer-Enhanced Adaptive Radio via On-Device LLM
We present PEARL (Peer-Enhanced Adaptive Radio via On-Device LLM), a framework for cooperative cross-layer optimization in device-to-device (D2D) communication. Building on our previous work on single-device on-device LLMs, PEARL extends the paradigm by leveraging both publisher and subscriber states to guide Wi-Fi Aware (WA) parameter selection. A context-aware reward, which normalizes latency by application tolerances and modulates energy by device battery states, provides richer supervision for KL-based finetuning. We study two lightweight variants: PEARL (Head + Low-Rank Adaptation (LoRA)) achieves the best overall performance, while PEARL-Lite (Head-only) delivers sub-20 ms inference at near-identical objective scores. Across synthetic scenarios grounded in real measurements, PEARL improves objective scores over heuristic and compact model baselines and reduces energy by up to 16% in cooperative low-battery cases. These results demonstrate that peer-aware context, reward-aligned training, and head-based efficiency make LLMs practical for always-on, on-device cross-layer control. Code, real-world demo, and dataset are available at https://github.com/abman23/pearl
♻ ☆ TrajAgent: An LLM-Agent Framework for Trajectory Modeling via Large-and-Small Model Collaboration NeurIPS 2025
Trajectory modeling, which includes research on trajectory data pattern mining and future prediction, has widespread applications in areas such as life services, urban transportation, and public administration. Numerous methods have been proposed to address specific problems within trajectory modeling. However, the heterogeneity of data and the diversity of trajectory tasks make effective and reliable trajectory modeling an important yet highly challenging endeavor, even for domain experts. In this paper, we propose TrajAgent, an agent framework powered by large language models, designed to facilitate robust and efficient trajectory modeling through automation modeling. This framework leverages and optimizes diverse specialized models to address various trajectory modeling tasks across different datasets effectively. In TrajAgent, we first develop UniEnv, an execution environment with a unified data and model interface, to support the execution and training of various models. Building on UniEnv, we introduce an agentic workflow designed for automatic trajectory modeling across various trajectory tasks and data. Furthermore, we introduce collaborative learning schema between LLM-based agents and small speciallized models, to enhance the performance of the whole framework effectively. Extensive experiments on five tasks using four real-world datasets demonstrate the effectiveness of TrajAgent in automated trajectory modeling, achieving a performance improvement of 2.38%-69.91% over baseline methods. The codes and data can be accessed via https://github.com/tsinghua-fib-lab/TrajAgent.
comment: Accepted by NeurIPS 2025, https://github.com/tsinghua-fib-lab/TrajAgent
♻ ☆ MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems
Scaling up data, parameters, and test-time computation has been the mainstream methods to improve LLM systems (LLMsys), but their upper bounds are almost reached due to the gradual depletion of high-quality data and marginal gains obtained from larger computational resource consumption. Inspired by the abilities of human and traditional AI systems in learning from practice, constructing memory and continual learning frameworks for LLMsys has become an important and popular research direction in recent literature. Yet, existing benchmarks for LLM memory often focus on evaluating the system on homogeneous reading comprehension tasks with long-form inputs rather than testing their abilities to learn from accumulated user feedback in service time. Therefore, we propose a user feedback simulation framework and a comprehensive benchmark covering multiple domains, languages, and types of tasks to evaluate the continual learning abilities of LLMsys. Experiments show that the effectiveness and efficiency of state-of-the-art baselines are far from satisfying, and we hope this benchmark could pave the way for future studies on LLM memory and optimization algorithms.
♻ ☆ Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay NeurIPS 2025
Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism inspired by experience replay in traditional RL. This technique reuses recent rollouts, lowering per-step computation while maintaining stable updates. Experiments across 6 LLM-dataset combinations show that our method reduces RL fine-tuning time by 23% to 62% while reaching the same level of performance as the original GRPO algorithm. Our code is available at https://github.com/ASTRAL-Group/data-efficient-llm-rl.
comment: Accepted at NeurIPS 2025
♻ ☆ Multi-Agent Evolve: LLM Self-Improve through Co-evolution
Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs). However, the success of RL for LLMs heavily relies on human-curated datasets and verifiable rewards, which limit their scalability and generality. Recent Self-Play RL methods, inspired by the success of the paradigm in games and Go, aim to enhance LLM reasoning capabilities without human-annotated data. However, their methods primarily depend on a grounded environment for feedback (e.g., a Python interpreter or a game engine); extending them to general domains remains challenging. To address these challenges, we propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A. The core design of MAE is based on a triplet of interacting agents (Proposer, Solver, Judge) that are instantiated from a single LLM, and applies reinforcement learning to optimize their behaviors. The Proposer generates questions, the Solver attempts solutions, and the Judge evaluates both while co-evolving. Experiments on Qwen2.5-3B-Instruct demonstrate that MAE achieves an average improvement of 4.54% on multiple benchmarks. These results highlight MAE as a scalable, data-efficient method for enhancing the general reasoning abilities of LLMs with minimal reliance on human-curated supervision.
comment: 29 pages, 4 figures
♻ ☆ Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?
Reinforcement learning has been central to recent advances in large language model reasoning, but most algorithms rely on on-policy training that demands fresh rollouts at every update, limiting efficiency and scalability. Asynchronous RL systems alleviate this by decoupling rollout generation from training, yet their effectiveness hinges on tolerating large staleness in rollout data, a setting where existing methods either degrade in performance or collapse. We revisit this challenge and uncover a prosperity-before-collapse phenomenon: stale data can be as informative as on-policy data if exploited properly. Building on this insight, we introduce M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of importance weights to suppress only extreme outliers while preserving informative updates. Notably, M2PO sharply reduces the fraction of clipped tokens under high staleness (from 1.22% to 0.06% over training), precisely masking high-variance tokens while maintaining stable optimization. Extensive evaluation across six models (from 1.7B to 32B) and eight benchmarks shows that M2PO delivers stable off-policy training even with data stale by at least 256 model updates and matches on-policy performance.
♻ ☆ ReCode: Unify Plan and Action for Universal Granularity Control
Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-level form of action. However, current Large Language Model (LLM)-based agents lack this crucial capability to operate fluidly across decision granularities. This limitation stems from existing paradigms that enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization. We propose ReCode (Recursive Code Generation), a novel paradigm that addresses this limitation by unifying planning and action within a single code representation. In this representation, ReCode treats high-level plans as abstract placeholder functions, which the agent then recursively decomposes into finer-grained sub-functions until reaching primitive actions. This recursive approach dissolves the rigid boundary between plan and action, enabling the agent to dynamically control its decision granularity. Furthermore, the recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes. Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training, validating our core insight that unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control. The code is available at https://github.com/FoundationAgents/ReCode.
♻ ☆ TraceTrans: Translation and Spatial Tracing for Surgical Prediction
Image-to-image translation models have achieved notable success in converting images across visual domains and are increasingly used for medical tasks such as predicting post-operative outcomes and modeling disease progression. However, most existing methods primarily aim to match the target distribution and often neglect spatial correspondences between the source and translated images. This limitation can lead to structural inconsistencies and hallucinations, undermining the reliability and interpretability of the predictions. These challenges are accentuated in clinical applications by the stringent requirement for anatomical accuracy. In this work, we present TraceTrans, a novel deformable image translation model designed for post-operative prediction that generates images aligned with the target distribution while explicitly revealing spatial correspondences with the pre-operative input. The framework employs an encoder for feature extraction and dual decoders for predicting spatial deformations and synthesizing the translated image. The predicted deformation field imposes spatial constraints on the generated output, ensuring anatomical consistency with the source. Extensive experiments on medical cosmetology and brain MRI datasets demonstrate that TraceTrans delivers accurate and interpretable post-operative predictions, highlighting its potential for reliable clinical deployment.
♻ ☆ Robustness is Important: Limitations of LLMs for Data Fitting
Large Language Models (LLMs) are being applied in a wide array of settings, well beyond the typical language-oriented use cases. In particular, LLMs are increasingly used as a plug-and-play method for fitting data and generating predictions. Prior work has shown that LLMs, via in-context learning or supervised fine-tuning, can perform competitively with many tabular supervised learning techniques in terms of predictive performance. However, we identify a critical vulnerability of using LLMs for data fitting -- making changes to data representation that are completely irrelevant to the underlying learning task can drastically alter LLMs' predictions on the same data. For example, simply changing variable names can sway the size of prediction error by as much as 82% in certain settings. Such prediction sensitivity with respect to task-irrelevant variations manifests under both in-context learning and supervised fine-tuning, for both close-weight and open-weight general-purpose LLMs. Moreover, by examining the attention scores of an open-weight LLM, we discover a non-uniform attention pattern: training examples and variable names/values which happen to occupy certain positions in the prompt receive more attention when output tokens are generated, even though different positions are expected to receive roughly the same attention. This partially explains the sensitivity in the presence of task-irrelevant variations. We also consider a state-of-the-art tabular foundation model (TabPFN) trained specifically for data fitting. Despite being explicitly designed to achieve prediction robustness, TabPFN is still not immune to task-irrelevant variations. Overall, despite LLMs' impressive predictive capabilities, currently they lack even the basic level of robustness to be used as a principled data-fitting tool.
♻ ☆ HyPerNav: Hybrid Perception for Object-Oriented Navigation in Unknown Environment
Objective-oriented navigation(ObjNav) enables robot to navigate to target object directly and autonomously in an unknown environment. Effective perception in navigation in unknown environment is critical for autonomous robots. While egocentric observations from RGB-D sensors provide abundant local information, real-time top-down maps offer valuable global context for ObjNav. Nevertheless, the majority of existing studies focus on a single source, seldom integrating these two complementary perceptual modalities, despite the fact that humans naturally attend to both. With the rapid advancement of Vision-Language Models(VLMs), we propose Hybrid Perception Navigation (HyPerNav), leveraging VLMs' strong reasoning and vision-language understanding capabilities to jointly perceive both local and global information to enhance the effectiveness and intelligence of navigation in unknown environments. In both massive simulation evaluation and real-world validation, our methods achieved state-of-the-art performance against popular baselines. Benefiting from hybrid perception approach, our method captures richer cues and finds the objects more effectively, by simultaneously leveraging information understanding from egocentric observations and the top-down map. Our ablation study further proved that either of the hybrid perception contributes to the navigation performance.
comment: under review
♻ ☆ A Neuroscience-Inspired Dual-Process Model of Compositional Generalization
Deep learning models struggle with systematic compositional generalization, a hallmark of human cognition. We propose \textsc{Mirage}, a neuro-inspired dual-process model that offers a processing account for this ability. It combines a fast, intuitive ``System~1'' (a meta-trained Transformer) with a deliberate, rule-based ``System~2'' (a Schema Engine), mirroring the brain's neocortical and hippocampal--prefrontal circuits. Trained to perform general, single-step decomposition on a stream of random grammars, Mirage achieves $>$99\% accuracy on all splits of the SCAN benchmark in a task-agnostic setting. Ablations confirm that the model's systematic behavior emerges from the architectural interplay of its two systems, particularly its use of explicit, prioritized schemas and iterative refinement. In line with recent progress on recursive/recurrent Transformer approaches, Mirage preserves an iterative neural update while externalizing declarative control into an interpretable schema module. Our work provides a concrete computational model for interpreting how compositional reasoning can arise from a modular cognitive architecture.
♻ ☆ Navigation with VLM framework: Towards Going to Any Language
Navigating towards fully open language goals and exploring open scenes in an intelligent way have always raised significant challenges. Recently, Vision Language Models (VLMs) have demonstrated remarkable capabilities to reason with both language and visual data. Although many works have focused on leveraging VLMs for navigation in open scenes, they often require high computational cost, rely on object-centric approaches, or depend on environmental priors in detailed human instructions. We introduce Navigation with VLM (NavVLM), a training-free framework that harnesses open-source VLMs to enable robots to navigate effectively, even for human-friendly language goal such as abstract places, actions, or specific objects in open scenes. NavVLM leverages the VLM as its cognitive core to perceive environmental information and constantly provides exploration guidance achieving intelligent navigation with only a neat target rather than a detailed instruction with environment prior. We evaluated and validated NavVLM in both simulation and real-world experiments. In simulation, our framework achieves state-of-the-art performance in Success weighted by Path Length (SPL) on object-specifc tasks in richly detailed environments from Matterport 3D (MP3D), Habitat Matterport 3D (HM3D) and Gibson. With navigation episode reported, NavVLM demonstrates the capabilities to navigate towards any open-set languages. In real-world validation, we validated our framework's effectiveness in real-world robot at indoor scene.
comment: under review
♻ ☆ Eigen-Value: Efficient Domain-Robust Data Valuation via Eigenvalue-Based Approach
Data valuation has become central in the era of data-centric AI. It drives efficient training pipelines and enables objective pricing in data markets by assigning a numeric value to each data point. Most existing data valuation methods estimate the effect of removing individual data points by evaluating changes in model validation performance under in-distribution (ID) settings, as opposed to out-of-distribution (OOD) scenarios where data follow different patterns. Since ID and OOD data behave differently, data valuation methods based on ID loss often fail to generalize to OOD settings, particularly when the validation set contains no OOD data. Furthermore, although OOD-aware methods exist, they involve heavy computational costs, which hinder practical deployment. To address these challenges, we introduce \emph{Eigen-Value} (EV), a plug-and-play data valuation framework for OOD robustness that uses only an ID data subset, including during validation. EV provides a new spectral approximation of domain discrepancy, which is the gap of loss between ID and OOD using ratios of eigenvalues of ID data's covariance matrix. EV then estimates the marginal contribution of each data point to this discrepancy via perturbation theory, alleviating the computational burden. Subsequently, EV plugs into ID loss-based methods by adding an EV term without any additional training loop. We demonstrate that EV achieves improved OOD robustness and stable value rankings across real-world datasets, while remaining computationally lightweight. These results indicate that EV is practical for large-scale settings with domain shift, offering an efficient path to OOD-robust data valuation.
♻ ☆ Your Dense Retriever is Secretly an Expeditious Reasoner
Dense retrievers enhance retrieval by encoding queries and documents into continuous vectors, but they often struggle with reasoning-intensive queries. Although Large Language Models (LLMs) can reformulate queries to capture complex reasoning, applying them universally incurs significant computational cost. In this work, we propose Adaptive Query Reasoning (AdaQR), a hybrid query rewriting framework. Within this framework, a Reasoner Router dynamically directs each query to either fast dense reasoning or deep LLM reasoning. The dense reasoning is achieved by the Dense Reasoner, which performs LLM-style reasoning directly in the embedding space, enabling a controllable trade-off between efficiency and accuracy. Experiments on large-scale retrieval benchmarks BRIGHT show that AdaQR reduces reasoning cost by 28% while preserving-or even improving-retrieval performance by 7%.
comment: 16 pages, 11 figures
♻ ☆ From Prompt Optimization to Multi-Dimensional Credibility Evaluation: Enhancing Trustworthiness of Chinese LLM-Generated Liver MRI Reports
Large language models (LLMs) have demonstrated promising performance in generating diagnostic conclusions from imaging findings, thereby supporting radiology reporting, trainee education, and quality control. However, systematic guidance on how to optimize prompt design across different clinical contexts remains underexplored. Moreover, a comprehensive and standardized framework for assessing the trustworthiness of LLM-generated radiology reports is yet to be established. This study aims to enhance the trustworthiness of LLM-generated liver MRI reports by introducing a Multi-Dimensional Credibility Assessment (MDCA) framework and providing guidance on institution-specific prompt optimization. The proposed framework is applied to evaluate and compare the performance of several advanced LLMs, including Kimi-K2-Instruct-0905, Qwen3-235B-A22B-Instruct-2507, DeepSeek-V3, and ByteDance-Seed-OSS-36B-Instruct, using the SiliconFlow platform.
comment: 10 pages, 6 figures, 4 tables
♻ ☆ BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text
Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, benchmarking on large-scale real-world data such as electronic health records (EHRs) is critical, as clinical decisions are directly informed by these sources, yet current evaluations remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world clinical data. Others focus narrowly on specific application scenarios, limiting their generalizability across broader clinical use. To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. It covers eight major task types spanning the entire continuum of patient care across six clinical stages and 20 representative applications, including triage and referral, consultation, information extraction, diagnosis, prognosis, and billing coding, and involves 14 clinical specialties. We systematically evaluated 95 LLMs (including DeepSeek-R1, GPT-4o, Gemini series, and Qwen3 series) under various inference strategies. Our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties. Notably, we demonstrate that open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs based on older architectures often underperform versus updated general-purpose models. The BRIDGE and its corresponding leaderboard serve as a foundational resource and a unique reference for the development and evaluation of new LLMs in real-world clinical text understanding. The BRIDGE leaderboard: https://huggingface.co/spaces/YLab-Open/BRIDGE-Medical-Leaderboard
♻ ☆ Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model IROS 2025
Omnidirectional depth perception is essential for mobile robotics applications that require scene understanding across a full 360{\deg} field of view. Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps without relying on expensive active sensing. However, existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, due to the scarcity of real-world data. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture. We introduce a dedicated two-stage training strategy to utilize the relative monocular depth features for our omnidirectional stereo matching before scale-invariant fine-tuning. DFI-OmniStereo achieves state-of-the-art results on the real-world Helvipad dataset, reducing disparity MAE by approximately 16% compared to the previous best omnidirectional stereo method.
comment: Accepted at IROS 2025. Project page: https://vita-epfl.github.io/DFI-OmniStereo-website/
♻ ☆ PanicToCalm: A Proactive Counseling Agent for Panic Attacks EMNLP 2025
Panic attacks are acute episodes of fear and distress, in which timely, appropriate intervention can significantly help individuals regain stability. However, suitable datasets for training such models remain scarce due to ethical and logistical issues. To address this, we introduce PACE, which is a dataset that includes high-distress episodes constructed from first-person narratives, and structured around the principles of Psychological First Aid (PFA). Using this data, we train PACER, a counseling model designed to provide both empathetic and directive support, which is optimized through supervised learning and simulated preference alignment. To assess its effectiveness, we propose PanicEval, a multi-dimensional framework covering general counseling quality and crisis-specific strategies. Experimental results show that PACER outperforms strong baselines in both counselor-side metrics and client affect improvement. Human evaluations further confirm its practical value, with PACER consistently preferred over general, CBT-based, and GPT-4-powered models in panic scenarios (Code is available at https://github.com/JihyunLee1/PanicToCalm ).
comment: Accepted in EMNLP 2025
♻ ☆ SEER: The Span-based Emotion Evidence Retrieval Benchmark
We introduce the SEER (Span-based Emotion Evidence Retrieval) Benchmark to test Large Language Models' (LLMs) ability to identify the specific spans of text that express emotion. Unlike traditional emotion recognition tasks that assign a single label to an entire sentence, SEER targets the underexplored task of emotion evidence detection: pinpointing which exact phrases convey emotion. This span-level approach is crucial for applications like empathetic dialogue and clinical support, which need to know how emotion is expressed, not just what the emotion is. SEER includes two tasks: identifying emotion evidence within a single sentence, and identifying evidence across a short passage of five consecutive sentences. It contains new annotations for both emotion and emotion evidence on 1200 real-world sentences. We evaluate 14 open-source LLMs and find that, while some models approach average human performance on single-sentence inputs, their accuracy degrades in longer passages. Our error analysis reveals key failure modes, including overreliance on emotion keywords and false positives in neutral text.
♻ ☆ GRS: Generating Robotic Simulation Tasks from Real-World Images
We introduce GRS (Generating Robotic Simulation tasks), a system addressing real-to-sim for robotic simulations. GRS creates digital twin simulations from single RGB-D observations with solvable tasks for virtual agent training. Using vision-language models (VLMs), our pipeline operates in three stages: 1) scene comprehension with SAM2 for segmentation and object description, 2) matching objects with simulation-ready assets, and 3) generating appropriate tasks. We ensure simulation-task alignment through generated test suites and introduce a router that iteratively refines both simulation and test code. Experiments demonstrate our system's effectiveness in object correspondence and task environment generation through our novel router mechanism.
♻ ☆ PVP: An Image Dataset for Personalized Visual Persuasion with Persuasion Strategies, Viewer Characteristics, and Persuasiveness Ratings ACL 2025
Visual persuasion, which uses visual elements to influence cognition and behaviors, is crucial in fields such as advertising and political communication. With recent advancements in artificial intelligence, there is growing potential to develop persuasive systems that automatically generate persuasive images tailored to individuals. However, a significant bottleneck in this area is the lack of comprehensive datasets that connect the persuasiveness of images with the personal information about those who evaluated the images. To address this gap and facilitate technological advancements in personalized visual persuasion, we release the Personalized Visual Persuasion (PVP) dataset, comprising 28,454 persuasive images across 596 messages and 9 persuasion strategies. Importantly, the PVP dataset provides persuasiveness scores of images evaluated by 2,521 human annotators, along with their demographic and psychological characteristics (personality traits and values). We demonstrate the utility of our dataset by developing a persuasive image generator and an automated evaluator, and establish benchmark baselines. Our experiments reveal that incorporating psychological characteristics enhances the generation and evaluation of persuasive images, providing valuable insights for personalized visual persuasion.
comment: ACL 2025 Main. Code and dataset are released at: https://github.com/holi-lab/PVP_Personalized_Visual_Persuasion
♻ ☆ SimpleVSF: VLM-Scoring Fusion for Trajectory Prediction of End-to-End Autonomous Driving
End-to-end autonomous driving has emerged as a promising paradigm for achieving robust and intelligent driving policies. However, existing end-to-end methods still face significant challenges, such as suboptimal decision-making in complex scenarios. In this paper,we propose SimpleVSF (Simple VLM-Scoring Fusion), a novel framework that enhances end-to-end planning by leveraging the cognitive capabilities of Vision-Language Models (VLMs) and advanced trajectory fusion techniques. We utilize the conventional scorers and the novel VLM-enhanced scorers. And we leverage a robust weight fusioner for quantitative aggregation and a powerful VLM-based fusioner for qualitative, context-aware decision-making. As the leading approach in the ICCV 2025 NAVSIM v2 End-to-End Driving Challenge, our SimpleVSF framework demonstrates state-of-the-art performance, achieving a superior balance between safety, comfort, and efficiency.
♻ ☆ GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation
Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning$\unicode{x2014}$a prerequisite for many applications. Empirically, we find that a dataset produced by a current training data generation pipeline has a 57.6% human validation rate. These rates stem from current limitations: single-image 3D reconstruction introduces cascading modeling errors and requires wide answer tolerances, while caption-based methods require hyper-detailed annotations and suffer from generative hallucinations. We present GRAID, built on the key insight that qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. By operating exclusively on 2D bounding boxes from standard object detectors, GRAID avoids both 3D reconstruction errors and generative hallucinations, resulting in datasets that are of higher quality than existing tools that produce similar datasets as validated by human evaluations. We apply our framework to the BDD100k, NuImages, and Waymo datasets, generating over 8.5 million high-quality VQA pairs creating questions spanning spatial relations, counting, ranking, and size comparisons. We evaluate one of the datasets and find it achieves 91.16% human-validated accuracy$\unicode{x2014}$compared to 57.6% on a dataset generated by recent work. Critically, we demonstrate that when trained on GRAID data, models learn spatial reasoning concepts that generalize: models fine-tuned on 6 question types improve on over 10 held-out types, with accuracy gains of 47.5% on BDD and 37.9% on NuImages for Llama 3.2B 11B, and when trained on all questions types, achieve improvements on several existing benchmarks such as BLINK. The GRAID framework, datasets, and additional information can be found $\href{this https URL}{here}$.
comment: 22 pages, 3 figures, 3 tables, project page: https://ke7.github.io/graid/
♻ ☆ Partner Modelling Emerges in Recurrent Agents (But Only When It Matters)
Humans are remarkably adept at collaboration, able to infer the strengths and weaknesses of new partners in order to work successfully towards shared goals. To build AI systems with this capability, we must first understand its building blocks: does such flexibility require explicit, dedicated mechanisms for modelling others -- or can it emerge spontaneously from the pressures of open-ended cooperative interaction? To investigate this question, we train simple model-free RNN agents to collaborate with a population of diverse partners. Using the `Overcooked-AI' environment, we collect data from thousands of collaborative teams, and analyse agents' internal hidden states. Despite a lack of additional architectural features, inductive biases, or auxiliary objectives, the agents nevertheless develop structured internal representations of their partners' task abilities, enabling rapid adaptation and generalisation to novel collaborators. We investigated these internal models through probing techniques, and large-scale behavioural analysis. Notably, we find that structured partner modelling emerges when agents can influence partner behaviour by controlling task allocation. Our results show that partner modelling can arise spontaneously in model-free agents -- but only under environmental conditions that impose the right kind of social pressure.
♻ ☆ 3D-Prover: Diversity Driven Theorem Proving With Determinantal Point Processes
A key challenge in automated formal reasoning is the intractable search space, which grows exponentially with the depth of the proof. This branching is caused by the large number of candidate proof tactics which can be applied to a given goal. Nonetheless, many of these tactics are semantically similar or lead to an execution error, wasting valuable resources in both cases. We address the problem of effectively pruning this search, using only synthetic data generated from previous proof attempts. We first demonstrate that it is possible to generate semantically aware tactic representations which capture the effect on the proving environment, likelihood of success, and execution time. We then propose a novel filtering mechanism which leverages these representations to select semantically diverse and high quality tactics, using Determinantal Point Processes. Our approach, 3D- Prover, is designed to be general, and to augment any underlying tactic generator. We demonstrate the effectiveness of 3D-Prover on the miniF2F and LeanDojo benchmarks by augmenting popular open source proving LLMs. We show that our approach leads to an increase in the overall proof rate, as well as a significant improvement in the tactic success rate, execution time and diversity. We make our code available at https://github.com/sean-lamont/3D-Prover.
♻ ☆ Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and NVIDIA Data Center GPUs
This study presents a benchmarking analysis of the Qualcomm Cloud AI 100 Ultra (QAic) accelerator for large language model (LLM) inference, evaluating its energy efficiency (throughput per watt), performance, and hardware scalability against NVIDIA A100 GPUs (in 4x and 8x configurations) within the National Research Platform (NRP) ecosystem. A total of 12 open-source LLMs, ranging from 124 million to 70 billion parameters, are served using the vLLM framework. Our analysis reveals that QAic achieves competitive energy efficiency with advantages on specific models while enabling more granular hardware allocation: some 70B models operate on as few as 1 QAic card versus 8 A100 GPUs required, with 20x lower power consumption (148W vs 2,983W). For smaller models, single QAic devices achieve up to 35x lower power consumption compared to our 4-GPU A100 configuration (36W vs 1,246W). The findings offer insights into the potential of the Qualcomm Cloud AI 100 Ultra for energy-constrained and resource-efficient HPC deployments within the National Research Platform (NRP).
comment: 8 pages, 3 tables
♻ ☆ Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models
Physical reasoning remains a significant challenge for Vision-Language Models (VLMs). This limitation arises from an inability to translate learned knowledge into predictions about physical behavior. Although continual fine-tuning can mitigate this issue, it is expensive for large models and impractical to perform repeatedly for every task. This necessitates the creation of modular and scalable ways to teach VLMs about physical reasoning. To that end, we introduce Physics Context Builders (PCBs), a modular framework where specialized smaller VLMs are fine-tuned to generate detailed physical scene descriptions. These can be used as physical contexts to enhance the reasoning capabilities of larger VLMs. PCBs enable the separation of visual perception from reasoning, allowing us to analyze their relative contributions to physical understanding. We perform experiments on CLEVRER and on Falling Tower, a stability detection dataset with both simulated and real-world scenes, to demonstrate that PCBs provide substantial performance improvements, increasing average accuracy by up to 13.8% on complex physical reasoning tasks. Notably, PCBs also show strong Sim2Real transfer, successfully generalizing from simulated training data to real-world scenes.
♻ ☆ SignMouth: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion
Sign language translation (SLT) aims to translate natural language from sign language videos, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches mainly focus on manual signals (hand gestures) and tend to overlook non-manual cues like mouthing. In fact, mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignClip, a novel framework to improve the accuracy of sign language translation. It fuses manual and non-manual cues, specifically spatial gesture and lip movement features. Besides, SignClip introduces a hierarchical contrastive learning framework with multi-level alignment objectives, ensuring semantic consistency across sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T, in the Gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71, and ROUGE from 46.57 to 48.38.
♻ ☆ A Study on the Framework for Evaluating the Ethics and Trustworthiness of Generative AI
This study provides an in_depth analysis of the ethical and trustworthiness challenges emerging alongside the rapid advancement of generative artificial intelligence (AI) technologies and proposes a comprehensive framework for their systematic evaluation. While generative AI, such as ChatGPT, demonstrates remarkable innovative potential, it simultaneously raises ethical and social concerns, including bias, harmfulness, copyright infringement, privacy violations, and hallucination. Current AI evaluation methodologies, which mainly focus on performance and accuracy, are insufficient to address these multifaceted issues. Thus, this study emphasizes the need for new human_centered criteria that also reflect social impact. To this end, it identifies key dimensions for evaluating the ethics and trustworthiness of generative AI_fairness, transparency, accountability, safety, privacy, accuracy, consistency, robustness, explainability, copyright and intellectual property protection, and source traceability and develops detailed indicators and assessment methodologies for each. Moreover, it provides a comparative analysis of AI ethics policies and guidelines in South Korea, the United States, the European Union, and China, deriving key approaches and implications from each. The proposed framework applies across the AI lifecycle and integrates technical assessments with multidisciplinary perspectives, thereby offering practical means to identify and manage ethical risks in real_world contexts. Ultimately, the study establishes an academic foundation for the responsible advancement of generative AI and delivers actionable insights for policymakers, developers, users, and other stakeholders, supporting the positive societal contributions of AI technologies.
comment: 22 pages, 3 figures, 6 tables
♻ ☆ Meta-Learning Objectives for Preference Optimization
Evaluating preference optimization (PO) algorithms on LLM alignment is a challenging task that presents prohibitive costs, noise, and several variables like model size and hyper-parameters. In this work, we show that it is possible to gain insights on the efficacy of PO algorithm on simpler benchmarks. We design a diagnostic suite of MuJoCo tasks and datasets, which we use to systematically evaluate PO algorithms, establishing a more controlled and cheaper benchmark. We then propose a novel family of PO algorithms based on mirror descent, which we call Mirror Preference Optimization (MPO). Through evolutionary strategies, we search this class to discover algorithms specialized to specific properties of preference datasets, such as mixed-quality or noisy data. We demonstrate that our discovered PO algorithms outperform all known algorithms in the targeted MuJoCo settings. Finally, based on the insights gained from our MuJoCo experiments, we design a PO algorithm that significantly outperform existing baselines in an LLM alignment task.
♻ ☆ Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models NeurIPS 2025
Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the types of reasoning strategies that models use. Brainteasers are well-suited for this goal because they can be solved with multiple approaches, such as a few-step solution that uses a creative insight or a longer solution that uses more brute force. We investigate large language models (LLMs) across multiple layers of reasoning, focusing not only on correctness but also on the quality and creativity of their solutions. We investigate many aspects of the reasoning process: (1) semantic parsing of the brainteasers into precise mathematical competition style formats; (2) generating solutions from these mathematical forms; (3) self-correcting solutions based on gold solutions; (4) producing step-by-step sketches of solutions; and (5) making use of hints. We find that LLMs are in many cases able to find creative, insightful solutions to brainteasers, suggesting that they capture some of the capacities needed to solve novel problems in creative ways. Nonetheless, there also remain situations where they rely on brute force despite the availability of more efficient, creative solutions, highlighting a potential direction for improvement in the reasoning abilities of LLMs.
comment: NeurIPS 2025
♻ ☆ OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs
The increased use of large language models (LLMs) across a variety of real-world applications calls for automatic tools to check the factual accuracy of their outputs, as LLMs often hallucinate. This is difficult as it requires assessing the factuality of free-form open-domain responses. While there has been a lot of research on this topic, different papers use different evaluation benchmarks and measures, which makes them hard to compare and hampers future progress. To mitigate these issues, we developed OpenFactCheck, a unified framework, with three modules: (i) RESPONSEEVAL, which allows users to easily customize an automatic fact-checking system and to assess the factuality of all claims in an input document using that system, (ii) LLMEVAL, which assesses the overall factuality of an LLM, and (iii) CHECKEREVAL, a module to evaluate automatic fact-checking systems. OpenFactCheck is open-sourced (https://github.com/mbzuai-nlp/openfactcheck) and publicly released as a Python library (https://pypi.org/project/openfactcheck/) and also as a web service (http://app.openfactcheck.com). A video describing the system is available at https://youtu.be/-i9VKL0HleI.
comment: 11 pages, 4 Figures, 3 Tables, Published In Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing
♻ ☆ Doubly Robust Alignment for Large Language Models NeurIPS 2025
This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM
comment: Accepted to NeurIPS 2025
♻ ☆ p-less Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding
Obtaining high-quality outputs from Large Language Models (LLMs) often depends upon the choice of a sampling-based decoding strategy to probabilistically choose the next token at each generation step. While a variety of such sampling methods have been proposed, their performance can be sensitive to the selection of hyperparameters which may require different settings depending upon the generation task and temperature configuration. In this work, we introduce $p$-less sampling: an information-theoretic approach to sampling which dynamically sets a truncation threshold at each decoding step based on the entire token probability distribution. Unlike existing methods, $p$-less sampling has no hyperparameters and consistently produces high-quality outputs as temperature increases. We provide theoretical perspectives on $p$-less sampling to ground our proposed method and conduct experiments to empirically validate its effectiveness across a range of math, logical reasoning, and creative writing tasks. Our results demonstrate how $p$-less sampling consistently outperforms existing sampling approaches while exhibiting much less degradation in text quality at higher temperature values. We further show how $p$-less achieves greater inference-time efficiency than alternative methods through lower average token sampling times and shorter generation lengths, without sacrificing accuracy. Finally, we provide analyses to highlight the benefits of $p$-less through qualitative examples, case studies, and diversity assessments. The code is available at https://github.com/ryttry/p-less .
♻ ☆ Integrating Counterfactual Simulations with Language Models for Explaining Multi-Agent Behaviour
Autonomous multi-agent systems (MAS) are useful for automating complex tasks but raise trust concerns due to risks such as miscoordination or goal misalignment. Explainability is vital for users' trust calibration, but explainable MAS face challenges due to complex environments, the human factor, and non-standardised evaluation. Leveraging the counterfactual effect size model and LLMs, we propose Agentic eXplanations via Interrogative Simulation (AXIS). AXIS generates human-centred action explanations for multi-agent policies by having an LLM interrogate an environment simulator using prompts like 'whatif' and 'remove' to observe and synthesise counterfactual information over multiple rounds. We evaluate AXIS on autonomous driving across ten scenarios for five LLMs with a comprehensive methodology combining robustness, subjective preference, correctness, and goal/action prediction with an external LLM as evaluator. Compared to baselines, AXIS improves perceived explanation correctness by at least 7.7% across all models and goal prediction accuracy by 23% for four models, with comparable action prediction accuracy, achieving the highest scores overall. Our code is open-sourced at https://github.com/gyevnarb/axis.
♻ ☆ CURATRON: Complete and Robust Preference Data for Rigorous Alignment of Large Language Models
This paper addresses the challenges of aligning large language models (LLMs) with human values via preference learning (PL), focusing on incomplete and corrupted data in preference datasets. We propose a novel method for robustly and completely recalibrating values within these datasets to enhance LLMs' resilience against the issues. In particular, we devise a guaranteed polynomial time ranking algorithm that robustifies several existing models, such as the classic Bradley-Terry-Luce (BTL) (Bradley and Terry, 1952) model and certain generalizations of it. To the best of our knowledge, our present work is the first to propose an algorithm that provably recovers an $\epsilon$-optimal ranking with high probability while allowing as large as $O(n)$ perturbed pairwise comparison results per model response. Furthermore, we show robust recovery results in the partially observed setting. Our experiments confirm that our algorithms handle adversarial noise and unobserved comparisons well in both general and LLM preference dataset settings. This work contributes to the development and scaling of more reliable and ethically aligned AI models by equipping the dataset curation pipeline with the ability to handle missing and maliciously manipulated inputs.
♻ ☆ HAMLET: Hyperadaptive Agent-based Modeling for Live Embodied Theatrics
Creating an immersive and interactive theatrical experience is a long-term goal in the field of interactive narrative. The emergence of large language model (LLM) is providing a new path to achieve this goal. However, existing LLM-based drama generation methods often result in agents that lack initiative and cannot interact with the physical scene. Furthermore, these methods typically require detailed user input to drive the drama. These limitations reduce the interactivity and immersion of online real-time performance. To address the above challenges, we propose HAMLET, a multi-agent framework focused on drama creation and online performance. Given a simple topic, the framework generates a narrative blueprint, guiding the subsequent improvisational performance. During the online performance, each actor is given an autonomous mind. This means that actors can make independent decisions based on their own background, goals, and emotional state. In addition to conversations with other actors, their decisions can also change the state of scene props through actions such as opening a letter or picking up a weapon. The change is then broadcast to other related actors, updating what they know and care about, which in turn influences their next action. To evaluate the quality of drama performance generated by HAMLET, we designed an evaluation method to assess three primary aspects, including character performance, narrative quality, and interaction experience. The experimental evaluation shows that HAMLET can create expressive and coherent theatrical experiences.
♻ ☆ WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models ACL 2025
Climate change adaptation requires the understanding of disruptive weather impacts on society, where large language models (LLMs) might be applicable. However, their effectiveness is under-explored due to the difficulty of high-quality corpus collection and the lack of available benchmarks. The climate-related events stored in regional newspapers record how communities adapted and recovered from disasters. However, the processing of the original corpus is non-trivial. In this study, we first develop a disruptive weather impact dataset with a four-stage well-crafted construction pipeline. Then, we propose WXImpactBench, the first benchmark for evaluating the capacity of LLMs on disruptive weather impacts. The benchmark involves two evaluation tasks, multi-label classification and ranking-based question answering. Extensive experiments on evaluating a set of LLMs provide first-hand analysis of the challenges in developing disruptive weather impact understanding and climate change adaptation systems. The constructed dataset and the code for the evaluation framework are available to help society protect against vulnerabilities from disasters.
comment: Accepted by ACL 2025
♻ ☆ Adaptive Frontier Exploration on Graphs with Applications to Network-Based Disease Testing NeurIPS 2025
We study a sequential decision-making problem on a $n$-node graph $\mathcal{G}$ where each node has an unknown label from a finite set $\mathbf{\Omega}$, drawn from a joint distribution $\mathcal{P}$ that is Markov with respect to $\mathcal{G}$. At each step, selecting a node reveals its label and yields a label-dependent reward. The goal is to adaptively choose nodes to maximize expected accumulated discounted rewards. We impose a frontier exploration constraint, where actions are limited to neighbors of previously selected nodes, reflecting practical constraints in settings such as contact tracing and robotic exploration. We design a Gittins index-based policy that applies to general graphs and is provably optimal when $\mathcal{G}$ is a forest. Our implementation runs in $\mathcal{O}(n^2 \cdot |\mathbf{\Omega}|^2)$ time while using $\mathcal{O}(n \cdot |\mathbf{\Omega}|^2)$ oracle calls to $\mathcal{P}$ and $\mathcal{O}(n^2 \cdot |\mathbf{\Omega}|)$ space. Experiments on synthetic and real-world graphs show that our method consistently outperforms natural baselines, including in non-tree, budget-limited, and undiscounted settings. For example, in HIV testing simulations on real-world sexual interaction networks, our policy detects nearly all positive cases with only half the population tested, substantially outperforming other baselines.
comment: Accepted into NeurIPS 2025
♻ ☆ What Causes Postoperative Aspiration?
Background: Aspiration, the inhalation of foreign material into the lungs, significantly impacts surgical patient morbidity and mortality. This study develops a machine learning (ML) model to predict postoperative aspiration, enabling timely preventative interventions. Methods: From the MIMIC-IV database of over 400,000 hospital admissions, we identified 826 surgical patients (mean age: 62, 55.7\% male) who experienced aspiration within seven days post-surgery, along with a matched non-aspiration cohort. Three ML models: XGBoost, Multilayer Perceptron, and Random Forest were trained using pre-surgical hospitalization data to predict postoperative aspiration. To investigate causation, we estimated Average Treatment Effects (ATE) using Augmented Inverse Probability Weighting. Results: Our ML model achieved an AUROC of 0.86 and 77.3\% sensitivity on a held-out test set. Maximum daily opioid dose, length of stay, and patient age emerged as the most important predictors. ATE analysis identified significant causative factors: opioids (0.25 +/- 0.06) and operative site (neck: 0.20 +/- 0.13, head: 0.19 +/- 0.13). Despite equal surgery rates across genders, men were 1.5 times more likely to aspirate and received 27\% higher maximum daily opioid dosages compared to women. Conclusion: ML models can effectively predict postoperative aspiration risk, enabling targeted preventative measures. Maximum daily opioid dosage and operative site significantly influence aspiration risk. The gender disparity in both opioid administration and aspiration rates warrants further investigation. These findings have important implications for improving postoperative care protocols and aspiration prevention strategies.
♻ ☆ Non-Markovian Discrete Diffusion with Causal Language Models NeurIPS 2025
Discrete diffusion models offer a flexible, controllable approach to structured sequence generation, yet they still lag behind causal language models in expressive power. A key limitation lies in their reliance on the Markovian assumption, which restricts each step to condition only on the current state, leading to potential uncorrectable error accumulation. In this paper, we introduce CaDDi (Causal Discrete Diffusion Model), a discrete diffusion model that conditions on the entire generative trajectory, thereby lifting the Markov constraint and allowing the model to revisit and improve past states. By unifying sequential (causal) and temporal (diffusion) reasoning in a single non-Markovian transformer, CaDDi also treats standard causal language models as a special case and permits the direct reuse of pretrained LLM weights with no architectural changes. Empirically, CaDDi outperforms state-of-the-art discrete diffusion baselines on natural-language benchmarks, substantially narrowing the remaining gap to large autoregressive transformers.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ TowerVision: Understanding and Improving Multilinguality in Vision-Language Models
Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image tasks) and ViMUL-Bench (video tasks). Alongside the models, we release VisionBlocks, a high-quality, curated vision-language dataset. Our findings highlight that multilingual vision-language training data substantially improves cross-lingual generalization -- both from high-resource to underrepresented languages and vice versa -- and that instruction-tuned LLMs are not always the optimal initialization point. To support further research, we publicly release all models, data, and training recipes.
comment: 15 pages, 7 figures, submitted to arXiv October 2025. All models, datasets, and training code will be released at https://huggingface.co/collections/utter-project/towervision
♻ ☆ BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills
High quality bugs are key to training the next generation of language model based software engineering (SWE) agents. We introduce a novel method for synthetic generation of difficult and diverse bugs. Our method instructs SWE Agents to introduce a feature into the codebase whereby they may unintentionally break tests, resulting in bugs. Prior approaches often induce an out-of-distribution effect by generating bugs intentionally (e.g. by introducing local perturbation to existing code), which does not reflect realistic development processes. We perform qualitative analysis to demonstrate that our approach for generating bugs more closely reflects the patterns found in human-authored edits. Through extensive experiments, we demonstrate that our bugs provide more efficient training data for supervised fine-tuning, outperforming other bug datasets by 2% with half the training data (1.2k vs. 3k bugs). We train on our newly generated bugs in addition to existing bug datasets to get FrogBoss a state-of-the-art 32B parameter model on SWE-bench Verified with a pass@1 of 54.6% and FrogMini a state-of-the-art 14B model on SWE-bench Verified with a pass@1 of 45.3% on SWE-bench Verified averaged over three seeds.
♻ ☆ Privacy-Preserving Personalization in Education: A Federated Recommender System for Student Performance Prediction
The increasing digitalization of education presents unprecedented opportunities for data-driven personalization, but it also introduces significant challenges to student data privacy. Conventional recommender systems rely on centralized data, a paradigm often incompatible with modern data protection regulations. A novel privacy-preserving recommender system is proposed and evaluated to address this critical issue using Federated Learning (FL). The approach utilizes a Deep Neural Network (DNN) with rich, engineered features from the large-scale ASSISTments educational dataset. A rigorous comparative analysis of federated aggregation strategies was conducted, identifying FedProx as a significantly more stable and effective method for handling heterogeneous student data than the standard FedAvg baseline. The optimized federated model achieves a high-performance F1-Score of 76.28%, corresponding to 92% of the performance of a powerful, centralized XGBoost model. These findings validate that a federated approach can provide highly effective content recommendations without centralizing sensitive student data. Consequently, our work presents a viable and robust solution to the personalization-privacy dilemma in modern educational platforms.
comment: This paper has been prepared to be submitted to the Informatics in Education
♻ ☆ Flow matching for reaction pathway generation
Elucidating reaction mechanisms hinges on efficiently generating transition states (TSs), products, and complete reaction networks. Recent generative models, such as diffusion models for TS sampling and sequence-based architectures for product generation, offer faster alternatives to quantum-chemistry searches. But diffusion models remain constrained by their stochastic differential equation (SDE) dynamics, which suffer from inefficiency and limited controllability. We show that flow matching, a deterministic ordinary differential (ODE) formulation, can replace SDE-based diffusion for molecular and reaction generation. We introduce MolGEN, a conditional flow-matching framework that learns an optimal transport path to transport Gaussian priors to target chemical distributions. On benchmarks used by TSDiff and OA-ReactDiff, MolGEN surpasses TS geometry accuracy and barrier-height prediction while reducing sampling to sub-second inference. MolGEN also supports open-ended product generation with competitive top-k accuracy and avoids mass/electron-balance violations common to sequence models. In a realistic test on the $\gamma$-ketohydroperoxide decomposition network, MolGEN yields higher fractions of valid and intended TSs with markedly fewer quantum-chemistry evaluations than string-based baselines. These results demonstrate that deterministic flow matching provides a unified, accurate, and computationally efficient foundation for molecular generative modeling, signaling that flow matching is the future for molecular generation across chemistry.
comment: Updates from the previous version: 1. Redeveloped the model for general purpose instead of just transition state generation, and renamed the package to MolGEN. 2. The prediction error reported in the previous version was wrong due to a misplaced mask in the code, updated. 3. Added benchmarks for reaction product generation and did a full-round experiment on reaction network exploration
♻ ☆ Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models
Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during continual pretraining without test-time retrieval, by revising the training process. To study this, we construct CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents and probes both short-form (single-fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to index factual knowledge by binding it to persistent document identifiers; and (2) instruction tuning to elicit citation behavior. We introduce Active Indexing for the first stage, which creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restate each fact in diverse, compositional forms and (ii) enforce bidirectional training (source-to-fact and fact-to-source). This equips the model to both generate content from a cited source and attribute its own answers, improving robustness to paraphrase and composition. Experiments with Qwen-2.5-7B&3B show that Active Indexing consistently outperforms a Passive Indexing baseline, which simply appends an identifier to each document, achieving citation precision gains of up to 30.2% across all tasks and models. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16x the original token count. Finally, we show that internal citations complement external ones by making the model more robust to retrieval noise.
♻ ☆ Graph Mixing Additive Networks
We introduce GMAN, a flexible, interpretable, and expressive framework that extends Graph Neural Additive Networks (GNANs) to learn from sets of sparse time-series data. GMAN represents each time-dependent trajectory as a directed graph and applies an enriched, more expressive GNAN to each graph. It allows users to control the interpretability-expressivity trade-off by grouping features and graphs to encode priors, and it provides feature, node, and graph-level interpretability. On real-world datasets, including mortality prediction from blood tests and fake-news detection, GMAN outperforms strong non-interpretable black-box baselines while delivering actionable, domain-aligned explanations.
comment: arXiv admin note: substantial text overlap with arXiv:2505.19193
♻ ☆ The Neural Pruning Law Hypothesis
Network pruning is used to reduce inference latency and power consumption in large neural networks. However, most current pruning methods rely on ad-hoc heuristics that are poorly understood. We introduce Hyperflux, a conceptually-grounded pruning method, and use it to study the pruning process. Hyperflux models this process as an interaction between weight flux, the gradient's response to the weight's removal, and network pressure, a global regularization driving weights towards pruning. We postulate properties that arise naturally from our framework and find that the relationship between minimum flux among weights and density follows a power-law equation. Furthermore, we hypothesize the power-law relationship to hold for any effective saliency metric and call this idea the Neural Pruning Law Hypothesis. We validate our hypothesis on several families of pruning methods (magnitude, gradients, $L_0$), providing a potentially unifying property for neural pruning.
Machine Learning 282
☆ Generative View Stitching
Autoregressive video diffusion models are capable of long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce Omni Guidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersv\"ard's Impossible Staircase. Results are best viewed as videos at https://andrewsonga.github.io/gvs.
comment: Project website: https://andrewsonga.github.io/gvs
☆ A Single-Loop First-Order Algorithm for Linearly Constrained Bilevel Optimization NeurIPS 2025
We study bilevel optimization problems where the lower-level problems are strongly convex and have coupled linear constraints. To overcome the potential non-smoothness of the hyper-objective and the computational challenges associated with the Hessian matrix, we utilize penalty and augmented Lagrangian methods to reformulate the original problem as a single-level one. Especially, we establish a strong theoretical connection between the reformulated function and the original hyper-objective by characterizing the closeness of their values and derivatives. Based on this reformulation, we propose a single-loop, first-order algorithm for linearly constrained bilevel optimization (SFLCB). We provide rigorous analyses of its non-asymptotic convergence rates, showing an improvement over prior double-loop algorithms -- form $O(\epsilon^{-3}\log(\epsilon^{-1}))$ to $O(\epsilon^{-3})$. The experiments corroborate our theoretical findings and demonstrate the practical efficiency of the proposed SFLCB algorithm. Simulation code is provided at https://github.com/ShenGroup/SFLCB.
comment: NeurIPS 2025
☆ Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers? NeurIPS 2025
Object binding, the brain's ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term IsSameObject. We decode IsSameObject from patch embeddings across ViT layers using a similarity probe, which reaches over 90% accuracy. Crucially, this object-binding capability emerges reliably in self-supervised ViTs (DINO, MAE, CLIP), but markedly weaker in ImageNet-supervised models, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. Our findings challenge the view that ViTs lack object binding and highlight how symbolic knowledge of "which parts belong together" emerges naturally in a connectionist system.
comment: Accepted as a Spotlight at NeurIPS 2025
☆ Tongyi DeepResearch Technical Report
We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
comment: https://tongyi-agent.github.io/blog
☆ Greedy Sampling Is Provably Efficient for RLHF NeurIPS 2025
Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for post-training large language models. Despite its empirical success, the theoretical understanding of RLHF is still limited, as learning the KL-regularized target with only preference feedback poses additional challenges compared with canonical RL. Existing works mostly study the reward-based Bradley-Terry (BT) preference model, and extend classical designs utilizing optimism or pessimism. This work, instead, considers the general preference model (whose practical relevance has been observed recently) and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates (i.e., greedy sampling), as opposed to constructing optimistic or pessimistic estimates in previous works. This insight has a deep root in the unique structural property of the optimal policy class under the KL-regularized target, and we further specialize it to the BT model, highlighting the surprising sufficiency of greedy sampling in RLHF.
comment: NeurIPS 2025
☆ AgentFold: Long-Horizon Web Agents with Proactive Context Management
LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management. Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories, while methods that fixedly summarize the full history at each step risk the irreversible loss of critical details. Addressing these, we introduce AgentFold, a novel agent paradigm centered on proactive context management, inspired by the human cognitive process of retrospective consolidation. AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled. At each step, it learns to execute a `folding' operation, which manages its historical trajectory at multiple scales: it can perform granular condensations to preserve vital, fine-grained details, or deep consolidations to abstract away entire multi-step sub-tasks. The results on prominent benchmarks are striking: with simple supervised fine-tuning (without continual pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or matches open-source models of a dramatically larger scale, such as the DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like OpenAI's o4-mini.
comment: 26 pages, 9 figures
☆ Learning to Drive Safely with Hybrid Options
Out of the many deep reinforcement learning approaches for autonomous driving, only few make use of the options (or skills) framework. That is surprising, as this framework is naturally suited for hierarchical control applications in general, and autonomous driving tasks in specific. Therefore, in this work the options framework is applied and tailored to autonomous driving tasks on highways. More specifically, we define dedicated options for longitudinal and lateral manoeuvres with embedded safety and comfort constraints. This way, prior domain knowledge can be incorporated into the learning process and the learned driving behaviour can be constrained more easily. We propose several setups for hierarchical control with options and derive practical algorithms following state-of-the-art reinforcement learning techniques. By separately selecting actions for longitudinal and lateral control, the introduced policies over combined and hybrid options obtain the same expressiveness and flexibility that human drivers have, while being easier to interpret than classical policies over continuous actions. Of all the investigated approaches, these flexible policies over hybrid options perform the best under varying traffic conditions, outperforming the baseline policies over actions.
☆ Eigenfunction Extraction for Ordered Representation Learning
Recent advances in representation learning reveal that widely used objectives, such as contrastive and non-contrastive, implicitly perform spectral decomposition of a contextual kernel, induced by the relationship between inputs and their contexts. Yet, these methods recover only the linear span of top eigenfunctions of the kernel, whereas exact spectral decomposition is essential for understanding feature ordering and importance. In this work, we propose a general framework to extract ordered and identifiable eigenfunctions, based on modular building blocks designed to satisfy key desiderata, including compatibility with the contextual kernel and scalability to modern settings. We then show how two main methodological paradigms, low-rank approximation and Rayleigh quotient optimization, align with this framework for eigenfunction extraction. Finally, we validate our approach on synthetic kernels and demonstrate on real-world image datasets that the recovered eigenvalues act as effective importance scores for feature selection, enabling principled efficiency-accuracy tradeoffs via adaptive-dimensional representations.
☆ Pearl: A Foundation Model for Placing Every Atom in the Right Location
Accurately predicting the three-dimensional structures of protein-ligand complexes remains a fundamental challenge in computational drug discovery that limits the pace and success of therapeutic design. Deep learning methods have recently shown strong potential as structural prediction tools, achieving promising accuracy across diverse biomolecular systems. However, their performance and utility are constrained by scarce experimental data, inefficient architectures, physically invalid poses, and the limited ability to exploit auxiliary information available at inference. To address these issues, we introduce Pearl (Placing Every Atom in the Right Location), a foundation model for protein-ligand cofolding at scale. Pearl addresses these challenges with three key innovations: (1) training recipes that include large-scale synthetic data to overcome data scarcity; (2) architectures that incorporate an SO(3)-equivariant diffusion module to inherently respect 3D rotational symmetries, improving generalization and sample efficiency, and (3) controllable inference, including a generalized multi-chain templating system supporting both protein and non-polymeric components as well as dual unconditional/conditional modes. Pearl establishes a new state-of-the-art performance in protein-ligand cofolding. On the key metric of generating accurate (RMSD < 2 \r{A}) and physically valid poses, Pearl surpasses AlphaFold 3 and other open source baselines on the public Runs N' Poses and PoseBusters benchmarks, delivering 14.5% and 14.2% improvements, respectively, over the next best model. In the pocket-conditional cofolding regime, Pearl delivers $3.6\times$ improvement on a proprietary set of challenging, real-world drug targets at the more rigorous RMSD < 1 \r{A} threshold. Finally, we demonstrate that model performance correlates directly with synthetic dataset size used in training.
☆ The Cost of Robustness: Tighter Bounds on Parameter Complexity for Robust Memorization in ReLU Nets NeurIPS 2025
We study the parameter complexity of robust memorization for $\mathrm{ReLU}$ networks: the number of parameters required to interpolate any given dataset with $\epsilon$-separation between differently labeled points, while ensuring predictions remain consistent within a $\mu$-ball around each training sample. We establish upper and lower bounds on the parameter count as a function of the robustness ratio $\rho = \mu / \epsilon$. Unlike prior work, we provide a fine-grained analysis across the entire range $\rho \in (0,1)$ and obtain tighter upper and lower bounds that improve upon existing results. Our findings reveal that the parameter complexity of robust memorization matches that of non-robust memorization when $\rho$ is small, but grows with increasing $\rho$.
comment: Accepted to NeurIPS 2025, 72 pages, 8 figures
☆ Causal Ordering for Structure Learning From Time Series
Predicting causal structure from time series data is crucial for understanding complex phenomena in physiology, brain connectivity, climate dynamics, and socio-economic behaviour. Causal discovery in time series is hindered by the combinatorial complexity of identifying true causal relationships, especially as the number of variables and time points grow. A common approach to simplify the task is the so-called ordering-based methods. Traditional ordering methods inherently limit the representational capacity of the resulting model. In this work, we fix this issue by leveraging multiple valid causal orderings, instead of a single one as standard practice. We propose DOTS (Diffusion Ordered Temporal Structure), using diffusion-based causal discovery for temporal data. By integrating multiple orderings, DOTS effectively recovers the transitive closure of the underlying directed acyclic graph, mitigating spurious artifacts inherent in single-ordering approaches. We formalise the problem under standard assumptions such as stationarity and the additive noise model, and leverage score matching with diffusion processes to enable efficient Hessian estimation. Extensive experiments validate the approach. Empirical evaluations on synthetic and real-world datasets demonstrate that DOTS outperforms state-of-the-art baselines, offering a scalable and robust approach to temporal causal discovery. On synthetic benchmarks ($d{=}\!3-\!6$ variables, $T{=}200\!-\!5{,}000$ samples), DOTS improves mean window-graph $F1$ from $0.63$ (best baseline) to $0.81$. On the CausalTime real-world benchmark ($d{=}20\!-\!36$), while baselines remain the best on individual datasets, DOTS attains the highest average summary-graph $F1$ while halving runtime relative to graph-optimisation methods. These results establish DOTS as a scalable and accurate solution for temporal causal discovery.
comment: 32 pages
☆ Symbolic Snapshot Ensembles
Inductive logic programming (ILP) is a form of logical machine learning. Most ILP algorithms learn a single hypothesis from a single training run. Ensemble methods train an ILP algorithm multiple times to learn multiple hypotheses. In this paper, we train an ILP algorithm only once and save intermediate hypotheses. We then combine the hypotheses using a minimum description length weighting scheme. Our experiments on multiple benchmarks, including game playing and visual reasoning, show that our approach improves predictive accuracy by 4% with less than 1% computational overhead.
☆ Coreset for Robust Geometric Median: Eliminating Size Dependency on Outliers NeurIPS 2025
We study the robust geometric median problem in Euclidean space $\mathbb{R}^d$, with a focus on coreset construction.A coreset is a compact summary of a dataset $P$ of size $n$ that approximates the robust cost for all centers $c$ within a multiplicative error $\varepsilon$. Given an outlier count $m$, we construct a coreset of size $\tilde{O}(\varepsilon^{-2} \cdot \min\{\varepsilon^{-2}, d\})$ when $n \geq 4m$, eliminating the $O(m)$ dependency present in prior work [Huang et al., 2022 & 2023]. For the special case of $d = 1$, we achieve an optimal coreset size of $\tilde{\Theta}(\varepsilon^{-1/2} + \frac{m}{n} \varepsilon^{-1})$, revealing a clear separation from the vanilla case studied in [Huang et al., 2023; Afshani and Chris, 2024]. Our results further extend to robust $(k,z)$-clustering in various metric spaces, eliminating the $m$-dependence under mild data assumptions. The key technical contribution is a novel non-component-wise error analysis, enabling substantial reduction of outlier influence, unlike prior methods that retain them.Empirically, our algorithms consistently outperform existing baselines in terms of size-accuracy tradeoffs and runtime, even when data assumptions are violated across a wide range of datasets.
comment: This paper has been accepted by NeurIPS 2025
☆ Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation
With the release of new large language models (LLMs) like Llama and Mistral, zero-shot cross-lingual transfer has become increasingly feasible due to their multilingual pretraining and strong generalization capabilities. However, adapting these decoder-only LLMs to new tasks across languages remains challenging. While parameter-efficient fine-tuning (PeFT) techniques like Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as soft prompt tuning, prefix tuning, and Llama Adapter are less explored, especially for zero-shot transfer in decoder-only models. We present a comprehensive study of three prefix-based methods for zero-shot cross-lingual transfer from English to 35+ high- and low-resource languages. Our analysis further explores transfer across linguistic families and scripts, as well as the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix methods outperform LoRA-baselines by up to 6% on the Belebele benchmark. Similar improvements were observed with Mistral v0.3 7B as well. Despite using only 1.23M learning parameters with prefix tuning, we achieve consistent improvements across diverse benchmarks. These findings highlight the potential of prefix-based techniques as an effective and scalable alternative to LoRA, particularly in low-resource multilingual settings.
comment: 12 Pages
☆ Statistical physics of deep learning: Optimal learning of a multi-layer perceptron near interpolation
For three decades statistical physics has been providing a framework to analyse neural networks. A long-standing question remained on its capacity to tackle deep learning models capturing rich feature learning effects, thus going beyond the narrow networks or kernel methods analysed until now. We positively answer through the study of the supervised learning of a multi-layer perceptron. Importantly, (i) its width scales as the input dimension, making it more prone to feature learning than ultra wide networks, and more expressive than narrow ones or with fixed embedding layers; and (ii) we focus on the challenging interpolation regime where the number of trainable parameters and data are comparable, which forces the model to adapt to the task. We consider the matched teacher-student setting. It provides the fundamental limits of learning random deep neural network targets and helps in identifying the sufficient statistics describing what is learnt by an optimally trained network as the data budget increases. A rich phenomenology emerges with various learning transitions. With enough data optimal performance is attained through model's "specialisation" towards the target, but it can be hard to reach for training algorithms which get attracted by sub-optimal solutions predicted by the theory. Specialisation occurs inhomogeneously across layers, propagating from shallow towards deep ones, but also across neurons in each layer. Furthermore, deeper targets are harder to learn. Despite its simplicity, the Bayesian-optimal setting provides insights on how the depth, non-linearity and finite (proportional) width influence neural networks in the feature learning regime that are potentially relevant way beyond it.
comment: 30 pages, 19 figures + appendix. This submission supersedes both arXiv:2505.24849 and arXiv:2501.18530
☆ Semi-supervised and unsupervised learning for health indicator extraction from guided waves in aerospace composite structures
Health indicators (HIs) are central to diagnosing and prognosing the condition of aerospace composite structures, enabling efficient maintenance and operational safety. However, extracting reliable HIs remains challenging due to variability in material properties, stochastic damage evolution, and diverse damage modes. Manufacturing defects (e.g., disbonds) and in-service incidents (e.g., bird strikes) further complicate this process. This study presents a comprehensive data-driven framework that learns HIs via two learning approaches integrated with multi-domain signal processing. Because ground-truth HIs are unavailable, a semi-supervised and an unsupervised approach are proposed: (i) a diversity deep semi-supervised anomaly detection (Diversity-DeepSAD) approach augmented with continuous auxiliary labels used as hypothetical damage proxies, which overcomes the limitation of prior binary labels that only distinguish healthy and failed states while neglecting intermediate degradation, and (ii) a degradation-trend-constrained variational autoencoder (DTC-VAE), in which the monotonicity criterion is embedded via an explicit trend constraint. Guided waves with multiple excitation frequencies are used to monitor single-stiffener composite structures under fatigue loading. Time, frequency, and time-frequency representations are explored, and per-frequency HIs are fused via unsupervised ensemble learning to mitigate frequency dependence and reduce variance. Using fast Fourier transform features, the augmented Diversity-DeepSAD model achieved 81.6% performance, while DTC-VAE delivered the most consistent HIs with 92.3% performance, outperforming existing baselines.
☆ Comparison of generalised additive models and neural networks in applications: A systematic review
Neural networks have become a popular tool in predictive modelling, more commonly associated with machine learning and artificial intelligence than with statistics. Generalised Additive Models (GAMs) are flexible non-linear statistical models that retain interpretability. Both are state-of-the-art in their own right, with their respective advantages and disadvantages. This paper analyses how these two model classes have performed on real-world tabular data. Following PRISMA guidelines, we conducted a systematic review of papers that performed empirical comparisons of GAMs and neural networks. Eligible papers were identified, yielding 143 papers, with 430 datasets. Key attributes at both paper and dataset levels were extracted and reported. Beyond summarising comparisons, we analyse reported performance metrics using mixed-effects modelling to investigate potential characteristics that can explain and quantify observed differences, including application area, study year, sample size, number of predictors, and neural network complexity. Across datasets, no consistent evidence of superiority was found for either GAMs or neural networks when considering the most frequently reported metrics (RMSE, $R^2$, and AUC). Neural networks tended to outperform in larger datasets and in those with more predictors, but this advantage narrowed over time. Conversely, GAMs remained competitive, particularly in smaller data settings, while retaining interpretability. Reporting of dataset characteristics and neural network complexity was incomplete in much of the literature, limiting transparency and reproducibility. This review highlights that GAMs and neural networks should be viewed as complementary approaches rather than competitors. For many tabular applications, the performance trade-off is modest, and interpretability may favour GAMs.
☆ A Novel XAI-Enhanced Quantum Adversarial Networks for Velocity Dispersion Modeling in MaNGA Galaxies
Current quantum machine learning approaches often face challenges balancing predictive accuracy, robustness, and interpretability. To address this, we propose a novel quantum adversarial framework that integrates a hybrid quantum neural network (QNN) with classical deep learning layers, guided by an evaluator model with LIME-based interpretability, and extended through quantum GAN and self-supervised variants. In the proposed model, an adversarial evaluator concurrently guides the QNN by computing feedback loss, thereby optimizing both prediction accuracy and model explainability. Empirical evaluations show that the Vanilla model achieves RMSE = 0.27, MSE = 0.071, MAE = 0.21, and R^2 = 0.59, delivering the most consistent performance across regression metrics compared to adversarial counterparts. These results demonstrate the potential of combining quantum-inspired methods with classical architectures to develop lightweight, high-performance, and interpretable predictive models, advancing the applicability of QML beyond current limitations.
☆ Physics-Informed Extreme Learning Machine (PIELM): Opportunities and Challenges
We are very delighted to see the fast development of physics-informed extreme learning machine (PIELM) in recent years for higher computation efficiency and accuracy in physics-informed machine learning. As a summary or review on PIELM is currently not available, we would like to take this opportunity to show our perspective and experience for this promising research direction. We can see many efforts are made to solve PDEs with sharp gradients, nonlinearities, high-frequency behavior, hard constraints, uncertainty, multiphysics coupling. Despite the success, many urgent challenges remain to be tackled, which also provides us opportunities to develop more robust, interpretable, and generalizable PIELM frameworks with applications in science and engineering.
☆ DistDF: Time-Series Forecasting Needs Joint-Distribution Wasserstein Alignment
Training time-series forecast models requires aligning the conditional distribution of model forecasts with that of the label sequence. The standard direct forecast (DF) approach resorts to minimize the conditional negative log-likelihood of the label sequence, typically estimated using the mean squared error. However, this estimation proves to be biased in the presence of label autocorrelation. In this paper, we propose DistDF, which achieves alignment by alternatively minimizing a discrepancy between the conditional forecast and label distributions. Because conditional discrepancies are difficult to estimate from finite time-series observations, we introduce a newly proposed joint-distribution Wasserstein discrepancy for time-series forecasting, which provably upper bounds the conditional discrepancy of interest. This discrepancy admits tractable, differentiable estimation from empirical samples and integrates seamlessly with gradient-based training. Extensive experiments show that DistDF improves the performance diverse forecast models and achieves the state-of-the-art forecasting performance. Code is available at https://anonymous.4open.science/r/DistDF-F66B.
☆ LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis
With the widespread adoption of LLMs, LoRA has become a dominant method for PEFT, and its initialization methods have attracted increasing attention. However, existing methods have notable limitations: many methods do not incorporate target-domain data, while gradient-based methods exploit data only at a shallow level by relying on one-step gradient decomposition, which remains unsatisfactory due to the weak empirical performance of the one-step fine-tuning model that serves as their basis, as well as the fact that these methods either lack a rigorous theoretical foundation or depend heavily on restrictive isotropic assumptions. In this paper, we establish a theoretical framework for data-aware LoRA initialization based on asymptotic analysis. Starting from a general optimization objective that minimizes the expectation of the parameter discrepancy between the fine-tuned and target models, we derive an optimization problem with two components: a bias term, which is related to the parameter distance between the fine-tuned and target models, and is approximated using a Fisher-gradient formulation to preserve anisotropy; and a variance term, which accounts for the uncertainty introduced by sampling stochasticity through the Fisher information. By solving this problem, we obtain an optimal initialization strategy for LoRA. Building on this theoretical framework, we develop an efficient algorithm, LoRA-DA, which estimates the terms in the optimization problem from a small set of target domain samples and obtains the optimal LoRA initialization. Empirical results across multiple benchmarks demonstrate that LoRA-DA consistently improves final accuracy over existing initialization methods. Additional studies show faster, more stable convergence, robustness across ranks, and only a small initialization overhead for LoRA-DA. The source code will be released upon publication.
☆ Enforcing boundary conditions for physics-informed neural operators
Machine-learning based methods like physics-informed neural networks and physics-informed neural operators are becoming increasingly adept at solving even complex systems of partial differential equations. Boundary conditions can be enforced either weakly by penalizing deviations in the loss function or strongly by training a solution structure that inherently matches the prescribed values and derivatives. The former approach is easy to implement but the latter can provide benefits with respect to accuracy and training times. However, previous approaches to strongly enforcing Neumann or Robin boundary conditions require a domain with a fully $C^1$ boundary and, as we demonstrate, can lead to instability if those boundary conditions are posed on a segment of the boundary that is piecewise $C^1$ but only $C^0$ globally. We introduce a generalization of the approach by Sukumar \& Srivastava (doi: 10.1016/j.cma.2021.114333), and a new approach based on orthogonal projections that overcome this limitation. The performance of these new techniques is compared against weakly and semi-weakly enforced boundary conditions for the scalar Darcy flow equation and the stationary Navier-Stokes equations.
☆ Dual-Mind World Models: A General Framework for Learning in Dynamic Wireless Networks
Despite the popularity of reinforcement learning (RL) in wireless networks, existing approaches that rely on model-free RL (MFRL) and model-based RL (MBRL) are data inefficient and short-sighted. Such RL-based solutions cannot generalize to novel network states since they capture only statistical patterns rather than the underlying physics and logic from wireless data. These limitations become particularly challenging in complex wireless networks with high dynamics and long-term planning requirements. To address these limitations, in this paper, a novel dual-mind world model-based learning framework is proposed with the goal of optimizing completeness-weighted age of information (CAoI) in a challenging mmWave V2X scenario. Inspired by cognitive psychology, the proposed dual-mind world model encompasses a pattern-driven System 1 component and a logic-driven System 2 component to learn dynamics and logic of the wireless network, and to provide long-term link scheduling over reliable imagined trajectories. Link scheduling is learned through end-to-end differentiable imagined trajectories with logical consistency over an extended horizon rather than relying on wireless data obtained from environment interactions. Moreover, through imagination rollouts, the proposed world model can jointly reason network states and plan link scheduling. During intervals without observations, the proposed method remains capable of making efficient decisions. Extensive experiments are conducted on a realistic simulator based on Sionna with real-world physical channel, ray-tracing, and scene objects with material properties. Simulation results show that the proposed world model achieves a significant improvement in data efficiency and achieves strong generalization and adaptation to unseen environments, compared to the state-of-the-art RL baselines, and the world model approach with only System 1.
☆ Unsupervised Machine-Learning Pipeline for Data-Driven Defect Detection and Characterisation: Application to Displacement Cascades
Neutron irradiation produces, within a few picoseconds, displacement cascades that are sequences of atomic collisions generating point and extended defects which subsequently affects the long-term evolution of materials. The diversity of these defects, characterized morphologically and statistically, defines what is called the "primary damage". In this work, we present a fully unsupervised machine learning (ML) workflow that detects and classifies these defects directly from molecular dynamics data. Local environments are encoded by the Smooth Overlap of Atomic Positions (SOAP) vector, anomalous atoms are isolated with autoencoder neural networks (AE), embedded with Uniform Manifold Approximation and Projection (UMAP) and clustered using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). Applied to 80 keV displacement cascades in Ni, Fe$_7$0Ni$_{10}$Cr$_{20}$, and Zr, the AE successfully identify the small fraction of outlier atoms that participate in defect formation. HDBSCAN then partitions the UMAP latent space of AE-flagged SOAP descriptors into well defined groups representing vacancy- and interstitial-dominated regions and, within each, separates small from large aggregates, assigning 99.7 % of outliers to compact physical motifs. A signed cluster-identification score confirms this separation, and cluster size scales with net defect counts (R2 > 0.89). Statistical cross analyses between the ML outlier map and several conventional detectors (centrosymmetry, dislocation extraction, etc.) reveal strong overlap and complementary coverage, all achieved without template or threshold tuning. This ML workflow thus provides an efficient tool for the quantitative mapping of structural anomalies in materials, particularly those arising from irradiation damage in displacement cascades.
comment: 22 pages, 1 graphical abstract, 7 figures, 4 tables
☆ Local Performance vs. Out-of-Distribution Generalization: An Empirical Analysis of Personalized Federated Learning in Heterogeneous Data Environments
In the context of Federated Learning with heterogeneous data environments, local models tend to converge to their own local model optima during local training steps, deviating from the overall data distributions. Aggregation of these local updates, e.g., with FedAvg, often does not align with the global model optimum (client drift), resulting in an update that is suboptimal for most clients. Personalized Federated Learning approaches address this challenge by exclusively focusing on the average local performances of clients' models on their own data distribution. Generalization to out-of-distribution samples, which is a substantial benefit of FedAvg and represents a significant component of robustness, appears to be inadequately incorporated into the assessment and evaluation processes. This study involves a thorough evaluation of Federated Learning approaches, encompassing both their local performance and their generalization capabilities. Therefore, we examine different stages within a single communication round to enable a more nuanced understanding of the considered metrics. Furthermore, we propose and incorporate a modified approach of FedAvg, designated as Federated Learning with Individualized Updates (FLIU), extending the algorithm by a straightforward individualization step with an adaptive personalization factor. We evaluate and compare the approaches empirically using MNIST and CIFAR-10 under various distributional conditions, including benchmark IID and pathological non-IID, as well as additional novel test environments with Dirichlet distribution specifically developed to stress the algorithms on complex data heterogeneity.
☆ MIMIC-Sepsis: A Curated Benchmark for Modeling and Learning from Sepsis Trajectories in the ICU
Sepsis is a leading cause of mortality in intensive care units (ICUs), yet existing research often relies on outdated datasets, non-reproducible preprocessing pipelines, and limited coverage of clinical interventions. We introduce MIMIC-Sepsis, a curated cohort and benchmark framework derived from the MIMIC-IV database, designed to support reproducible modeling of sepsis trajectories. Our cohort includes 35,239 ICU patients with time-aligned clinical variables and standardized treatment data, including vasopressors, fluids, mechanical ventilation and antibiotics. We describe a transparent preprocessing pipeline-based on Sepsis-3 criteria, structured imputation strategies, and treatment inclusion-and release it alongside benchmark tasks focused on early mortality prediction, length-of-stay estimation, and shock onset classification. Empirical results demonstrate that incorporating treatment variables substantially improves model performance, particularly for Transformer-based architectures. MIMIC-Sepsis serves as a robust platform for evaluating predictive and sequential models in critical care research.
☆ Sample-efficient and Scalable Exploration in Continuous-Time RL
Reinforcement learning algorithms are typically designed for discrete-time dynamics, even though the underlying real-world control systems are often continuous in time. In this paper, we study the problem of continuous-time reinforcement learning, where the unknown system dynamics are represented using nonlinear ordinary differential equations (ODEs). We leverage probabilistic models, such as Gaussian processes and Bayesian neural networks, to learn an uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily maximizes a weighted sum of the extrinsic reward and model epistemic uncertainty. This yields a scalable and sample-efficient approach to continuous-time model-based RL. We show that COMBRL achieves sublinear regret in the reward-driven setting, and in the unsupervised RL setting (i.e., without extrinsic rewards), we provide a sample complexity bound. In our experiments, we evaluate COMBRL in both standard and unsupervised RL settings and demonstrate that it scales better, is more sample-efficient than prior methods, and outperforms baselines across several deep RL tasks.
comment: 26 pages, 6 figures, 6 tables
☆ Methodology for Comparing Machine Learning Algorithms for Survival Analysis
This study presents a comparative methodological analysis of six machine learning models for survival analysis (MLSA). Using data from nearly 45,000 colorectal cancer patients in the Hospital-Based Cancer Registries of S\~ao Paulo, we evaluated Random Survival Forest (RSF), Gradient Boosting for Survival Analysis (GBSA), Survival SVM (SSVM), XGBoost-Cox (XGB-Cox), XGBoost-AFT (XGB-AFT), and LightGBM (LGBM), capable of predicting survival considering censored data. Hyperparameter optimization was performed with different samplers, and model performance was assessed using the Concordance Index (C-Index), C-Index IPCW, time-dependent AUC, and Integrated Brier Score (IBS). Survival curves produced by the models were compared with predictions from classification algorithms, and predictor interpretation was conducted using SHAP and permutation importance. XGB-AFT achieved the best performance (C-Index = 0.7618; IPCW = 0.7532), followed by GBSA and RSF. The results highlight the potential and applicability of MLSA to improve survival prediction and support decision making.
☆ Non-Singularity of the Gradient Descent map for Neural Networks with Piecewise Analytic Activations
The theory of training deep networks has become a central question of modern machine learning and has inspired many practical advancements. In particular, the gradient descent (GD) optimization algorithm has been extensively studied in recent years. A key assumption about GD has appeared in several recent works: the \emph{GD map is non-singular} -- it preserves sets of measure zero under preimages. Crucially, this assumption has been used to prove that GD avoids saddle points and maxima, and to establish the existence of a computable quantity that determines the convergence to global minima (both for GD and stochastic GD). However, the current literature either assumes the non-singularity of the GD map or imposes restrictive assumptions, such as Lipschitz smoothness of the loss (for example, Lipschitzness does not hold for deep ReLU networks with the cross-entropy loss) and restricts the analysis to GD with small step-sizes. In this paper, we investigate the neural network map as a function on the space of weights and biases. We also prove, for the first time, the non-singularity of the gradient descent (GD) map on the loss landscape of realistic neural network architectures (with fully connected, convolutional, or softmax attention layers) and piecewise analytic activations (which includes sigmoid, ReLU, leaky ReLU, etc.) for almost all step-sizes. Our work significantly extends the existing results on the convergence of GD and SGD by guaranteeing that they apply to practical neural network settings and has the potential to unlock further exploration of learning dynamics.
☆ ARIMA_PLUS: Large-scale, Accurate, Automatic and Interpretable In-Database Time Series Forecasting and Anomaly Detection in Google BigQuery
Time series forecasting and anomaly detection are common tasks for practitioners in industries such as retail, manufacturing, advertising and energy. Two unique challenges stand out: (1) efficiently and accurately forecasting time series or detecting anomalies in large volumes automatically; and (2) ensuring interpretability of results to effectively incorporate business insights. We present ARIMA_PLUS, a novel framework to overcome these two challenges by a unique combination of (a) accurate and interpretable time series models and (b) scalable and fully managed system infrastructure. The model has a sequential and modular structure to handle different components of the time series, including holiday effects, seasonality, trend, and anomalies, which enables high interpretability of the results. Novel enhancements are made to each module, and a unified framework is established to address both forecasting and anomaly detection tasks simultaneously. In terms of accuracy, its comprehensive benchmark on the 42 public datasets in the Monash forecasting repository shows superior performance over not only well-established statistical alternatives (such as ETS, ARIMA, TBATS, Prophet) but also newer neural network models (such as DeepAR, N-BEATS, PatchTST, TimeMixer). In terms of infrastructure, it is directly built into the query engine of BigQuery in Google Cloud. It uses a simple SQL interface and automates tedious technicalities such as data cleaning and model selection. It automatically scales with managed cloud computational and storage resources, making it possible to forecast 100 million time series using only 1.5 hours with a throughput of more than 18000 time series per second. In terms of interpretability, we present several case studies to demonstrate time series insights it generates and customizability it offers.
☆ Nearest Neighbor Matching as Least Squares Density Ratio Estimation and Riesz Regression
This study proves that Nearest Neighbor (NN) matching can be interpreted as an instance of Riesz regression for automatic debiased machine learning. Lin et al. (2023) shows that NN matching is an instance of density-ratio estimation with their new density-ratio estimator. Chernozhukov et al. (2024) develops Riesz regression for automatic debiased machine learning, which directly estimates the Riesz representer (or equivalently, the bias-correction term) by minimizing the mean squared error. In this study, we first prove that the density-ratio estimation method proposed in Lin et al. (2023) is essentially equivalent to Least-Squares Importance Fitting (LSIF) proposed in Kanamori et al. (2009) for direct density-ratio estimation. Furthermore, we derive Riesz regression using the LSIF framework. Based on these results, we derive NN matching from Riesz regression. This study is based on our work Kato (2025a) and Kato (2025b).
☆ Fill in the Blanks: Accelerating Q-Learning with a Handful of Demonstrations in Sparse Reward Settings
Reinforcement learning (RL) in sparse-reward environments remains a significant challenge due to the lack of informative feedback. We propose a simple yet effective method that uses a small number of successful demonstrations to initialize the value function of an RL agent. By precomputing value estimates from offline demonstrations and using them as targets for early learning, our approach provides the agent with a useful prior over promising actions. The agent then refines these estimates through standard online interaction. This hybrid offline-to-online paradigm significantly reduces the exploration burden and improves sample efficiency in sparse-reward settings. Experiments on benchmark tasks demonstrate that our method accelerates convergence and outperforms standard baselines, even with minimal or suboptimal demonstration data.
☆ Attack on a PUF-based Secure Binary Neural Network
Binarized Neural Networks (BNNs) deployed on memristive crossbar arrays provide energy-efficient solutions for edge computing but are susceptible to physical attacks due to memristor nonvolatility. Recently, Rajendran et al. (IEEE Embedded Systems Letter 2025) proposed a Physical Unclonable Function (PUF)-based scheme to secure BNNs against theft attacks. Specifically, the weight and bias matrices of the BNN layers were secured by swapping columns based on device's PUF key bits. In this paper, we demonstrate that this scheme to secure BNNs is vulnerable to PUF-key recovery attack. As a consequence of our attack, we recover the secret weight and bias matrices of the BNN. Our approach is motivated by differential cryptanalysis and reconstructs the PUF key bit-by-bit by observing the change in model accuracy, and eventually recovering the BNN model parameters. Evaluated on a BNN trained on the MNIST dataset, our attack could recover 85% of the PUF key, and recover the BNN model up to 93% classification accuracy compared to the original model's 96% accuracy. Our attack is very efficient and it takes a couple of minutes to recovery the PUF key and the model parameters.
comment: Accepted at VLSID 2026. To be published in IEEE Xplore
☆ APEX: Approximate-but-exhaustive search for ultra-large combinatorial synthesis libraries
Make-on-demand combinatorial synthesis libraries (CSLs) like Enamine REAL have significantly enabled drug discovery efforts. However, their large size presents a challenge for virtual screening, where the goal is to identify the top compounds in a library according to a computational objective (e.g., optimizing docking score) subject to computational constraints under a limited computational budget. For current library sizes -- numbering in the tens of billions of compounds -- and scoring functions of interest, a routine virtual screening campaign may be limited to scoring fewer than 0.1% of the available compounds, leaving potentially many high scoring compounds undiscovered. Furthermore, as constraints (and sometimes objectives) change during the course of a virtual screening campaign, existing virtual screening algorithms typically offer little room for amortization. We propose the approximate-but-exhaustive search protocol for CSLs, or APEX. APEX utilizes a neural network surrogate that exploits the structure of CSLs in the prediction of objectives and constraints to make full enumeration on a consumer GPU possible in under a minute, allowing for exact retrieval of approximate top-$k$ sets. To demonstrate APEX's capabilities, we develop a benchmark CSL comprised of more than 10 million compounds, all of which have been annotated with their docking scores on five medically relevant targets along with physicohemical properties measured with RDKit such that, for any objective and set of constraints, the ground truth top-$k$ compounds can be identified and compared against the retrievals from any virtual screening algorithm. We show APEX's consistently strong performance both in retrieval accuracy and runtime compared to alternative methods.
☆ A Comprehensive Evaluation Framework for Synthetic Trip Data Generation in Public Transport
Synthetic data offers a promising solution to the privacy and accessibility challenges of using smart card data in public transport research. Despite rapid progress in generative modeling, there is limited attention to comprehensive evaluation, leaving unclear how reliable, safe, and useful synthetic data truly are. Existing evaluations remain fragmented, typically limited to population-level representativeness or record-level privacy, without considering group-level variations or task-specific utility. To address this gap, we propose a Representativeness-Privacy-Utility (RPU) framework that systematically evaluates synthetic trip data across three complementary dimensions and three hierarchical levels (record, group, population). The framework integrates a consistent set of metrics to quantify similarity, disclosure risk, and practical usefulness, enabling transparent and balanced assessment of synthetic data quality. We apply the framework to benchmark twelve representative generation methods, spanning conventional statistical models, deep generative networks, and privacy-enhanced variants. Results show that synthetic data do not inherently guarantee privacy and there is no "one-size-fits-all" model, the trade-off between privacy and representativeness/utility is obvious. Conditional Tabular generative adversarial network (CTGAN) provide the most balanced trade-off and is suggested for practical applications. The RPU framework provides a systematic and reproducible basis for researchers and practitioners to compare synthetic data generation techniques and select appropriate methods in public transport applications.
☆ Filtering instances and rejecting predictions to obtain reliable models in healthcare
Machine Learning (ML) models are widely used in high-stakes domains such as healthcare, where the reliability of predictions is critical. However, these models often fail to account for uncertainty, providing predictions even with low confidence. This work proposes a novel two-step data-centric approach to enhance the performance of ML models by improving data quality and filtering low-confidence predictions. The first step involves leveraging Instance Hardness (IH) to filter problematic instances during training, thereby refining the dataset. The second step introduces a confidence-based rejection mechanism during inference, ensuring that only reliable predictions are retained. We evaluate our approach using three real-world healthcare datasets, demonstrating its effectiveness at improving model reliability while balancing predictive performance and rejection rate. Additionally, we use alternative criteria - influence values for filtering and uncertainty for rejection - as baselines to evaluate the efficiency of the proposed method. The results demonstrate that integrating IH filtering with confidence-based rejection effectively enhances model performance while preserving a large proportion of instances. This approach provides a practical method for deploying ML systems in safety-critical applications.
comment: This paper is under review at Machine Learning (Springer)
☆ Perception Learning: A Formal Separation of Sensory Representation Learning from Decision Learning
We introduce Perception Learning (PeL), a paradigm that optimizes an agent's sensory interface $f_\phi:\mathcal{X}\to\mathcal{Z}$ using task-agnostic signals, decoupled from downstream decision learning $g_\theta:\mathcal{Z}\to\mathcal{Y}$. PeL directly targets label-free perceptual properties, such as stability to nuisances, informativeness without collapse, and controlled geometry, assessed via objective representation-invariant metrics. We formalize the separation of perception and decision, define perceptual properties independent of objectives or reparameterizations, and prove that PeL updates preserving sufficient invariants are orthogonal to Bayes task-risk gradients. Additionally, we provide a suite of task-agnostic evaluation metrics to certify perceptual quality.
☆ What do vision-language models see in the context? Investigating multimodal in-context learning
In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. Although it has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic study of ICL in VLMs, evaluating seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL. To our knowledge, we are the first to analyze how attention patterns in VLMs vary with an increasing number of in-context demonstrations. Our results reveal that training on imag-text interleaved data enhances ICL performance but does not imply effective integration of visual and textual information from demonstration examples. In contrast, instruction tuning improves instruction-following but can reduce reliance on in-context demonstrations, suggesting a trade-off between instruction alignment and in-context adaptation. Attention analyses further show that current VLMs primarily focus on textual cues and fail to leverage visual information, suggesting a limited capacity for multimodal integration. These findings highlight key limitations in the ICL abilities of current VLMs and provide insights for enhancing their ability to learn from multimodal in-context examples.
Transformers can do Bayesian Clustering
Bayesian clustering accounts for uncertainty but is computationally demanding at scale. Furthermore, real-world datasets often contain missing values, and simple imputation ignores the associated uncertainty, resulting in suboptimal results. We present Cluster-PFN, a Transformer-based model that extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. Trained entirely on synthetic datasets generated from a finite Gaussian Mixture Model (GMM) prior, Cluster-PFN learns to estimate the posterior distribution over both the number of clusters and the cluster assignments. Our method estimates the number of clusters more accurately than handcrafted model selection procedures such as AIC, BIC and Variational Inference (VI), and achieves clustering quality competitive with VI while being orders of magnitude faster. Cluster-PFN can be trained on complex priors that include missing data, outperforming imputation-based baselines on real-world genomic datasets, at high missingness. These results show that the Cluster-PFN can provide scalable and flexible Bayesian clustering.
☆ EDC: Equation Discovery for Classification
Equation Discovery techniques have shown considerable success in regression tasks, where they are used to discover concise and interpretable models (\textit{Symbolic Regression}). In this paper, we propose a new ED-based binary classification framework. Our proposed method EDC finds analytical functions of manageable size that specify the location and shape of the decision boundary. In extensive experiments on artificial and real-life data, we demonstrate how EDC is able to discover both the structure of the target equation as well as the value of its parameters, outperforming the current state-of-the-art ED-based classification methods in binary classification and achieving performance comparable to the state of the art in binary classification. We suggest a grammar of modest complexity that appears to work well on the tested datasets but argue that the exact grammar -- and thus the complexity of the models -- is configurable, and especially domain-specific expressions can be included in the pattern language, where that is required. The presented grammar consists of a series of summands (additive terms) that include linear, quadratic and exponential terms, as well as products of two features (producing hyperbolic curves ideal for capturing XOR-like dependencies). The experiments demonstrate that this grammar allows fairly flexible decision boundaries while not so rich to cause overfitting.
comment: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in Lecture Notes in Computer Science, and is available online at https://doi.org/10.1007/978-3-032-05461-6_9
☆ Problem-Parameter-Free Decentralized Bilevel Optimization NeurIPS 2025
Decentralized bilevel optimization has garnered significant attention due to its critical role in solving large-scale machine learning problems. However, existing methods often rely on prior knowledge of problem parameters-such as smoothness, convexity, or communication network topologies-to determine appropriate stepsizes. In practice, these problem parameters are typically unavailable, leading to substantial manual effort for hyperparameter tuning. In this paper, we propose AdaSDBO, a fully problem-parameter-free algorithm for decentralized bilevel optimization with a single-loop structure. AdaSDBO leverages adaptive stepsizes based on cumulative gradient norms to update all variables simultaneously, dynamically adjusting its progress and eliminating the need for problem-specific hyperparameter tuning. Through rigorous theoretical analysis, we establish that AdaSDBO achieves a convergence rate of $\widetilde{\mathcal{O}}\left(\frac{1}{T}\right)$, matching the performance of well-tuned state-of-the-art methods up to polylogarithmic factors. Extensive numerical experiments demonstrate that AdaSDBO delivers competitive performance compared to existing decentralized bilevel optimization methods while exhibiting remarkable robustness across diverse stepsize configurations.
comment: Accepted by NeurIPS 2025
☆ Towards actionable hypotension prediction- predicting catecholamine therapy initiation in the intensive care unit
Hypotension in critically ill ICU patients is common and life-threatening. Escalation to catecholamine therapy marks a key management step, with both undertreatment and overtreatment posing risks. Most machine learning (ML) models predict hypotension using fixed MAP thresholds or MAP forecasting, overlooking the clinical decision behind treatment escalation. Predicting catecholamine initiation, the start of vasoactive or inotropic agent administration offers a more clinically actionable target reflecting real decision-making. Using the MIMIC-III database, we modeled catecholamine initiation as a binary event within a 15-minute prediction window. Input features included statistical descriptors from a two-hour sliding MAP context window, along with demographics, biometrics, comorbidities, and ongoing treatments. An Extreme Gradient Boosting (XGBoost) model was trained and interpreted via SHapley Additive exPlanations (SHAP). The model achieved an AUROC of 0.822 (0.813-0.830), outperforming the hypotension baseline (MAP < 65, AUROC 0.686 [0.675-0.699]). SHAP analysis highlighted recent MAP values, MAP trends, and ongoing treatments (e.g., sedatives, electrolytes) as dominant predictors. Subgroup analysis showed higher performance in males, younger patients (<53 years), those with higher BMI (>32), and patients without comorbidities or concurrent medications. Predicting catecholamine initiation based on MAP dynamics, treatment context, and patient characteristics supports the critical decision of when to escalate therapy, shifting focus from threshold-based alarms to actionable decision support. This approach is feasible across a broad ICU cohort under natural event imbalance. Future work should enrich temporal and physiological context, extend label definitions to include therapy escalation, and benchmark against existing hypotension prediction systems.
comment: 27 pages, 8 figures, source code under https://github.com/AI4HealthUOL/actionable-hypotension
☆ HergNet: a Fast Neural Surrogate Model for Sound Field Predictions via Superposition of Plane Waves
We present a novel neural network architecture for the efficient prediction of sound fields in two and three dimensions. The network is designed to automatically satisfy the Helmholtz equation, ensuring that the outputs are physically valid. Therefore, the method can effectively learn solutions to boundary-value problems in various wave phenomena, such as acoustics, optics, and electromagnetism. Numerical experiments show that the proposed strategy can potentially outperform state-of-the-art methods in room acoustics simulation, in particular in the range of mid to high frequencies.
☆ SALS: Sparse Attention in Latent Space for KV cache Compression
Large Language Models capable of handling extended contexts are in high demand, yet their inference remains challenging due to substantial Key-Value cache size and high memory bandwidth requirements. Previous research has demonstrated that KV cache exhibits low-rank characteristics within the hidden dimension, suggesting the potential for effective compression. However, due to the widely adopted Rotary Position Embedding mechanism in modern LLMs, naive low-rank compression suffers severe accuracy degradation or creates a new speed bottleneck, as the low-rank cache must first be reconstructed in order to apply RoPE. In this paper, we introduce two key insights: first, the application of RoPE to the key vectors increases their variance, which in turn results in a higher rank; second, after the key vectors are transformed into the latent space, they largely maintain their representation across most layers. Based on these insights, we propose the Sparse Attention in Latent Space framework. SALS projects the KV cache into a compact latent space via low-rank projection, and performs sparse token selection using RoPE-free query-key interactions in this space. By reconstructing only a small subset of important tokens, it avoids the overhead of full KV cache reconstruction. We comprehensively evaluate SALS on various tasks using two large-scale models: LLaMA2-7b-chat and Mistral-7b, and additionally verify its scalability on the RULER-128k benchmark with LLaMA3.1-8B-Instruct. Experimental results demonstrate that SALS achieves SOTA performance by maintaining competitive accuracy. Under different settings, SALS achieves 6.4-fold KV cache compression and 5.7-fold speed-up in the attention operator compared to FlashAttention2 on the 4K sequence. For the end-to-end throughput performance, we achieves 1.4-fold and 4.5-fold improvement compared to GPT-fast on 4k and 32K sequences, respectively.
☆ UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation NeurIPS 2025
Data augmentation using generative models has emerged as a powerful paradigm for enhancing performance in computer vision tasks. However, most existing augmentation approaches primarily focus on optimizing intrinsic data attributes -- such as fidelity and diversity -- to generate visually high-quality synthetic data, while often neglecting task-specific requirements. Yet, it is essential for data generators to account for the needs of downstream tasks, as training data requirements can vary significantly across different tasks and network architectures. To address these limitations, we propose UtilGen, a novel utility-centric data augmentation framework that adaptively optimizes the data generation process to produce task-specific, high-utility training data via downstream task feedback. Specifically, we first introduce a weight allocation network to evaluate the task-specific utility of each synthetic sample. Guided by these evaluations, UtilGen iteratively refines the data generation process using a dual-level optimization strategy to maximize the synthetic data utility: (1) model-level optimization tailors the generative model to the downstream task, and (2) instance-level optimization adjusts generation policies -- such as prompt embeddings and initial noise -- at each generation round. Extensive experiments on eight benchmark datasets of varying complexity and granularity demonstrate that UtilGen consistently achieves superior performance, with an average accuracy improvement of 3.87% over previous SOTA. Further analysis of data influence and distribution reveals that UtilGen produces more impactful and task-relevant synthetic data, validating the effectiveness of the paradigm shift from visual characteristics-centric to task utility-centric data augmentation.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ From Memorization to Reasoning in the Spectrum of Loss Curvature
We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on the loss landscape curvature. This insight is based on prior theoretical and empirical work showing that the curvature for memorized training points is much sharper than non memorized, meaning ordering weight components from high to low curvature can reveal a distinction without explicit labels. This motivates a weight editing procedure that suppresses far more recitation of untargeted memorized data more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the basis of curvature has a natural interpretation for shared structure in model weights, we analyze the editing procedure extensively on its effect on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently negatively affected, even though open book fact retrieval and general logical reasoning is conserved. We posit these tasks rely heavily on specialized directions in weight space rather than general purpose mechanisms, regardless of whether those individual datapoints are memorized. We support this by showing a correspondence between task data's activation strength with low curvature components that we edit out, and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly-used structures involved in solving tasks like math and fact retrieval.
☆ Forecasting precipitation in the Arctic using probabilistic machine learning informed by causal climate drivers
Understanding and forecasting precipitation events in the Arctic maritime environments, such as Bear Island and Ny-{\AA}lesund, is crucial for assessing climate risk and developing early warning systems in vulnerable marine regions. This study proposes a probabilistic machine learning framework for modeling and predicting the dynamics and severity of precipitation. We begin by analyzing the scale-dependent relationships between precipitation and key atmospheric drivers (e.g., temperature, relative humidity, cloud cover, and air pressure) using wavelet coherence, which captures localized dependencies across time and frequency domains. To assess joint causal influences, we employ Synergistic-Unique-Redundant Decomposition, which quantifies the impact of interaction effects among each variable on future precipitation dynamics. These insights inform the development of data-driven forecasting models that incorporate both historical precipitation and causal climate drivers. To account for uncertainty, we employ the conformal prediction method, which enables the generation of calibrated non-parametric prediction intervals. Our results underscore the importance of utilizing a comprehensive framework that combines causal analysis with probabilistic forecasting to enhance the reliability and interpretability of precipitation predictions in Arctic marine environments.
☆ Enabling Near-realtime Remote Sensing via Satellite-Ground Collaboration of Large Vision-Language Models
Large vision-language models (LVLMs) have recently demonstrated great potential in remote sensing (RS) tasks (e.g., disaster monitoring) conducted by low Earth orbit (LEO) satellites. However, their deployment in real-world LEO satellite systems remains largely unexplored, hindered by limited onboard computing resources and brief satellite-ground contacts. We propose Grace, a satellite-ground collaborative system designed for near-realtime LVLM inference in RS tasks. Accordingly, we deploy compact LVLM on satellites for realtime inference, but larger ones on ground stations (GSs) to guarantee end-to-end performance. Grace is comprised of two main phases that are asynchronous satellite-GS Retrieval-Augmented Generation (RAG), and a task dispatch algorithm. Firstly, we still the knowledge archive of GS RAG to satellite archive with tailored adaptive update algorithm during limited satellite-ground data exchange period. Secondly, propose a confidence-based test algorithm that either processes the task onboard the satellite or offloads it to the GS. Extensive experiments based on real-world satellite orbital data show that Grace reduces the average latency by 76-95% compared to state-of-the-art methods, without compromising inference accuracy.
comment: 15 pages, 11 figures
☆ Temporal Knowledge Graph Hyperedge Forecasting: Exploring Entity-to-Category Link Prediction
Temporal Knowledge Graphs have emerged as a powerful way of not only modeling static relationships between entities but also the dynamics of how relations evolve over time. As these informational structures can be used to store information from a real-world setting, such as a news flow, predicting future graph components to a certain extent equates predicting real-world events. Most of the research in this field focuses on embedding-based methods, often leveraging convolutional neural net architectures. These solutions act as black boxes, limiting insight. In this paper, we explore an extension to an established rule-based framework, TLogic, that yields a high accuracy in combination with explainable predictions. This offers transparency and allows the end-user to critically evaluate the rules applied at the end of the prediction stage. The new rule format incorporates entity category as a key component with the purpose of limiting rule application only to relevant entities. When categories are unknown for building the graph, we propose a data-driven method to generate them with an LLM-based approach. Additionally, we investigate the choice of aggregation method for scores of retrieved entities when performing category prediction.
☆ PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling
Reward models (RMs) are central to reinforcement learning from human feedback (RLHF), providing the critical supervision signals that align large language models (LLMs) with human preferences. While generative reward models (GRMs) offer greater interpretability than traditional scalar RMs, current training paradigms remain limited. Pair-wise methods rely on binary good-versus-bad labels, which cause mismatches for point-wise inference and necessitate complex pairing strategies for effective application in RLHF. On the other hand, point-wise methods require more elaborate absolute labeling with rubric-driven criteria, resulting in poor adaptability and high annotation costs. In this work, we propose the Preference-Aware Task-Adaptive Reward Model (PaTaRM), a unified framework that integrates a preference-aware reward (PAR) mechanism with dynamic rubric adaptation. PaTaRM leverages relative preference information from pairwise data to construct robust point-wise training signals, eliminating the need for explicit point-wise labels. Simultaneously, it employs a task-adaptive rubric system that flexibly generates evaluation criteria for both global task consistency and instance-specific fine-grained reasoning. This design enables efficient, generalizable, and interpretable reward modeling for RLHF. Extensive experiments show that PaTaRM achieves an average relative improvement of 4.7% on RewardBench and RMBench across Qwen3-8B and Qwen3-14B models. Furthermore, PaTaRM boosts downstream RLHF performance, with an average improvement of 13.6% across IFEval and InFoBench benchmarks, confirming its effectiveness and robustness. Our code is available at https://github.com/JaneEyre0530/PaTaRM.
☆ Sparse Optimistic Information Directed Sampling
Many high-dimensional online decision-making problems can be modeled as stochastic sparse linear bandits. Most existing algorithms are designed to achieve optimal worst-case regret in either the data-rich regime, where polynomial depen- dence on the ambient dimension is unavoidable, or the data-poor regime, where dimension-independence is possible at the cost of worse dependence on the num- ber of rounds. In contrast, the sparse Information Directed Sampling (IDS) algo- rithm satisfies a Bayesian regret bound that has the optimal rate in both regimes simultaneously. In this work, we explore the use of Sparse Optimistic Informa- tion Directed Sampling (SOIDS) to achieve the same adaptivity in the worst-case setting, without Bayesian assumptions. Through a novel analysis that enables the use of a time-dependent learning rate, we show that SOIDS can optimally balance information and regret. Our results extend the theoretical guarantees of IDS, pro- viding the first algorithm that simultaneously achieves optimal worst-case regret in both the data-rich and data-poor regimes. We empirically demonstrate the good performance of SOIDS.
☆ PRIVET: Privacy Metric Based on Extreme Value Theory
Deep generative models are often trained on sensitive data, such as genetic sequences, health data, or more broadly, any copyrighted, licensed or protected content. This raises critical concerns around privacy-preserving synthetic data, and more specifically around privacy leakage, an issue closely tied to overfitting. Existing methods almost exclusively rely on global criteria to estimate the risk of privacy failure associated to a model, offering only quantitative non interpretable insights. The absence of rigorous evaluation methods for data privacy at the sample-level may hinder the practical deployment of synthetic data in real-world applications. Using extreme value statistics on nearest-neighbor distances, we propose PRIVET, a generic sample-based, modality-agnostic algorithm that assigns an individual privacy leak score to each synthetic sample. We empirically demonstrate that PRIVET reliably detects instances of memorization and privacy leakage across diverse data modalities, including settings with very high dimensionality, limited sample sizes such as genetic data and even under underfitting regimes. We compare our method to existing approaches under controlled settings and show its advantage in providing both dataset level and sample level assessments through qualitative and quantitative outputs. Additionally, our analysis reveals limitations in existing computer vision embeddings to yield perceptually meaningful distances when identifying near-duplicate samples.
☆ A comparison between joint and dual UKF implementations for state estimation and leak localization in water distribution networks
The sustainability of modern cities highly depends on efficient water distribution management, including effective pressure control and leak detection and localization. Accurate information about the network hydraulic state is therefore essential. This article presents a comparison between two data-driven state estimation methods based on the Unscented Kalman Filter (UKF), fusing pressure, demand and flow data for head and flow estimation. One approach uses a joint state vector with a single estimator, while the other uses a dual-estimator scheme. We analyse their main characteristics, discussing differences, advantages and limitations, and compare them theoretically in terms of accuracy and complexity. Finally, we show several estimation results for the L-TOWN benchmark, allowing to discuss their properties in a real implementation.
comment: This work has been submitted to ECC2026 for review. It has 7 pages and 2 figures
☆ Closing Gaps: An Imputation Analysis of ICU Vital Signs
As more Intensive Care Unit (ICU) data becomes available, the interest in developing clinical prediction models to improve healthcare protocols increases. However, the lack of data quality still hinders clinical prediction using Machine Learning (ML). Many vital sign measurements, such as heart rate, contain sizeable missing segments, leaving gaps in the data that could negatively impact prediction performance. Previous works have introduced numerous time-series imputation techniques. Nevertheless, more comprehensive work is needed to compare a representative set of methods for imputing ICU vital signs and determine the best practice. In reality, ad-hoc imputation techniques that could decrease prediction accuracy, like zero imputation, are still used. In this work, we compare established imputation techniques to guide researchers in improving the performance of clinical prediction models by selecting the most accurate imputation technique. We introduce an extensible and reusable benchmark with currently 15 imputation and 4 amputation methods, created for benchmarking on major ICU datasets. We hope to provide a comparative basis and facilitate further ML development to bring more models into clinical practice.
comment: Preprint
☆ Unlocking Out-of-Distribution Generalization in Dynamics through Physics-Guided Augmentation
In dynamical system modeling, traditional numerical methods are limited by high computational costs, while modern data-driven approaches struggle with data scarcity and distribution shifts. To address these fundamental limitations, we first propose SPARK, a physics-guided quantitative augmentation plugin. Specifically, SPARK utilizes a reconstruction autoencoder to integrate physical parameters into a physics-rich discrete state dictionary. This state dictionary then acts as a structured dictionary of physical states, enabling the creation of new, physically-plausible training samples via principled interpolation in the latent space. Further, for downstream prediction, these augmented representations are seamlessly integrated with a Fourier-enhanced Graph ODE, a combination designed to robustly model the enriched data distribution while capturing long-term temporal dependencies. Extensive experiments on diverse benchmarks demonstrate that SPARK significantly outperforms state-of-the-art baselines, particularly in challenging out-of-distribution scenarios and data-scarce regimes, proving the efficacy of our physics-guided augmentation paradigm.
☆ What Can Be Recovered Under Sparse Adversarial Corruption? Assumption-Free Theory for Linear Measurements
Let \(\bm{A} \in \mathbb{R}^{m \times n}\) be an arbitrary, known matrix and \(\bm{e}\) a \(q\)-sparse adversarial vector. Given \(\bm{y} = \bm{A} x^* + \bm{e}\) and \(q\), we seek the smallest set containing \(x^*\)-hence the one conveying maximal information about \(x^*\)-that is uniformly recoverable from \(\bm{y}\) without knowing \(\bm{e}\). While exact recovery of \(x^*\) via strong (and often impractical) structural assumptions on \(\bm{A}\) or \(x^*\) (for example, restricted isometry, sparsity) is well studied, recoverability for arbitrary \(\bm{A}\) and \(x^*\) remains open. Our main result shows that the best that one can hope to recover is \(x^* + \ker(\bm{U})\), where \(\bm{U}\) is the unique projection matrix onto the intersection of rowspaces of all possible submatrices of \(\bm{A}\) obtained by deleting \(2q\) rows. Moreover, we prove that every \(x\) that minimizes the \(\ell\_0\)-norm of \(\bm{y} - \bm{A} x\) lies in \(x^* + \ker(\bm{U})\), which then gives a constructive approach to recover this set.
☆ Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment
Large Language Models (LLMs) encode vast amounts of knowledge in their massive parameters, which is accessible to locate, trace, and analyze. Despite advances in neural interpretability, it is still not clear how to transfer knowledge in a fine-grained manner, namely parametric knowledge transfer (PKT). A key problem is enabling effective and efficient knowledge transfer across LLMs of different scales, which is essential for achieving greater flexibility and broader applicability in transferring knowledge between LLMs. Due to neural incompatibility, referring to the architectural and parametric differences between LLMs of varying scales, existing methods that directly reuse layer parameters are severely limited. In this paper, we identify the semantic alignment in latent space as the fundamental prerequisite for LLM cross-scale knowledge transfer. Instead of directly using the layer parameters, our approach takes activations as the medium of layer-wise knowledge transfer. Leveraging the semantics in latent space, our approach is simple and outperforms prior work, better aligning model behaviors across varying scales. Evaluations on four benchmarks demonstrate the efficacy of our method. Further analysis reveals the key factors easing cross-scale knowledge transfer and provides insights into the nature of latent semantic alignment.
comment: an early-stage version
☆ SPEAR++: Scaling Gradient Inversion via Sparsely-Used Dictionary Learning NeurIPS 2025
Federated Learning has seen an increased deployment in real-world scenarios recently, as it enables the distributed training of machine learning models without explicit data sharing between individual clients. Yet, the introduction of the so-called gradient inversion attacks has fundamentally challenged its privacy-preserving properties. Unfortunately, as these attacks mostly rely on direct data optimization without any formal guarantees, the vulnerability of real-world systems remains in dispute and requires tedious testing for each new federated deployment. To overcome these issues, recently the SPEAR attack was introduced, which is based on a theoretical analysis of the gradients of linear layers with ReLU activations. While SPEAR is an important theoretical breakthrough, the attack's practicality was severely limited by its exponential runtime in the batch size b. In this work, we fill this gap by applying State-of-the-Art techniques from Sparsely-Used Dictionary Learning to make the problem of gradient inversion on linear layers with ReLU activations tractable. Our experiments demonstrate that our new attack, SPEAR++, retains all desirable properties of SPEAR, such as robustness to DP noise and FedAvg aggregation, while being applicable to 10x bigger batch sizes.
comment: Published at the Workshop on Regulatable ML at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
☆ Blindfolded Experts Generalize Better: Insights from Robotic Manipulation and Videogames
Behavioral cloning is a simple yet effective technique for learning sequential decision-making from demonstrations. Recently, it has gained prominence as the core of foundation models for the physical world, where achieving generalization requires countless demonstrations of a multitude of tasks. Typically, a human expert with full information on the task demonstrates a (nearly) optimal behavior. In this paper, we propose to hide some of the task's information from the demonstrator. This ``blindfolded'' expert is compelled to employ non-trivial exploration to solve the task. We show that cloning the blindfolded expert generalizes better to unseen tasks than its fully-informed counterpart. We conduct experiments of real-world robot peg insertion tasks with (limited) human demonstrations, alongside videogames from the Procgen benchmark. Additionally, we support our findings with theoretical analysis, which confirms that the generalization error scales with $\sqrt{I/m}$, where $I$ measures the amount of task information available to the demonstrator, and $m$ is the number of demonstrated tasks. Both theory and practice indicate that cloning blindfolded experts generalizes better with fewer demonstrated tasks. Project page with videos and code: https://sites.google.com/view/blindfoldedexperts/home
☆ Self-Concordant Perturbations for Linear Bandits
We study the adversarial linear bandits problem and present a unified algorithmic framework that bridges Follow-the-Regularized-Leader (FTRL) and Follow-the-Perturbed-Leader (FTPL) methods, extending the known connection between them from the full-information setting. Within this framework, we introduce self-concordant perturbations, a family of probability distributions that mirror the role of self-concordant barriers previously employed in the FTRL-based SCRiBLe algorithm. Using this idea, we design a novel FTPL-based algorithm that combines self-concordant regularization with efficient stochastic exploration. Our approach achieves a regret of $O(d\sqrt{n \ln n})$ on both the $d$-dimensional hypercube and the Euclidean ball. On the Euclidean ball, this matches the rate attained by existing self-concordant FTRL methods. For the hypercube, this represents a $\sqrt{d}$ improvement over these methods and matches the optimal bound up to logarithmic factors.
☆ V-SAT: Video Subtitle Annotation Tool
The surge of audiovisual content on streaming platforms and social media has heightened the demand for accurate and accessible subtitles. However, existing subtitle generation methods primarily speech-based transcription or OCR-based extraction suffer from several shortcomings, including poor synchronization, incorrect or harmful text, inconsistent formatting, inappropriate reading speeds, and the inability to adapt to dynamic audio-visual contexts. Current approaches often address isolated issues, leaving post-editing as a labor-intensive and time-consuming process. In this paper, we introduce V-SAT (Video Subtitle Annotation Tool), a unified framework that automatically detects and corrects a wide range of subtitle quality issues. By combining Large Language Models(LLMs), Vision-Language Models (VLMs), Image Processing, and Automatic Speech Recognition (ASR), V-SAT leverages contextual cues from both audio and video. Subtitle quality improved, with the SUBER score reduced from 9.6 to 3.54 after resolving all language mode issues and F1-scores of ~0.80 for image mode issues. Human-in-the-loop validation ensures high-quality results, providing the first comprehensive solution for robust subtitle annotation.
☆ EddyFormer: Accelerated Neural Simulations of Three-Dimensional Turbulence at Scale NeurIPS 2025
Computationally resolving turbulence remains a central challenge in fluid dynamics due to its multi-scale interactions. Fully resolving large-scale turbulence through direct numerical simulation (DNS) is computationally prohibitive, motivating data-driven machine learning alternatives. In this work, we propose EddyFormer, a Transformer-based spectral-element (SEM) architecture for large-scale turbulence simulation that combines the accuracy of spectral methods with the scalability of the attention mechanism. We introduce an SEM tokenization that decomposes the flow into grid-scale and subgrid-scale components, enabling capture of both local and global features. We create a new three-dimensional isotropic turbulence dataset and train EddyFormer to achieves DNS-level accuracy at 256^3 resolution, providing a 30x speedup over DNS. When applied to unseen domains up to 4x larger than in training, EddyFormer preserves accuracy on physics-invariant metrics-energy spectra, correlation functions, and structure functions-showing domain generalization. On The Well benchmark suite of diverse turbulent flows, EddyFormer resolves cases where prior ML models fail to converge, accurately reproducing complex dynamics across a wide range of physical conditions.
comment: NeurIPS 2025
☆ Identifiable learning of dissipative dynamics
Complex dissipative systems appear across science and engineering, from polymers and active matter to learning algorithms. These systems operate far from equilibrium, where energy dissipation and time irreversibility are key to their behavior, but are difficult to quantify from data. Learning accurate and interpretable models of such dynamics remains a major challenge: the models must be expressive enough to describe diverse processes, yet constrained enough to remain physically meaningful and mathematically identifiable. Here, we introduce I-OnsagerNet, a neural framework that learns dissipative stochastic dynamics directly from trajectories while ensuring both interpretability and uniqueness. I-OnsagerNet extends the Onsager principle to guarantee that the learned potential is obtained from the stationary density and that the drift decomposes cleanly into time-reversible and time-irreversible components, as dictated by the Helmholtz decomposition. Our approach enables us to calculate the entropy production and to quantify irreversibility, offering a principled way to detect and quantify deviations from equilibrium. Applications to polymer stretching in elongational flow and to stochastic gradient Langevin dynamics reveal new insights, including super-linear scaling of barrier heights and sub-linear scaling of entropy production rates with the strain rate, and the suppression of irreversibility with increasing batch size. I-OnsagerNet thus establishes a general, data-driven framework for discovering and interpreting non-equilibrium dynamics.
Self-supervised Synthetic Pretraining for Inference of Stellar Mass Embedded in Dense Gas NeurIPS 2025
Stellar mass is a fundamental quantity that determines the properties and evolution of stars. However, estimating stellar masses in star-forming regions is challenging because young stars are obscured by dense gas and the regions are highly inhomogeneous, making spherical dynamical estimates unreliable. Supervised machine learning could link such complex structures to stellar mass, but it requires large, high-quality labeled datasets from high-resolution magneto-hydrodynamical (MHD) simulations, which are computationally expensive. We address this by pretraining a vision transformer on one million synthetic fractal images using the self-supervised framework DINOv2, and then applying the frozen model to limited high-resolution MHD simulations. Our results demonstrate that synthetic pretraining improves frozen-feature regression stellar mass predictions, with the pretrained model performing slightly better than a supervised model trained on the same limited simulations. Principal component analysis of the extracted features further reveals semantically meaningful structures, suggesting that the model enables unsupervised segmentation of star-forming regions without the need for labeled data or fine-tuning.
comment: 6 pages, 3 figures, 1 table, accepted for NeurIPS 2025 ML4PS workshop
☆ Fixed Point Neural Acceleration and Inverse Surrogate Model for Battery Parameter Identification
The rapid expansion of electric vehicles has intensified the need for accurate and efficient diagnosis of lithium-ion batteries. Parameter identification of electrochemical battery models is widely recognized as a powerful method for battery health assessment. However, conventional metaheuristic approaches suffer from high computational cost and slow convergence, and recent machine learning methods are limited by their reliance on constant current data, which may not be available in practice. To overcome these challenges, we propose deep learning-based framework for parameter identification of electrochemical battery models. The proposed framework combines a neural surrogate model of the single particle model with electrolyte (NeuralSPMe) and a deep learning-based fixed-point iteration method. NeuralSPMe is trained on realistic EV load profiles to accurately predict lithium concentration dynamics under dynamic operating conditions while a parameter update network (PUNet) performs fixed-point iterative updates to significantly reduce both the evaluation time per sample and the overall number of iterations required for convergence. Experimental evaluations demonstrate that the proposed framework accelerates the parameter identification by more than 2000 times, achieves superior sample efficiency and more than 10 times higher accuracy compared to conventional metaheuristic algorithms, particularly under dynamic load scenarios encountered in practical applications.
comment: 31 pages, 11 figures, submitted to Applied Energy
☆ Causal Convolutional Neural Networks as Finite Impulse Response Filters
This study investigates the behavior of Causal Convolutional Neural Networks (CNNs) with quasi-linear activation functions when applied to time-series data characterized by multimodal frequency content. We demonstrate that, once trained, such networks exhibit properties analogous to Finite Impulse Response (FIR) filters, particularly when the convolutional kernels are of extended length exceeding those typically employed in standard CNN architectures. Causal CNNs are shown to capture spectral features both implicitly and explicitly, offering enhanced interpretability for tasks involving dynamic systems. Leveraging the associative property of convolution, we further show that the entire network can be reduced to an equivalent single-layer filter resembling an FIR filter optimized via least-squares criteria. This equivalence yields new insights into the spectral learning behavior of CNNs trained on signals with sparse frequency content. The approach is validated on both simulated beam dynamics and real-world bridge vibration datasets, underlining its relevance for modeling and identifying physical systems governed by dynamic responses.
comment: 14 pages, 19 figures, Under review
☆ Graph-Guided Concept Selection for Efficient Retrieval-Augmented Generation
Graph-based RAG constructs a knowledge graph (KG) from text chunks to enhance retrieval in Large Language Model (LLM)-based question answering. It is especially beneficial in domains such as biomedicine, law, and political science, where effective retrieval often involves multi-hop reasoning over proprietary documents. However, these methods demand numerous LLM calls to extract entities and relations from text chunks, incurring prohibitive costs at scale. Through a carefully designed ablation study, we observe that certain words (termed concepts) and their associated documents are more important. Based on this insight, we propose Graph-Guided Concept Selection (G2ConS). Its core comprises a chunk selection method and an LLM-independent concept graph. The former selects salient document chunks to reduce KG construction costs; the latter closes knowledge gaps introduced by chunk selection at zero cost. Evaluations on multiple real-world datasets show that G2ConS outperforms all baselines in construction cost, retrieval effectiveness, and answering quality.
☆ HistoLens: An Interactive XAI Toolkit for Verifying and Mitigating Flaws in Vision-Language Models for Histopathology
For doctors to truly trust artificial intelligence, it can't be a black box. They need to understand its reasoning, almost as if they were consulting a colleague. We created HistoLens1 to be that transparent, collaborative partner. It allows a pathologist to simply ask a question in plain English about a tissue slide--just as they would ask a trainee. Our system intelligently translates this question into a precise query for its AI engine, which then provides a clear, structured report. But it doesn't stop there. If a doctor ever asks, "Why?", HistoLens can instantly provide a 'visual proof' for any finding--a heatmap that points to the exact cells and regions the AI used for its analysis. We've also ensured the AI focuses only on the patient's tissue, just like a trained pathologist would, by teaching it to ignore distracting background noise. The result is a workflow where the pathologist remains the expert in charge, using a trustworthy AI assistant to verify their insights and make faster, more confident diagnoses.
☆ Taming the Tail: NoI Topology Synthesis for Mixed DL Workloads on Chiplet-Based Accelerators
Heterogeneous chiplet-based systems improve scaling by disag-gregating CPUs/GPUs and emerging technologies (HBM/DRAM).However this on-package disaggregation introduces a latency inNetwork-on-Interposer(NoI). We observe that in modern large-modelinference, parameters and activations routinely move backand forth from HBM/DRAM, injecting large, bursty flows into theinterposer. These memory-driven transfers inflate tail latency andviolate Service Level Agreements (SLAs) across k-ary n-cube base-line NoI topologies. To address this gap we introduce an InterferenceScore (IS) that quantifies worst-case slowdown under contention.We then formulate NoI synthesis as a multi-objective optimization(MOO) problem. We develop PARL (Partition-Aware ReinforcementLearner), a topology generator that balances throughput, latency,and power. PARL-generated topologies reduce contention at the memory cut, meet SLAs, and cut worst-case slowdown to 1.2 times while maintaining competitive mean throughput relative to link-rich meshes. Overall, this reframes NoI design for heterogeneouschiplet accelerators with workload-aware objectives.
☆ Enhancing Pre-trained Representation Classifiability can Boost its Interpretability ICLR 2025
The visual representation of a pre-trained model prioritizes the classifiability on downstream tasks, while the widespread applications for pre-trained visual models have posed new requirements for representation interpretability. However, it remains unclear whether the pre-trained representations can achieve high interpretability and classifiability simultaneously. To answer this question, we quantify the representation interpretability by leveraging its correlation with the ratio of interpretable semantics within the representations. Given the pre-trained representations, only the interpretable semantics can be captured by interpretations, whereas the uninterpretable part leads to information loss. Based on this fact, we propose the Inherent Interpretability Score (IIS) that evaluates the information loss, measures the ratio of interpretable semantics, and quantifies the representation interpretability. In the evaluation of the representation interpretability with different classifiability, we surprisingly discover that the interpretability and classifiability are positively correlated, i.e., representations with higher classifiability provide more interpretable semantics that can be captured in the interpretations. This observation further supports two benefits to the pre-trained representations. First, the classifiability of representations can be further improved by fine-tuning with interpretability maximization. Second, with the classifiability improvement for the representations, we obtain predictions based on their interpretations with less accuracy degradation. The discovered positive correlation and corresponding applications show that practitioners can unify the improvements in interpretability and classifiability for pre-trained vision models. Codes are available at https://github.com/ssfgunner/IIS.
comment: ICLR 2025 (Spotlight)
☆ Learning Parameterized Skills from Demonstrations
We present DEPS, an end-to-end algorithm for discovering parameterized skills from expert demonstrations. Our method learns parameterized skill policies jointly with a meta-policy that selects the appropriate discrete skill and continuous parameters at each timestep. Using a combination of temporal variational inference and information-theoretic regularization methods, we address the challenge of degeneracy common in latent variable models, ensuring that the learned skills are temporally extended, semantically meaningful, and adaptable. We empirically show that learning parameterized skills from multitask expert demonstrations significantly improves generalization to unseen tasks. Our method outperforms multitask as well as skill learning baselines on both LIBERO and MetaWorld benchmarks. We also demonstrate that DEPS discovers interpretable parameterized skills, such as an object grasping skill whose continuous arguments define the grasp location.
comment: Neurips 2025
☆ Information-Theoretic Discrete Diffusion NeurIPS 2025
We present an information-theoretic framework for discrete diffusion models that yields principled estimators of log-likelihood using score-matching losses. Inspired by the I-MMSE identity for the Gaussian setup, we derive analogous results for the discrete setting. Specifically, we introduce the Information-Minimum Denoising Score Entropy (I-MDSE) relation, which links mutual information between data and its diffused version to the minimum denoising score entropy (DSE) loss. We extend this theory to masked diffusion and establish the Information-Minimum Denoising Cross-Entropy (I-MDCE) relation, connecting cross-entropy losses to mutual information in discrete masked processes. These results provide a time-integral decomposition of the log-likelihood of the data in terms of optimal score-based losses, showing that commonly used losses such as DSE and DCE are not merely variational bounds but tight and principled estimators of log-likelihood. The I-MDCE decomposition further enables practical extensions, including time-free formula, conditional likelihood estimation in prompt-response tasks, and coupled Monte Carlo estimation of likelihood ratios. Experiments on synthetic and real-world data confirm the accuracy, variance stability, and utility of our estimators. The code is publicly available at https://github.com/Dongjae0324/infodis.
comment: Accepted at NeurIPS 2025
☆ Modeling Electric Vehicle Car-Following Behavior: Classical vs Machine Learning Approach
The increasing adoption of electric vehicles (EVs) necessitates an understanding of their driving behavior to enhance traffic safety and develop smart driving systems. This study compares classical and machine learning models for EV car following behavior. Classical models include the Intelligent Driver Model (IDM), Optimum Velocity Model (OVM), Optimal Velocity Relative Velocity (OVRV), and a simplified CACC model, while the machine learning approach employs a Random Forest Regressor. Using a real world dataset of an EV following an internal combustion engine (ICE) vehicle under varied driving conditions, we calibrated classical model parameters by minimizing the RMSE between predictions and real data. The Random Forest model predicts acceleration using spacing, speed, and gap type as inputs. Results demonstrate the Random Forest's superior accuracy, achieving RMSEs of 0.0046 (medium gap), 0.0016 (long gap), and 0.0025 (extra long gap). Among physics based models, CACC performed best, with an RMSE of 2.67 for long gaps. These findings highlight the machine learning model's performance across all scenarios. Such models are valuable for simulating EV behavior and analyzing mixed autonomy traffic dynamics in EV integrated environments.
☆ Deep Learning-Enhanced Calibration of the Heston Model: A Unified Framework
The Heston stochastic volatility model is a widely used tool in financial mathematics for pricing European options. However, its calibration remains computationally intensive and sensitive to local minima due to the model's nonlinear structure and high-dimensional parameter space. This paper introduces a hybrid deep learning-based framework that enhances both the computational efficiency and the accuracy of the calibration procedure. The proposed approach integrates two supervised feedforward neural networks: the Price Approximator Network (PAN), which approximates the option price surface based on strike and moneyness inputs, and the Calibration Correction Network (CCN), which refines the Heston model's output by correcting systematic pricing errors. Experimental results on real S\&P 500 option data demonstrate that the deep learning approach outperforms traditional calibration techniques across multiple error metrics, achieving faster convergence and superior generalization in both in-sample and out-of-sample settings. This framework offers a practical and robust solution for real-time financial model calibration.
☆ FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic NeurIPS 2025
Low-bit floating-point (FP) formats, such as FP8, provide significant acceleration and memory savings in model training thanks to native hardware support on modern GPUs and NPUs. However, we analyze that FP8 quantization offers speedup primarily for large-dimensional matrix multiplications, while inherent quantization overheads diminish speedup when applied to low-rank adaptation (LoRA), which uses small-dimensional matrices for efficient fine-tuning of large language models (LLMs). To address this limitation, we propose FALQON, a novel framework that eliminates the quantization overhead from separate LoRA computational paths by directly merging LoRA adapters into an FP8-quantized backbone during fine-tuning. Furthermore, we reformulate the forward and backward computations for merged adapters to significantly reduce quantization overhead, and introduce a row-wise proxy update mechanism that efficiently integrates substantial updates into the quantized backbone. Experimental evaluations demonstrate that FALQON achieves approximately a 3$\times$ training speedup over existing quantized LoRA methods with a similar level of accuracy, providing a practical solution for efficient large-scale model fine-tuning. Moreover, FALQON's end-to-end FP8 workflow removes the need for post-training quantization, facilitating efficient deployment. Code is available at https://github.com/iamkanghyunchoi/falqon.
comment: NeurIPS 2025
☆ PULSE: Privileged Knowledge Transfer from Electrodermal Activity to Low-Cost Sensors for Stress Monitoring ML4H 2025
Electrodermal activity (EDA), the primary signal for stress detection, requires costly hardware often unavailable in real-world wearables. In this paper, we propose PULSE, a framework that utilizes EDA exclusively during self-supervised pretraining, while enabling inference without EDA but with more readily available modalities such as ECG, BVP, ACC, and TEMP. Our approach separates encoder outputs into shared and private embeddings. We align shared embeddings across modalities and fuse them into a modality-invariant representation. The private embeddings carry modality-specific information to support the reconstruction objective. Pretraining is followed by knowledge transfer where a frozen EDA teacher transfers sympathetic-arousal representations into student encoders. On WESAD, our method achieves strong stress-detection performance, showing that representations of privileged EDA can be transferred to low-cost sensors to improve accuracy while reducing hardware cost.
comment: Accepted as a finders paper at ML4H 2025
☆ Copula-Stein Discrepancy: A Generator-Based Stein Operator for Archimedean Dependence
Kernel Stein discrepancies (KSDs) have become a principal tool for goodness-of-fit testing, but standard KSDs are often insensitive to higher-order dependency structures, such as tail dependence, which are critical in many scientific and financial domains. We address this gap by introducing the Copula-Stein Discrepancy (CSD), a novel class of discrepancies tailored to the geometry of statistical dependence. By defining a Stein operator directly on the copula density, CSD leverages the generative structure of dependence, rather than relying on the joint density's score function. For the broad class of Archimedean copulas, this approach yields a closed-form Stein kernel derived from the scalar generator function. We provide a comprehensive theoretical analysis, proving that CSD (i) metrizes weak convergence of copula distributions, ensuring it detects any mismatch in dependence; (ii) has an empirical estimator that converges at the minimax optimal rate of $O_P(n^{-1/2})$; and (iii) is provably sensitive to differences in tail dependence coefficients. The framework is extended to general non-Archimedean copulas, including elliptical and vine copulas. Computationally, the exact CSD kernel evaluation scales linearly in dimension, while a novel random feature approximation reduces the $n$-dependence from quadratic $O(n^2)$ to near-linear $\tilde{O}(n)$, making CSD a practical and theoretically principled tool for dependence-aware inference.
☆ Language-Conditioned Representations and Mixture-of-Experts Policy for Robust Multi-Task Robotic Manipulation
Perceptual ambiguity and task conflict limit multitask robotic manipulation via imitation learning. We propose a framework combining a Language-Conditioned Visual Representation (LCVR) module and a Language-conditioned Mixture-ofExperts Density Policy (LMoE-DP). LCVR resolves perceptual ambiguities by grounding visual features with language instructions, enabling differentiation between visually similar tasks. To mitigate task conflict, LMoE-DP uses a sparse expert architecture to specialize in distinct, multimodal action distributions, stabilized by gradient modulation. On real-robot benchmarks, LCVR boosts Action Chunking with Transformers (ACT) and Diffusion Policy (DP) success rates by 33.75% and 25%, respectively. The full framework achieves a 79% average success, outperforming the advanced baseline by 21%. Our work shows that combining semantic grounding and expert specialization enables robust, efficient multi-task manipulation
comment: 8 pages
☆ Low-N Protein Activity Optimization with FolDE
Proteins are traditionally optimized through the costly construction and measurement of many mutants. Active Learning-assisted Directed Evolution (ALDE) alleviates that cost by predicting the best improvements and iteratively testing mutants to inform predictions. However, existing ALDE methods face a critical limitation: selecting the highest-predicted mutants in each round yields homogeneous training data insufficient for accurate prediction models in subsequent rounds. Here we present FolDE, an ALDE method designed to maximize end-of-campaign success. In simulations across 20 protein targets, FolDE discovers 23% more top 10% mutants than the best baseline ALDE method (p=0.005) and is 55% more likely to find top 1% mutants. FolDE achieves this primarily through naturalness-based warm-starting, which augments limited activity measurements with protein language model outputs to improve activity prediction. We also introduce a constant-liar batch selector, which improves batch diversity; this is important in multi-mutation campaigns but had limited effect in our benchmarks. The complete workflow is freely available as open-source software, making efficient protein optimization accessible to any laboratory.
comment: 18 pages, 4 figures. Preprint. Open-source software available at https://github.com/JBEI/foldy
☆ Learning from History: A Retrieval-Augmented Framework for Spatiotemporal Prediction
Accurate and long-term spatiotemporal prediction for complex physical systems remains a fundamental challenge in scientific computing. While deep learning models, as powerful parametric approximators, have shown remarkable success, they suffer from a critical limitation: the accumulation of errors during long-term autoregressive rollouts often leads to physically implausible artifacts. This deficiency arises from their purely parametric nature, which struggles to capture the full constraints of a system's intrinsic dynamics. To address this, we introduce a novel \textbf{Retrieval-Augmented Prediction (RAP)} framework, a hybrid paradigm that synergizes the predictive power of deep networks with the grounded truth of historical data. The core philosophy of RAP is to leverage historical evolutionary exemplars as a non-parametric estimate of the system's local dynamics. For any given state, RAP efficiently retrieves the most similar historical analog from a large-scale database. The true future evolution of this analog then serves as a \textbf{reference target}. Critically, this target is not a hard constraint in the loss function but rather a powerful conditional input to a specialized dual-stream architecture. It provides strong \textbf{dynamic guidance}, steering the model's predictions towards physically viable trajectories. In extensive benchmarks across meteorology, turbulence, and fire simulation, RAP not only surpasses state-of-the-art methods but also significantly outperforms a strong \textbf{analog-only forecasting baseline}. More importantly, RAP generates predictions that are more physically realistic by effectively suppressing error divergence in long-term rollouts.
☆ Causal-Aware Generative Adversarial Networks with Reinforcement Learning
The utility of tabular data for tasks ranging from model training to large-scale data analysis is often constrained by privacy concerns or regulatory hurdles. While existing data generation methods, particularly those based on Generative Adversarial Networks (GANs), have shown promise, they frequently struggle with capturing complex causal relationship, maintaining data utility, and providing provable privacy guarantees suitable for enterprise deployment. We introduce CA-GAN, a novel generative framework specifically engineered to address these challenges for real-world tabular datasets. CA-GAN utilizes a two-step approach: causal graph extraction to learn a robust, comprehensive causal relationship in the data's manifold, followed by a custom Conditional WGAN-GP (Wasserstein GAN with Gradient Penalty) that operates exclusively as per the structure of nodes in the causal graph. More importantly, the generator is trained with a new Reinforcement Learning-based objective that aligns the causal graphs constructed from real and fake data, ensuring the causal awareness in both training and sampling phases. We demonstrate CA-GAN superiority over six SOTA methods across 14 tabular datasets. Our evaluations, focused on core data engineering metrics: causal preservation, utility preservation, and privacy preservation. Our method offers a practical, high-performance solution for data engineers seeking to create high-quality, privacy-compliant synthetic datasets to benchmark database systems, accelerate software development, and facilitate secure data-driven research.
☆ Mitigating Negative Transfer via Reducing Environmental Disagreement
Unsupervised Domain Adaptation~(UDA) focuses on transferring knowledge from a labeled source domain to an unlabeled target domain, addressing the challenge of \emph{domain shift}. Significant domain shifts hinder effective knowledge transfer, leading to \emph{negative transfer} and deteriorating model performance. Therefore, mitigating negative transfer is essential. This study revisits negative transfer through the lens of causally disentangled learning, emphasizing cross-domain discriminative disagreement on non-causal environmental features as a critical factor. Our theoretical analysis reveals that overreliance on non-causal environmental features as the environment evolves can cause discriminative disagreements~(termed \emph{environmental disagreement}), thereby resulting in negative transfer. To address this, we propose Reducing Environmental Disagreement~(RED), which disentangles each sample into domain-invariant causal features and domain-specific non-causal environmental features via adversarially training domain-specific environmental feature extractors in the opposite domains. Subsequently, RED estimates and reduces environmental disagreement based on domain-specific non-causal environmental features. Experimental results confirm that RED effectively mitigates negative transfer and achieves state-of-the-art performance.
comment: 13 pages, 5 figures
☆ Localized Kernel Projection Outlyingness: A Two-Stage Approach for Multi-Modal Outlier Detection
This paper presents Two-Stage LKPLO, a novel multi-stage outlier detection framework that overcomes the coexisting limitations of conventional projection-based methods: their reliance on a fixed statistical metric and their assumption of a single data structure. Our framework uniquely synthesizes three key concepts: (1) a generalized loss-based outlyingness measure (PLO) that replaces the fixed metric with flexible, adaptive loss functions like our proposed SVM-like loss; (2) a global kernel PCA stage to linearize non-linear data structures; and (3) a subsequent local clustering stage to handle multi-modal distributions. Comprehensive 5-fold cross-validation experiments on 10 benchmark datasets, with automated hyperparameter optimization, demonstrate that Two-Stage LKPLO achieves state-of-the-art performance. It significantly outperforms strong baselines on datasets with challenging structures where existing methods fail, most notably on multi-cluster data (Optdigits) and complex, high-dimensional data (Arrhythmia). Furthermore, an ablation study empirically confirms that the synergistic combination of both the kernelization and localization stages is indispensable for its superior performance. This work contributes a powerful new tool for a significant class of outlier detection problems and underscores the importance of hybrid, multi-stage architectures.
comment: 10 pages, 4 figures; submitted to The IEICE Transactions on Information and Systems
☆ Geometric Algorithms for Neural Combinatorial Optimization with Constraints
Self-Supervised Learning (SSL) for Combinatorial Optimization (CO) is an emerging paradigm for solving combinatorial problems using neural networks. In this paper, we address a central challenge of SSL for CO: solving problems with discrete constraints. We design an end-to-end differentiable framework that enables us to solve discrete constrained optimization problems with neural networks. Concretely, we leverage algorithmic techniques from the literature on convex geometry and Carath\'eodory's theorem to decompose neural network outputs into convex combinations of polytope corners that correspond to feasible sets. This decomposition-based approach enables self-supervised training but also ensures efficient quality-preserving rounding of the neural net output into feasible solutions. Extensive experiments in cardinality-constrained optimization show that our approach can consistently outperform neural baselines. We further provide worked-out examples of how our method can be applied beyond cardinality-constrained problems to a diverse set of combinatorial optimization tasks, including finding independent sets in graphs, and solving matroid-constrained problems.
☆ Kernelized Sparse Fine-Tuning with Bi-level Parameter Competition for Vision Models
Parameter-efficient fine-tuning (PEFT) aims to adapt pre-trained vision models to downstream tasks. Among PEFT paradigms, sparse tuning achieves remarkable performance by adjusting only the weights most relevant to downstream tasks, rather than densely tuning the entire weight matrix. Current methods follow a two-stage paradigm. First, it locates task-relevant weights by gradient information, which overlooks the parameter adjustments during fine-tuning and limits the performance. Second, it updates only the located weights by applying a sparse mask to the gradient of the weight matrix, which results in high memory usage due to the storage of all weight matrices in the optimizer. In this paper, we propose a one-stage method named SNELLA to overcome the above limitations. For memory usage, SNELLA selectively updates the weight matrix by adding it to another sparse matrix that is merged by two low-rank learnable matrices. We extend the low-rank decomposition by introducing nonlinear kernel functions, thereby increasing the rank of the resulting merged matrix to prevent the interdependency among weight updates, enabling better adaptation to downstream tasks. For locating task-relevant weights, we propose an adaptive bi-level sparsity allocation mechanism that encourages weights to compete across and inside layers based on their importance scores in an end-to-end manner. Extensive experiments are conducted on classification, segmentation, and generation tasks using different pre-trained vision models. The results show that SNELLA achieves SOTA performance with low memory usage. Notably, SNELLA obtains 1.8% (91.9% v.s. 90.1%) higher Top-1 accuracy on the FGVC benchmark compared to SPT-LoRA. Compared to previous methods, SNELLA achieves a memory reduction of 31.1%-39.9% across models with parameter scales from 86M to 632M. Our source codes are available at https://github.com/ssfgunner/SNELL.
☆ GraphNet: A Large-Scale Computational Graph Dataset for Tensor Compiler Research
We introduce GraphNet, a dataset of 2.7K real-world deep learning computational graphs with rich metadata, spanning six major task categories across multiple deep learning frameworks. To evaluate tensor compiler performance on these samples, we propose the benchmark metric Speedup Score S(t), which jointly considers runtime speedup and execution correctness under tunable tolerance levels, offering a reliable measure of general optimization capability. Furthermore, we extend S(t) to the Error-aware Speedup Score ES(t), which incorporates error information and helps compiler developers identify key performance bottlenecks. In this report, we benchmark the default tensor compilers, CINN for PaddlePaddle and TorchInductor for PyTorch, on computer vision (CV) and natural language processing (NLP) samples to demonstrate the practicality of GraphNet. The full construction pipeline with graph extraction and compiler evaluation tools is available at https://github.com/PaddlePaddle/GraphNet .
☆ Improved Accuracy of Robot Localization Using 3-D LiDAR in a Hippocampus-Inspired Model
Boundary Vector Cells (BVCs) are a class of neurons in the brains of vertebrates that encode environmental boundaries at specific distances and allocentric directions, playing a central role in forming place fields in the hippocampus. Most computational BVC models are restricted to two-dimensional (2D) environments, making them prone to spatial ambiguities in the presence of horizontal symmetries in the environment. To address this limitation, we incorporate vertical angular sensitivity into the BVC framework, thereby enabling robust boundary detection in three dimensions, and leading to significantly more accurate spatial localization in a biologically-inspired robot model. The proposed model processes LiDAR data to capture vertical contours, thereby disambiguating locations that would be indistinguishable under a purely 2D representation. Experimental results show that in environments with minimal vertical variation, the proposed 3D model matches the performance of a 2D baseline; yet, as 3D complexity increases, it yields substantially more distinct place fields and markedly reduces spatial aliasing. These findings show that adding a vertical dimension to BVC-based localization can significantly enhance navigation and mapping in real-world 3D spaces while retaining performance parity in simpler, near-planar scenarios.
comment: 8 pages, 9 figures, Presented at the 2025 International Joint Conference on Neural Networks, Rome, July 2025
☆ Spatio-temporal Multivariate Time Series Forecast with Chosen Variables
Spatio-Temporal Multivariate time series Forecast (STMF) uses the time series of $n$ spatially distributed variables in a period of recent past to forecast their values in a period of near future. It has important applications in spatio-temporal sensing forecast such as road traffic prediction and air pollution prediction. Recent papers have addressed a practical problem of missing variables in the model input, which arises in the sensing applications where the number $m$ of sensors is far less than the number $n$ of locations to be monitored, due to budget constraints. We observe that the state of the art assumes that the $m$ variables (i.e., locations with sensors) in the model input are pre-determined and the important problem of how to choose the $m$ variables in the input has never been studied. This paper fills the gap by studying a new problem of STMF with chosen variables, which optimally selects $m$-out-of-$n$ variables for the model input in order to maximize the forecast accuracy. We propose a unified framework that jointly performs variable selection and model optimization for both forecast accuracy and model efficiency. It consists of three novel technical components: (1) masked variable-parameter pruning, which progressively prunes less informative variables and attention parameters through quantile-based masking; (2) prioritized variable-parameter replay, which replays low-loss past samples to preserve learned knowledge for model stability; (3) dynamic extrapolation mechanism, which propagates information from variables selected for the input to all other variables via learnable spatial embeddings and adjacency information. Experiments on five real-world datasets show that our work significantly outperforms the state-of-the-art baselines in both accuracy and efficiency, demonstrating the effectiveness of joint variable selection and model optimization.
comment: In submission
☆ Efficient Global-Local Fusion Sampling for Physics-Informed Neural Networks
The accuracy of Physics-Informed Neural Networks (PINNs) critically depends on the placement of collocation points, as the PDE loss is approximated through sampling over the solution domain. Global sampling ensures stability by covering the entire domain but requires many samples and is computationally expensive, whereas local sampling improves efficiency by focusing on high-residual regions but may neglect well-learned areas, reducing robustness. We propose a Global-Local Fusion (GLF) Sampling Strategy that combines the strengths of both approaches. Specifically, new collocation points are generated by perturbing training points with Gaussian noise scaled inversely to the residual, thereby concentrating samples in difficult regions while preserving exploration. To further reduce computational overhead, a lightweight linear surrogate is introduced to approximate the global residual-based distribution, achieving similar effectiveness at a fraction of the cost. Together, these components, residual-adaptive sampling and residual-based approximation, preserve the stability of global methods while retaining the efficiency of local refinement. Extensive experiments on benchmark PDEs demonstrate that GLF consistently improves both accuracy and efficiency compared with global and local sampling strategies. This study provides a practical and scalable framework for enhancing the reliability and efficiency of PINNs in solving complex and high-dimensional PDEs.
☆ NeuroPathNet: Dynamic Path Trajectory Learning for Brain Functional Connectivity Analysis
Understanding the evolution of brain functional networks over time is of great significance for the analysis of cognitive mechanisms and the diagnosis of neurological diseases. Existing methods often have difficulty in capturing the temporal evolution characteristics of connections between specific functional communities. To this end, this paper proposes a new path-level trajectory modeling framework (NeuroPathNet) to characterize the dynamic behavior of connection pathways between brain functional partitions. Based on medically supported static partitioning schemes (such as Yeo and Smith ICA), we extract the time series of connection strengths between each pair of functional partitions and model them using a temporal neural network. We validate the model performance on three public functional Magnetic Resonance Imaging (fMRI) datasets, and the results show that it outperforms existing mainstream methods in multiple indicators. This study can promote the development of dynamic graph learning methods for brain network analysis, and provide possible clinical applications for the diagnosis of neurological diseases.
☆ Discovering Heuristics with Large Language Models (LLMs) for Mixed-Integer Programs: Single-Machine Scheduling
Our study contributes to the scheduling and combinatorial optimization literature with new heuristics discovered by leveraging the power of Large Language Models (LLMs). We focus on the single-machine total tardiness (SMTT) problem, which aims to minimize total tardiness by sequencing n jobs on a single processor without preemption, given processing times and due dates. We develop and benchmark two novel LLM-discovered heuristics, the EDD Challenger (EDDC) and MDD Challenger (MDDC), inspired by the well-known Earliest Due Date (EDD) and Modified Due Date (MDD) rules. In contrast to prior studies that employed simpler rule-based heuristics, we evaluate our LLM-discovered algorithms using rigorous criteria, including optimality gaps and solution time derived from a mixed-integer programming (MIP) formulation of SMTT. We compare their performance against state-of-the-art heuristics and exact methods across various job sizes (20, 100, 200, and 500 jobs). For instances with more than 100 jobs, exact methods such as MIP and dynamic programming become computationally intractable. Up to 500 jobs, EDDC improves upon the classic EDD rule and another widely used algorithm in the literature. MDDC consistently outperforms traditional heuristics and remains competitive with exact approaches, particularly on larger and more complex instances. This study shows that human-LLM collaboration can produce scalable, high-performing heuristics for NP-hard constrained combinatorial optimization, even under limited resources when effectively configured.
☆ Training-Free Safe Text Embedding Guidance for Text-to-Image Diffusion Models NeurIPS 2025
Text-to-image models have recently made significant advances in generating realistic and semantically coherent images, driven by advanced diffusion models and large-scale web-crawled datasets. However, these datasets often contain inappropriate or biased content, raising concerns about the generation of harmful outputs when provided with malicious text prompts. We propose Safe Text embedding Guidance (STG), a training-free approach to improve the safety of diffusion models by guiding the text embeddings during sampling. STG adjusts the text embeddings based on a safety function evaluated on the expected final denoised image, allowing the model to generate safer outputs without additional training. Theoretically, we show that STG aligns the underlying model distribution with safety constraints, thereby achieving safer outputs while minimally affecting generation quality. Experiments on various safety scenarios, including nudity, violence, and artist-style removal, show that STG consistently outperforms both training-based and training-free baselines in removing unsafe content while preserving the core semantic intent of input prompts. Our code is available at https://github.com/aailab-kaist/STG.
comment: Accepted at NeurIPS 2025
☆ Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks NeurIPS 2025
Foundation models have enabled rapid progress across many specialized domains by leveraging large-scale pre-training on unlabeled data, demonstrating strong generalization to a variety of downstream tasks. While such models have gained significant attention in fields like Earth Observation, their application to Mars science remains limited. A key enabler of progress in other domains has been the availability of standardized benchmarks that support systematic evaluation. In contrast, Mars science lacks such benchmarks and standardized evaluation frameworks, which have limited progress toward developing foundation models for Martian tasks. To address this gap, we introduce Mars-Bench, the first benchmark designed to systematically evaluate models across a broad range of Mars-related tasks using both orbital and surface imagery. Mars-Bench comprises 20 datasets spanning classification, segmentation, and object detection, focused on key geologic features such as craters, cones, boulders, and frost. We provide standardized, ready-to-use datasets and baseline evaluations using models pre-trained on natural images, Earth satellite data, and state-of-the-art vision-language models. Results from all analyses suggest that Mars-specific foundation models may offer advantages over general-domain counterparts, motivating further exploration of domain-adapted pre-training. Mars-Bench aims to establish a standardized foundation for developing and comparing machine learning models for Mars science. Our data, models, and code are available at: https://mars-bench.github.io/.
comment: Accepted at NeurIPS 2025
☆ Auto-Adaptive PINNs with Applications to Phase Transitions
We propose an adaptive sampling method for the training of Physics Informed Neural Networks (PINNs) which allows for sampling based on an arbitrary problem-specific heuristic which may depend on the network and its gradients. In particular we focus our analysis on the Allen-Cahn equations, attempting to accurately resolve the characteristic interfacial regions using a PINN without any post-hoc resampling. In experiments, we show the effectiveness of these methods over residual-adaptive frameworks.
☆ Predicting Barge Tow Size on Inland Waterways Using Vessel Trajectory Derived Features: Proof of Concept
Accurate, real-time estimation of barge quantity on inland waterways remains a critical challenge due to the non-self-propelled nature of barges and the limitations of existing monitoring systems. This study introduces a novel method to use Automatic Identification System (AIS) vessel tracking data to predict the number of barges in tow using Machine Learning (ML). To train and test the model, barge instances were manually annotated from satellite scenes across the Lower Mississippi River. Labeled images were matched to AIS vessel tracks using a spatiotemporal matching procedure. A comprehensive set of 30 AIS-derived features capturing vessel geometry, dynamic movement, and trajectory patterns were created and evaluated using Recursive Feature Elimination (RFE) to identify the most predictive variables. Six regression models, including ensemble, kernel-based, and generalized linear approaches, were trained and evaluated. The Poisson Regressor model yielded the best performance, achieving a Mean Absolute Error (MAE) of 1.92 barges using 12 of the 30 features. The feature importance analysis revealed that metrics capturing vessel maneuverability such as course entropy, speed variability and trip length were most predictive of barge count. The proposed approach provides a scalable, readily implementable method for enhancing Maritime Domain Awareness (MDA), with strong potential applications in lock scheduling, port management, and freight planning. Future work will expand the proof of concept presented here to explore model transferability to other inland rivers with differing operational and environmental conditions.
☆ Optimal Arm Elimination Algorithms for Combinatorial Bandits
Combinatorial bandits extend the classical bandit framework to settings where the learner selects multiple arms in each round, motivated by applications such as online recommendation and assortment optimization. While extensions of upper confidence bound (UCB) algorithms arise naturally in this context, adapting arm elimination methods has proved more challenging. We introduce a novel elimination scheme that partitions arms into three categories (confirmed, active, and eliminated), and incorporates explicit exploration to update these sets. We demonstrate the efficacy of our algorithm in two settings: the combinatorial multi-armed bandit with general graph feedback, and the combinatorial linear contextual bandit. In both cases, our approach achieves near-optimal regret, whereas UCB-based methods can provably fail due to insufficient explicit exploration. Matching lower bounds are also provided.
☆ STNet: Spectral Transformation Network for Solving Operator Eigenvalue Problem
Operator eigenvalue problems play a critical role in various scientific fields and engineering applications, yet numerical methods are hindered by the curse of dimensionality. Recent deep learning methods provide an efficient approach to address this challenge by iteratively updating neural networks. These methods' performance relies heavily on the spectral distribution of the given operator: larger gaps between the operator's eigenvalues will improve precision, thus tailored spectral transformations that leverage the spectral distribution can enhance their performance. Based on this observation, we propose the Spectral Transformation Network (STNet). During each iteration, STNet uses approximate eigenvalues and eigenfunctions to perform spectral transformations on the original operator, turning it into an equivalent but easier problem. Specifically, we employ deflation projection to exclude the subspace corresponding to already solved eigenfunctions, thereby reducing the search space and avoiding converging to existing eigenfunctions. Additionally, our filter transform magnifies eigenvalues in the desired region and suppresses those outside, further improving performance. Extensive experiments demonstrate that STNet consistently outperforms existing learning-based methods, achieving state-of-the-art performance in accuracy.
☆ Score-based constrained generative modeling via Langevin diffusions with boundary conditions
Score-based generative models based on stochastic differential equations (SDEs) achieve impressive performance in sampling from unknown distributions, but often fail to satisfy underlying constraints. We propose a constrained generative model using kinetic (underdamped) Langevin dynamics with specular reflection of velocity on the boundary defining constraints. This results in piecewise continuously differentiable noising and denoising process where the latter is characterized by a time-reversed dynamics restricted to a domain with boundary due to specular boundary condition. In addition, we also contribute to existing reflected SDEs based constrained generative models, where the stochastic dynamics is restricted through an abstract local time term. By presenting efficient numerical samplers which converge with optimal rate in terms of discretizations step, we provide a comprehensive comparison of models based on confined (specularly reflected kinetic) Langevin diffusion with models based on reflected diffusion with local time.
☆ HyperGraphX: Graph Transductive Learning with Hyperdimensional Computing and Message Passing
We present a novel algorithm, \hdgc, that marries graph convolution with binding and bundling operations in hyperdimensional computing for transductive graph learning. For prediction accuracy \hdgc outperforms major and popular graph neural network implementations as well as state-of-the-art hyperdimensional computing implementations for a collection of homophilic graphs and heterophilic graphs. Compared with the most accurate learning methodologies we have tested, on the same target GPU platform, \hdgc is on average 9561.0 and 144.5 times faster than \gcnii, a graph neural network implementation and HDGL, a hyperdimensional computing implementation, respectively. As the majority of the learning operates on binary vectors, we expect outstanding energy performance of \hdgc on neuromorphic and emerging process-in-memory devices.
☆ Synergistic Neural Forecasting of Air Pollution with Stochastic Sampling
Air pollution remains a leading global health and environmental risk, particularly in regions vulnerable to episodic air pollution spikes due to wildfires, urban haze and dust storms. Accurate forecasting of particulate matter (PM) concentrations is essential to enable timely public health warnings and interventions, yet existing models often underestimate rare but hazardous pollution events. Here, we present SynCast, a high-resolution neural forecasting model that integrates meteorological and air composition data to improve predictions of both average and extreme pollution levels. Built on a regionally adapted transformer backbone and enhanced with a diffusion-based stochastic refinement module, SynCast captures the nonlinear dynamics driving PM spikes more accurately than existing approaches. Leveraging on harmonized ERA5 and CAMS datasets, our model shows substantial gains in forecasting fidelity across multiple PM variables (PM$_1$, PM$_{2.5}$, PM$_{10}$), especially under extreme conditions. We demonstrate that conventional loss functions underrepresent distributional tails (rare pollution events) and show that SynCast, guided by domain-aware objectives and extreme value theory, significantly enhances performance in highly impacted regions without compromising global accuracy. This approach provides a scalable foundation for next-generation air quality early warning systems and supports climate-health risk mitigation in vulnerable regions.
☆ Diffusion Adaptive Text Embedding for Text-to-Image Diffusion Models NeurIPS 2025
Text-to-image diffusion models rely on text embeddings from a pre-trained text encoder, but these embeddings remain fixed across all diffusion timesteps, limiting their adaptability to the generative process. We propose Diffusion Adaptive Text Embedding (DATE), which dynamically updates text embeddings at each diffusion timestep based on intermediate perturbed data. We formulate an optimization problem and derive an update rule that refines the text embeddings at each sampling step to improve alignment and preference between the mean predicted image and the text. This allows DATE to dynamically adapts the text conditions to the reverse-diffused images throughout diffusion sampling without requiring additional model training. Through theoretical analysis and empirical results, we show that DATE maintains the generative capability of the model while providing superior text-image alignment over fixed text embeddings across various tasks, including multi-concept generation and text-guided image editing. Our code is available at https://github.com/aailab-kaist/DATE.
comment: Accepted at NeurIPS 2025
☆ An efficient probabilistic hardware architecture for diffusion-like models
The proliferation of probabilistic AI has promoted proposals for specialized stochastic computers. Despite promising efficiency gains, these proposals have failed to gain traction because they rely on fundamentally limited modeling techniques and exotic, unscalable hardware. In this work, we address these shortcomings by proposing an all-transistor probabilistic computer that implements powerful denoising models at the hardware level. A system-level analysis indicates that devices based on our architecture could achieve performance parity with GPUs on a simple image benchmark using approximately 10,000 times less energy.
comment: 9 pages, 6 figures
☆ A Pragmatic Way to Measure Chain-of-Thought Monitorability
While Chain-of-Thought (CoT) monitoring offers a unique opportunity for AI safety, this opportunity could be lost through shifts in training practices or model architecture. To help preserve monitorability, we propose a pragmatic way to measure two components of it: legibility (whether the reasoning can be followed by a human) and coverage (whether the CoT contains all the reasoning needed for a human to also produce the final output). We implement these metrics with an autorater prompt that enables any capable LLM to compute the legibility and coverage of existing CoTs. After sanity-checking our prompted autorater with synthetic CoT degradations, we apply it to several frontier models on challenging benchmarks, finding that they exhibit high monitorability. We present these metrics, including our complete autorater prompt, as a tool for developers to track how design decisions impact monitorability. While the exact prompt we share is still a preliminary version under ongoing development, we are sharing it now in the hopes that others in the community will find it useful. Our method helps measure the default monitorability of CoT - it should be seen as a complement, not a replacement, for the adversarial stress-testing needed to test robustness against deliberately evasive models.
comment: The first two authors contributed equally
☆ The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity
Traditional LLM alignment methods are vulnerable to heterogeneity in human preferences. Fitting a na\"ive probabilistic model to pairwise comparison data (say over prompt-completion pairs) yields an inconsistent estimate of the population-average utility -a canonical measure of social welfare. We propose a new method, dubbed the sign estimator, that provides a simple, provably consistent, and efficient estimator by replacing cross-entropy with binary classification loss in the aggregation step. This simple modification recovers consistent ordinal alignment under mild assumptions and achieves the first polynomial finite-sample error bounds in this setting. In realistic simulations of LLM alignment using digital twins, the sign estimator substantially reduces preference distortion over a panel of simulated personas, cutting (angular) estimation error by nearly 35% and decreasing disagreement with true population preferences from 12% to 8% compared to standard RLHF. Our method also compares favorably to panel data heuristics that explicitly model user heterogeneity and require tracking individual-level preference data-all while maintaining the implementation simplicity of existing LLM alignment pipelines.
☆ ChessQA: Evaluating Large Language Models for Chess Understanding
Chess provides an ideal testbed for evaluating the reasoning, modeling, and abstraction capabilities of large language models (LLMs), as it has well-defined structure and objective ground truth while admitting a wide spectrum of skill levels. However, existing evaluations of LLM ability in chess are ad hoc and narrow in scope, making it difficult to accurately measure LLM chess understanding and how it varies with scale, post-training methodologies, or architecture choices. We present ChessQA, a comprehensive benchmark that assesses LLM chess understanding across five task categories (Structural, Motifs, Short Tactics, Position Judgment, and Semantic), which approximately correspond to the ascending abstractions that players master as they accumulate chess knowledge, from understanding basic rules and learning tactical motifs to correctly calculating tactics, evaluating positions, and semantically describing high-level concepts. In this way, ChessQA captures a more comprehensive picture of chess ability and understanding, going significantly beyond the simple move quality evaluations done previously, and offers a controlled, consistent setting for diagnosis and comparison. Furthermore, ChessQA is inherently dynamic, with prompts, answer keys, and construction scripts that can evolve as models improve. Evaluating a range of contemporary LLMs, we find persistent weaknesses across all five categories and provide results and error analyses by category. We will release the code, periodically refreshed datasets, and a public leaderboard to support further research.
comment: 33 pages,8 figures
☆ Automating Benchmark Design
The rapid progress and widespread deployment of LLMs and LLM-powered agents has outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but these quickly become saturated. In contrast, dynamic benchmarks evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment design principles to automate the process of dynamic benchmark design. BeTaL works by parameterizing key design choices in base benchmark templates and uses LLMs to reason through the resulting parameter space to obtain target properties (such as difficulty and realism) in a cost-efficient manner. We validate this approach on its ability to create benchmarks with desired difficulty levels. Using BeTaL, we create two new benchmarks and extend a popular agentic benchmark $\tau$-bench. Extensive evaluation on these three tasks and multiple target difficulty levels shows that BeTaL produces benchmarks much closer to the desired difficulty, with average deviations ranging from 5.3% to 13.2% -- a 2-4x improvement over the baselines.
☆ Graph Distance Based on Cause-Effect Estimands with Latents
Causal discovery aims to recover graphs that represent causal relations among given variables from observations, and new methods are constantly being proposed. Increasingly, the community raises questions about how much progress is made, because properly evaluating discovered graphs remains notoriously difficult, particularly under latent confounding. We propose a graph distance measure for acyclic directed mixed graphs (ADMGs) based on the downstream task of cause-effect estimation under unobserved confounding. Our approach uses identification via fixing and a symbolic verifier to quantify how graph differences distort cause-effect estimands for different treatment-outcome pairs. We analyze the behavior of the measure under different graph perturbations and compare it against existing distance metrics.
☆ Machine Learning based Analysis for Radiomics Features Robustness in Real-World Deployment Scenarios
Radiomics-based machine learning models show promise for clinical decision support but are vulnerable to distribution shifts caused by variations in imaging protocols, positioning, and segmentation. This study systematically investigates the robustness of radiomics-based machine learning models under distribution shifts across five MRI sequences. We evaluated how different acquisition protocols and segmentation strategies affect model reliability in terms of predictive power and uncertainty-awareness. Using a phantom of 16 fruits, we evaluated distribution shifts through: (1) protocol variations across T2-HASTE, T2-TSE, T2-MAP, T1-TSE, and T2-FLAIR sequences; (2) segmentation variations (full, partial, rotated); and (3) inter-observer variability. We trained XGBoost classifiers on 8 consistent robust features versus sequence-specific features, testing model performance under in-domain and out-of-domain conditions. Results demonstrate that models trained on protocol-invariant features maintain F1-scores >0.85 across distribution shifts, while models using all features showed 40% performance degradation under protocol changes. Dataset augmentation substantially improved the quality of uncertainty estimates and reduced the expected calibration error (ECE) by 35% without sacrificing accuracy. Temperature scaling provided minimal calibration benefits, confirming XGBoost's inherent reliability. Our findings reveal that protocol-aware feature selection and controlled phantom studies effectively predict model behavior under distribution shifts, providing a framework for developing robust radiomics models resilient to real-world protocol variations.
☆ Secure Retrieval-Augmented Generation against Poisoning Attacks
Large language models (LLMs) have transformed natural language processing (NLP), enabling applications from content generation to decision support. Retrieval-Augmented Generation (RAG) improves LLMs by incorporating external knowledge but also introduces security risks, particularly from data poisoning, where the attacker injects poisoned texts into the knowledge database to manipulate system outputs. While various defenses have been proposed, they often struggle against advanced attacks. To address this, we introduce RAGuard, a detection framework designed to identify poisoned texts. RAGuard first expands the retrieval scope to increase the proportion of clean texts, reducing the likelihood of retrieving poisoned content. It then applies chunk-wise perplexity filtering to detect abnormal variations and text similarity filtering to flag highly similar texts. This non-parametric approach enhances RAG security, and experiments on large-scale datasets demonstrate its effectiveness in detecting and mitigating poisoning attacks, including strong adaptive attacks.
comment: To appear in IEEE BigData 2025
☆ Disentangling Shared and Private Neural Dynamics with SPIRE: A Latent Modeling Framework for Deep Brain Stimulation ICLR 2026
Disentangling shared network-level dynamics from region-specific activity is a central challenge in modeling multi-region neural data. We introduce SPIRE (Shared-Private Inter-Regional Encoder), a deep multi-encoder autoencoder that factorizes recordings into shared and private latent subspaces with novel alignment and disentanglement losses. Trained solely on baseline data, SPIRE robustly recovers cross-regional structure and reveals how external perturbations reorganize it. On synthetic benchmarks with ground-truth latents, SPIRE outperforms classical probabilistic models under nonlinear distortions and temporal misalignments. Applied to intracranial deep brain stimulation (DBS) recordings, SPIRE shows that shared latents reliably encode stimulation-specific signatures that generalize across sites and frequencies. These results establish SPIRE as a practical, reproducible tool for analyzing multi-region neural dynamics under stimulation.
comment: 25 pages total. Main paper (including references): 13 pages with 7 figures. Appendix: 12 pages with 5 figures and 4 tables. Submitted to ICLR 2026
☆ Towards Human-AI Synergy in Requirements Engineering: A Framework and Preliminary Study
The future of Requirements Engineering (RE) is increasingly driven by artificial intelligence (AI), reshaping how we elicit, analyze, and validate requirements. Traditional RE is based on labor-intensive manual processes prone to errors and complexity. AI-powered approaches, specifically large language models (LLMs), natural language processing (NLP), and generative AI, offer transformative solutions and reduce inefficiencies. However, the use of AI in RE also brings challenges like algorithmic bias, lack of explainability, and ethical concerns related to automation. To address these issues, this study introduces the Human-AI RE Synergy Model (HARE-SM), a conceptual framework that integrates AI-driven analysis with human oversight to improve requirements elicitation, analysis, and validation. The model emphasizes ethical AI use through transparency, explainability, and bias mitigation. We outline a multi-phase research methodology focused on preparing RE datasets, fine-tuning AI models, and designing collaborative human-AI workflows. This preliminary study presents the conceptual framework and early-stage prototype implementation, establishing a research agenda and practical design direction for applying intelligent data science techniques to semi-structured and unstructured RE data in collaborative environments.
comment: Accepted at the 2025 Sixth International Conference on Intelligent Data Science Technologies and Applications (IDSTA 2025),8 pages, 4 figures. Published in IEEE
☆ Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers
Mechanistic interpretability aims to reverse-engineer large language models (LLMs) into human-understandable computational circuits. However, the complexity of pretrained models often obscures the minimal mechanisms required for specific reasoning tasks. In this work, we train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification (IOI) task -- a benchmark for studying coreference -- like reasoning in transformers. Surprisingly, a single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking MLPs and normalization layers. Through residual stream decomposition, spectral analysis, and embedding interventions, we find that the two heads specialize into additive and contrastive subcircuits that jointly implement IOI resolution. Furthermore, we show that a two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions. These results demonstrate that task-specific training induces highly interpretable, minimal circuits, offering a controlled testbed for probing the computational foundations of transformer reasoning.
comment: 9 pages, 10 figures
☆ Taming the Real-world Complexities in CPT E/M Coding with Large Language Models EMNLP 2025
Evaluation and Management (E/M) coding, under the Current Procedural Terminology (CPT) taxonomy, documents medical services provided to patients by physicians. Used primarily for billing purposes, it is in physicians' best interest to provide accurate CPT E/M codes. %While important, it is an auxiliary task that adds to physicians' documentation burden. Automating this coding task will help alleviate physicians' documentation burden, improve billing efficiency, and ultimately enable better patient care. However, a number of real-world complexities have made E/M encoding automation a challenging task. In this paper, we elaborate some of the key complexities and present ProFees, our LLM-based framework that tackles them, followed by a systematic evaluation. On an expert-curated real-world dataset, ProFees achieves an increase in coding accuracy of more than 36\% over a commercial CPT E/M coding system and almost 5\% over our strongest single-prompt baseline, demonstrating its effectiveness in addressing the real-world complexities.
comment: EMNLP 2025 Industry Track
☆ Cyclic Counterfactuals under Shift-Scale Interventions NeurIPS 2025
Most counterfactual inference frameworks traditionally assume acyclic structural causal models (SCMs), i.e. directed acyclic graphs (DAGs). However, many real-world systems (e.g. biological systems) contain feedback loops or cyclic dependencies that violate acyclicity. In this work, we study counterfactual inference in cyclic SCMs under shift-scale interventions, i.e., soft, policy-style changes that rescale and/or shift a variable's mechanism.
comment: Accepted at NeurIPS 2025
☆ Bayesian Neural Networks vs. Mixture Density Networks: Theoretical and Empirical Insights for Uncertainty-Aware Nonlinear Modeling
This paper investigates two prominent probabilistic neural modeling paradigms: Bayesian Neural Networks (BNNs) and Mixture Density Networks (MDNs) for uncertainty-aware nonlinear regression. While BNNs incorporate epistemic uncertainty by placing prior distributions over network parameters, MDNs directly model the conditional output distribution, thereby capturing multimodal and heteroscedastic data-generating mechanisms. We present a unified theoretical and empirical framework comparing these approaches. On the theoretical side, we derive convergence rates and error bounds under H\"older smoothness conditions, showing that MDNs achieve faster Kullback-Leibler (KL) divergence convergence due to their likelihood-based nature, whereas BNNs exhibit additional approximation bias induced by variational inference. Empirically, we evaluate both architectures on synthetic nonlinear datasets and a radiographic benchmark (RSNA Pediatric Bone Age Challenge). Quantitative and qualitative results demonstrate that MDNs more effectively capture multimodal responses and adaptive uncertainty, whereas BNNs provide more interpretable epistemic uncertainty under limited data. Our findings clarify the complementary strengths of posterior-based and likelihood-based probabilistic learning, offering guidance for uncertainty-aware modeling in nonlinear systems.
comment: 20 pages, 2 figures
☆ What Really Matters in Matrix-Whitening Optimizers?
A range of recent optimizers have emerged that approximate the same "matrix-whitening" transformation in various ways. In this work, we systematically deconstruct such optimizers, aiming to disentangle the key components that explain performance. Across tuned hyperparameters across the board, all flavors of matrix-whitening methods reliably outperform elementwise counterparts, such as Adam. Matrix-whitening is often related to spectral descent -- however, experiments reveal that performance gains are *not explained solely by accurate spectral normalization* -- particularly, SOAP displays the largest per-step gain, even though Muon more accurately descends along the steepest spectral descent direction. Instead, we argue that matrix-whitening serves two purposes, and the variance adaptation component of matrix-whitening is the overlooked ingredient explaining this performance gap. Experiments show that variance-adapted versions of optimizers consistently outperform their sign-descent counterparts, including an adaptive version of Muon. We further ablate variance adaptation strategies, finding that while lookahead style approximations are not as effective, low-rank variance estimators can effectively reduce memory costs without a performance loss.
☆ Enhancing Hierarchical Reinforcement Learning through Change Point Detection in Time Series
Hierarchical Reinforcement Learning (HRL) enhances the scalability of decision-making in long-horizon tasks by introducing temporal abstraction through options-policies that span multiple timesteps. Despite its theoretical appeal, the practical implementation of HRL suffers from the challenge of autonomously discovering semantically meaningful subgoals and learning optimal option termination boundaries. This paper introduces a novel architecture that integrates a self-supervised, Transformer-based Change Point Detection (CPD) module into the Option-Critic framework, enabling adaptive segmentation of state trajectories and the discovery of options. The CPD module is trained using heuristic pseudo-labels derived from intrinsic signals to infer latent shifts in environment dynamics without external supervision. These inferred change-points are leveraged in three critical ways: (i) to serve as supervisory signals for stabilizing termination function gradients, (ii) to pretrain intra-option policies via segment-wise behavioral cloning, and (iii) to enforce functional specialization through inter-option divergence penalties over CPD-defined state partitions. The overall optimization objective enhances the standard actor-critic loss using structure-aware auxiliary losses. In our framework, option discovery arises naturally as CPD-defined trajectory segments are mapped to distinct intra-option policies, enabling the agent to autonomously partition its behavior into reusable, semantically meaningful skills. Experiments on the Four-Rooms and Pinball tasks demonstrate that CPD-guided agents exhibit accelerated convergence, higher cumulative returns, and significantly improved option specialization. These findings confirm that integrating structural priors via change-point segmentation leads to more interpretable, sample-efficient, and robust hierarchical policies in complex environments.
☆ scMRDR: A scalable and flexible framework for unpaired single-cell multi-omics data integration NeurIPS 2025
Advances in single-cell sequencing have enabled high-resolution profiling of diverse molecular modalities, while integrating unpaired multi-omics single-cell data remains challenging. Existing approaches either rely on pair information or prior correspondences, or require computing a global pairwise coupling matrix, limiting their scalability and flexibility. In this paper, we introduce a scalable and flexible generative framework called single-cell Multi-omics Regularized Disentangled Representations (scMRDR) for unpaired multi-omics integration. Specifically, we disentangle each cell's latent representations into modality-shared and modality-specific components using a well-designed $\beta$-VAE architecture, which are augmented with isometric regularization to preserve intra-omics biological heterogeneity, adversarial objective to encourage cross-modal alignment, and masked reconstruction loss strategy to address the issue of missing features across modalities. Our method achieves excellent performance on benchmark datasets in terms of batch correction, modality alignment, and biological signal preservation. Crucially, it scales effectively to large-level datasets and supports integration of more than two omics, offering a powerful and flexible solution for large-scale multi-omics data integration and downstream biological discovery.
comment: Accepted at NeurIPS 2025 (Spotlight)
☆ Epileptic Seizure Detection and Prediction from EEG Data: A Machine Learning Approach with Clinical Validation
In recent years, machine learning has become an increasingly powerful tool for supporting seizure detection and monitoring in epilepsy care. Traditional approaches focus on identifying seizures only after they begin, which limits the opportunity for early intervention and proactive treatment. In this study, we propose a novel approach that integrates both real-time seizure detection and prediction, aiming to capture subtle temporal patterns in EEG data that may indicate an upcoming seizure. Our approach was evaluated using the CHB-MIT Scalp EEG Database, which includes 969 hours of recordings and 173 seizures collected from 23 pediatric and young adult patients with drug-resistant epilepsy. To support seizure detection, we implemented a range of supervised machine learning algorithms, including K-Nearest Neighbors, Logistic Regression, Random Forest, and Support Vector Machine. The Logistic Regression achieved 90.9% detection accuracy with 89.6% recall, demonstrating balanced performance suitable for clinical screening. Random Forest and Support Vector Machine models achieved higher accuracy (94.0%) but with 0% recall, failing to detect any seizures, illustrating that accuracy alone is insufficient for evaluating medical ML models with class imbalance. For seizure prediction, we employed Long Short-Term Memory (LSTM) networks, which use deep learning to model temporal dependencies in EEG data. The LSTM model achieved 89.26% prediction accuracy. These results highlight the potential of developing accessible, real-time monitoring tools that not only detect seizures as traditionally done, but also predict them before they occur. This ability to predict seizures marks a significant shift from reactive seizure management to a more proactive approach, allowing patients to anticipate seizures and take precautionary measures to reduce the risk of injury or other complications.
comment: 9 pages, 3 figures
☆ LRT-Diffusion: Calibrated Risk-Aware Guidance for Diffusion Policies
Diffusion policies are competitive for offline reinforcement learning (RL) but are typically guided at sampling time by heuristics that lack a statistical notion of risk. We introduce LRT-Diffusion, a risk-aware sampling rule that treats each denoising step as a sequential hypothesis test between the unconditional prior and the state-conditional policy head. Concretely, we accumulate a log-likelihood ratio and gate the conditional mean with a logistic controller whose threshold tau is calibrated once under H0 to meet a user-specified Type-I level alpha. This turns guidance from a fixed push into an evidence-driven adjustment with a user-interpretable risk budget. Importantly, we deliberately leave training vanilla (two heads with standard epsilon-prediction) under the structure of DDPM. LRT guidance composes naturally with Q-gradients: critic-gradient updates can be taken at the unconditional mean, at the LRT-gated mean, or a blend, exposing a continuum from exploitation to conservatism. We standardize states and actions consistently at train and test time and report a state-conditional out-of-distribution (OOD) metric alongside return. On D4RL MuJoCo tasks, LRT-Diffusion improves the return-OOD trade-off over strong Q-guided baselines in our implementation while honoring the desired alpha. Theoretically, we establish level-alpha calibration, concise stability bounds, and a return comparison showing when LRT surpasses Q-guidance-especially when off-support errors dominate. Overall, LRT-Diffusion is a drop-in, inference-time method that adds principled, calibrated risk control to diffusion policies for offline RL.
☆ Strategic inputs: feature selection from game-theoretic perspective
The exponential growth of data volumes has led to escalating computational costs in machine learning model training. However, many features fail to contribute positively to model performance while consuming substantial computational resources. This paper presents an end-to-end feature selection framework for tabular data based on game theory. We formulate feature selection procedure based on a cooperative game where features are modeled as players, and their importance is determined through the evaluation of synergistic interactions and marginal contributions. The proposed framework comprises four core components: sample selection, game-theoretic feature importance evaluation, redundant feature elimination, and optimized model training. Experimental results demonstrate that the proposed method achieves substantial computation reduction while preserving predictive performance, thereby offering an efficient solution of the computational challenges of large-scale machine learning. The source code is available at https://github.com/vectorsss/strategy_inputs.
☆ Conformational Rank Conditioned Committees for Machine Learning-Assisted Directed Evolution
Machine Learning-assisted directed evolution (MLDE) is a powerful tool for efficiently navigating antibody fitness landscapes. Many structure-aware MLDE pipelines rely on a single conformation or a single committee across all conformations, limiting their ability to separate conformational uncertainty from epistemic uncertainty. Here, we introduce a rank -conditioned committee (RCC) framework that leverages ranked conformations to assign a deep neural network committee per rank. This design enables a principled separation between epistemic uncertainty and conformational uncertainty. We validate our approach on SARS-CoV-2 antibody docking, demonstrating significant improvements over baseline strategies. Our results offer a scalable route for therapeutic antibody discovery while directly addressing the challenge of modeling conformational uncertainty.
☆ Sequences of Logits Reveal the Low Rank Structure of Language Models
A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models at a model-agnostic level: as sequential probabilistic models. We first empirically demonstrate that a wide range of modern language models exhibit low-rank structure: in particular, matrices built from the model's logits for varying sets of prompts and responses have low approximate rank. We then show that this low-rank structure can be leveraged for generation -- in particular, we can generate a response to a target prompt using a linear combination of the model's outputs on unrelated, or even nonsensical prompts. On the theoretical front, we observe that studying the approximate rank of language models in the sense discussed above yields a simple universal abstraction whose theoretical predictions parallel our experiments. We then analyze the representation power of the abstraction and give provable learning guarantees.
☆ Resource-Efficient and Robust Inference of Deep and Bayesian Neural Networks on Embedded and Analog Computing Platforms
While modern machine learning has transformed numerous application domains, its growing computational demands increasingly constrain scalability and efficiency, particularly on embedded and resource-limited platforms. In practice, neural networks must not only operate efficiently but also provide reliable predictions under distributional shifts or unseen data. Bayesian neural networks offer a principled framework for quantifying uncertainty, yet their computational overhead further compounds these challenges. This work advances resource-efficient and robust inference for both conventional and Bayesian neural networks through the joint pursuit of algorithmic and hardware efficiency. The former reduces computation through model compression and approximate Bayesian inference, while the latter optimizes deployment on digital accelerators and explores analog hardware, bridging algorithmic design and physical realization. The first contribution, Galen, performs automatic layer-specific compression guided by sensitivity analysis and hardware-in-the-loop feedback. Analog accelerators offer efficiency gains at the cost of noise; this work models device imperfections and extends noisy training to nonstationary conditions, improving robustness and stability. A second line of work advances probabilistic inference, developing analytic and ensemble approximations that replace costly sampling, integrate into a compiler stack, and optimize embedded inference. Finally, probabilistic photonic computing introduces a paradigm where controlled analog noise acts as an intrinsic entropy source, enabling fast, energy-efficient probabilistic inference directly in hardware. Together, these studies demonstrate how efficiency and reliability can be advanced jointly through algorithm-hardware co-design, laying the foundation for the next generation of trustworthy, energy-efficient machine-learning systems.
comment: Ph.D. dissertation, Heidelberg University, October 2025
☆ SCOUT: A Lightweight Framework for Scenario Coverage Assessment in Autonomous Driving
Assessing scenario coverage is crucial for evaluating the robustness of autonomous agents, yet existing methods rely on expensive human annotations or computationally intensive Large Vision-Language Models (LVLMs). These approaches are impractical for large-scale deployment due to cost and efficiency constraints. To address these shortcomings, we propose SCOUT (Scenario Coverage Oversight and Understanding Tool), a lightweight surrogate model designed to predict scenario coverage labels directly from an agent's latent sensor representations. SCOUT is trained through a distillation process, learning to approximate LVLM-generated coverage labels while eliminating the need for continuous LVLM inference or human annotation. By leveraging precomputed perception features, SCOUT avoids redundant computations and enables fast, scalable scenario coverage estimation. We evaluate our method across a large dataset of real-life autonomous navigation scenarios, demonstrating that it maintains high accuracy while significantly reducing computational cost. Our results show that SCOUT provides an effective and practical alternative for large-scale coverage analysis. While its performance depends on the quality of LVLM-generated training labels, SCOUT represents a major step toward efficient scenario coverage oversight in autonomous systems.
☆ Finding Culture-Sensitive Neurons in Vision-Language Models
Despite their impressive performance, vision-language models (VLMs) still struggle on culturally situated inputs. To understand how VLMs process culturally grounded information, we study the presence of culture-sensitive neurons, i.e. neurons whose activations show preferential sensitivity to inputs associated with particular cultural contexts. We examine whether such neurons are important for culturally diverse visual question answering and where they are located. Using the CVQA benchmark, we identify neurons of culture selectivity and perform causal tests by deactivating the neurons flagged by different identification methods. Experiments on three VLMs across 25 cultural groups demonstrate the existence of neurons whose ablation disproportionately harms performance on questions about the corresponding cultures, while having minimal effects on others. Moreover, we propose a new margin-based selector - Contrastive Activation Selection (CAS), and show that it outperforms existing probability- and entropy-based methods in identifying culture-sensitive neurons. Finally, our layer-wise analyses reveals that such neurons tend to cluster in certain decoder layers. Overall, our findings shed new light on the internal organization of multimodal representations.
comment: 22 pages, 13 figures
☆ Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought
Recent large language models (LLMs) can generate long Chain-of-Thought (CoT) at test time, enabling them to solve complex tasks. These reasoning steps in CoT are often assumed as a faithful reflection of the model's internal thinking process, and used to monitor unsafe intentions. However, we find many reasoning steps don't truly contribute to LLMs' prediction. We measure the step-wise causal influence of each reasoning step on the model's final prediction with a proposed True Thinking Score (TTS). We reveal that LLMs often interleave between true-thinking steps (which are genuinely used to produce the final output) and decorative-thinking steps (which only give the appearance of reasoning but have minimal causal impact). Notably, only a small subset of the total reasoning steps have a high TTS that causally drive the model's prediction: e.g., for the AIME dataset, only an average of 2.3% of reasoning steps in CoT have a TTS >= 0.7 (range: 0-1) under the Qwen-2.5 model. Furthermore, we identify a TrueThinking direction in the latent space of LLMs. By steering along or against this direction, we can force the model to perform or disregard certain CoT steps when computing the final result. Finally, we highlight that self-verification steps in CoT (i.e., aha moments) can also be decorative, where LLMs do not truly verify their solution. Steering along the TrueThinking direction can force internal reasoning over these steps, resulting in a change in the final results. Overall, our work reveals that LLMs often verbalize reasoning steps without actually performing them internally, which undermines both the efficiency of LLM reasoning and the trustworthiness of CoT.
☆ WBT-BGRL: A Non-Contrastive Weighted Bipartite Link Prediction Model for Inductive Learning SC
Link prediction in bipartite graphs is crucial for applications like recommendation systems and failure detection, yet it is less studied than in monopartite graphs. Contrastive methods struggle with inefficient and biased negative sampling, while non-contrastive approaches rely solely on positive samples. Existing models perform well in transductive settings, but their effectiveness in inductive, weighted, and bipartite scenarios remains untested. To address this, we propose Weighted Bipartite Triplet-Bootstrapped Graph Latents (WBT-BGRL), a non-contrastive framework that enhances bootstrapped learning with a novel weighting mechanism in the triplet loss. Using a bipartite architecture with dual GCN encoders, WBT-BGRL is evaluated against adapted state-of-the-art models (T-BGRL, BGRL, GBT, CCA-SSG). Results on real-world datasets (Industry and E-commerce) show competitive performance, especially when weighting is applied during pretraining-highlighting the value of weighted, non-contrastive learning for inductive link prediction in bipartite graphs.
comment: 5 pages, submitted to the 12th International Conference on Soft Computing and Machine Intelligence (ISCMI 2025)
☆ KAN-GCN: Combining Kolmogorov-Arnold Network with Graph Convolution Network for an Accurate Ice Sheet Emulator NeurIPS 2025
We introduce KAN-GCN, a fast and accurate emulator for ice sheet modeling that places a Kolmogorov-Arnold Network (KAN) as a feature-wise calibrator before graph convolution networks (GCNs). The KAN front end applies learnable one-dimensional warps and a linear mixing step, improving feature conditioning and nonlinear encoding without increasing message-passing depth. We employ this architecture to improve the performance of emulators for numerical ice sheet models. Our emulator is trained and tested using 36 melting-rate simulations with 3 mesh-size settings for Pine Island Glacier, Antarctica. Across 2- to 5-layer architectures, KAN-GCN matches or exceeds the accuracy of pure GCN and MLP-GCN baselines. Despite a small parameter overhead, KAN-GCN improves inference throughput on coarser meshes by replacing one edge-wise message-passing layer with a node-wise transform; only the finest mesh shows a modest cost. Overall, KAN-first designs offer a favorable accuracy vs. efficiency trade-off for large transient scenario sweeps.
comment: Accept for NeurIPS 2025 Workshop: New Perspectives in Graph Machine Learning
☆ Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning
In multimodal learning, dominant modalities often overshadow others, limiting generalization. We propose Modality-Aware Sharpness-Aware Minimization (M-SAM), a model-agnostic framework that applies to many modalities and supports early and late fusion scenarios. In every iteration, M-SAM in three steps optimizes learning. \textbf{First, it identifies the dominant modality} based on modalities' contribution in the accuracy using Shapley. \textbf{Second, it decomposes the loss landscape}, or in another language, it modulates the loss to prioritize the robustness of the model in favor of the dominant modality, and \textbf{third, M-SAM updates the weights} by backpropagation of modulated gradients. This ensures robust learning for the dominant modality while enhancing contributions from others, allowing the model to explore and exploit complementary features that strengthen overall performance. Extensive experiments on four diverse datasets show that M-SAM outperforms the latest state-of-the-art optimization and gradient manipulation methods and significantly balances and improves multimodal learning.
☆ Topic Analysis with Side Information: A Neural-Augmented LDA Approach
Traditional topic models such as Latent Dirichlet Allocation (LDA) have been widely used to uncover latent structures in text corpora, but they often struggle to integrate auxiliary information such as metadata, user attributes, or document labels. These limitations restrict their expressiveness, personalization, and interpretability. To address this, we propose nnLDA, a neural-augmented probabilistic topic model that dynamically incorporates side information through a neural prior mechanism. nnLDA models each document as a mixture of latent topics, where the prior over topic proportions is generated by a neural network conditioned on auxiliary features. This design allows the model to capture complex nonlinear interactions between side information and topic distributions that static Dirichlet priors cannot represent. We develop a stochastic variational Expectation-Maximization algorithm to jointly optimize the neural and probabilistic components. Across multiple benchmark datasets, nnLDA consistently outperforms LDA and Dirichlet-Multinomial Regression in topic coherence, perplexity, and downstream classification. These results highlight the benefits of combining neural representation learning with probabilistic topic modeling in settings where side information is available.
☆ Understanding Multi-View Transformers ICCV 2025
Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. However, contrary to previous optimization-based pipelines, the inner mechanisms of multi-view transformers are unclear. Their black-box nature makes further improvements beyond data scaling challenging and complicates usage in safety- and reliability-critical applications. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers' layers. In this manner, we investigate a variant of the DUSt3R model, shedding light on the development of its latent state across blocks, the role of the individual layers, and suggest how it differs from methods with stronger inductive biases of explicit global pose. Finally, we show that the investigated variant of DUSt3R estimates correspondences that are refined with reconstructed geometry. The code used for the analysis is available at https://github.com/JulienGaubil/und3rstand .
comment: Presented at the ICCV 2025 E2E3D Workshop
☆ Idea2Plan: Exploring AI-Powered Research Planning
Large language models (LLMs) have demonstrated significant potential to accelerate scientific discovery as valuable tools for analyzing data, generating hypotheses, and supporting innovative approaches in various scientific fields. In this work, we investigate how LLMs can handle the transition from conceptual research ideas to well-structured research plans. Effective research planning not only supports scientists in advancing their research but also represents a crucial capability for the development of autonomous research agents. Despite its importance, the field lacks a systematic understanding of LLMs' research planning capability. To rigorously measure this capability, we introduce the Idea2Plan task and Idea2Plan Bench, a benchmark built from 200 ICML 2025 Spotlight and Oral papers released after major LLM training cutoffs. Each benchmark instance includes a research idea and a grading rubric capturing the key components of valid plans. We further propose Idea2Plan JudgeEval, a complementary benchmark to assess the reliability of LLM-based judges against expert annotations. Experimental results show that GPT-5 and GPT-5-mini achieve the strongest performance on the benchmark, though substantial headroom remains for future improvement. Our study provides new insights into LLMs' capability for research planning and lay the groundwork for future progress.
☆ Adaptive EEG-based stroke diagnosis with a GRU-TCN classifier and deep Q-learning thresholding
Rapid triage of suspected stroke needs accurate, bedside-deployable tools; EEG is promising but underused at first contact. We present an adaptive multitask EEG classifier that converts 32-channel signals to power spectral density features (Welch), uses a recurrent-convolutional network (GRU-TCN) to predict stroke type (healthy, ischemic, hemorrhagic), hemispheric lateralization, and severity, and applies a deep Q-network (DQN) to tune decision thresholds in real time. Using a patient-wise split of the UCLH Stroke EIT/EEG data set (44 recordings; about 26 acute stroke, 10 controls), the primary outcome was stroke-type performance; secondary outcomes were severity and lateralization. The baseline GRU-TCN reached 89.3% accuracy (F1 92.8%) for stroke type, about 96.9% (F1 95.9%) for severity, and about 96.7% (F1 97.4%) for lateralization. With DQN threshold adaptation, stroke-type accuracy increased to about 98.0% (F1 97.7%). We also tested robustness on an independent, low-density EEG cohort (ZJU4H) and report paired patient-level statistics. Analyses follow STARD 2015 guidance for diagnostic accuracy studies (index test: GRU-TCN+DQN; reference standard: radiology/clinical diagnosis; patient-wise evaluation). Adaptive thresholding shifts the operating point to clinically preferred sensitivity-specificity trade-offs, while integrated scalp-map and spectral visualizations support interpretability.
comment: 10 pages, 6 figures. Equal contribution: Shakeel Abdulkareem and Bora Yimenicioglu. Compiled with pdfLaTeX (wlscirep class)
☆ Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations NeurIPS 2025
Benchmarks for out-of-distribution (OOD) generalization frequently show a strong positive correlation between in-distribution (ID) and OOD accuracy across models, termed "accuracy-on-the-line." This pattern is often taken to imply that spurious correlations - correlations that improve ID but reduce OOD performance - are rare in practice. We find that this positive correlation is often an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy on the line does not hold. Across widely used distribution shift benchmarks, the OODSelect uncovers subsets, sometimes over half of the standard OOD set, where higher ID accuracy predicts lower OOD accuracy. Our findings indicate that aggregate metrics can obscure important failure modes of OOD robustness. We release code and the identified subsets to facilitate further research.
comment: Accepted as a Spotlight paper at NeurIPS 2025
☆ The Generation Phases of Flow Matching: a Denoising Perspective
Flow matching has achieved remarkable success, yet the factors influencing the quality of its generation process remain poorly understood. In this work, we adopt a denoising perspective and design a framework to empirically probe the generation process. Laying down the formal connections between flow matching models and denoisers, we provide a common ground to compare their performances on generation and denoising. This enables the design of principled and controlled perturbations to influence sample generation: noise and drift. This leads to new insights on the distinct dynamical phases of the generative process, enabling us to precisely characterize at which stage of the generative process denoisers succeed or fail and why this matters.
☆ Send Less, Save More: Energy-Efficiency Benchmark of Embedded CNN Inference vs. Data Transmission in IoT
The integration of the Internet of Things (IoT) and Artificial Intelligence offers significant opportunities to enhance our ability to monitor and address ecological changes. As environmental challenges become increasingly pressing, the need for effective remote monitoring solutions is more critical than ever. A major challenge in designing IoT applications for environmental monitoring - particularly those involving image data - is to create energy-efficient IoT devices capable of long-term operation in remote areas with limited power availability. Advancements in the field of Tiny Machine Learning allow the use of Convolutional Neural Networks (CNNs) on resource-constrained, battery-operated microcontrollers. Since data transfer is energy-intensive, performing inference directly on microcontrollers to reduce the message size can extend the operational lifespan of IoT nodes. This work evaluates the use of common Low Power Wide Area Networks and compressed CNNs trained on domain specific datasets on an ESP32-S3. Our experiments demonstrate, among other things, that executing CNN inference on-device and transmitting only the results reduces the overall energy consumption by a factor of up to five compared to sending raw image data. %The compression of the model using Post Training Quantization is accompanied by an acceptable reduction in accuracy of only a few percentage points compared to a non-quantized model. These findings advocate the development of IoT applications with reduced carbon footprint and capable of operating autonomously in environmental monitoring scenarios by incorporating Embedded Machine Learning.
comment: 11 Pages, Paper lists the categories for the ACM Computing Classification System
☆ Augmenting Biological Fitness Prediction Benchmarks with Landscapes Features from GraphFLA NeurIPS 2025
Machine learning models increasingly map biological sequence-fitness landscapes to predict mutational effects. Effective evaluation of these models requires benchmarks curated from empirical data. Despite their impressive scales, existing benchmarks lack topographical information regarding the underlying fitness landscapes, which hampers interpretation and comparison of model performance beyond averaged scores. Here, we introduce GraphFLA, a Python framework that constructs and analyzes fitness landscapes from mutagensis data in diverse modalities (e.g., DNA, RNA, protein, and beyond) with up to millions of mutants. GraphFLA calculates 20 biologically relevant features that characterize 4 fundamental aspects of landscape topography. By applying GraphFLA to over 5,300 landscapes from ProteinGym, RNAGym, and CIS-BP, we demonstrate its utility in interpreting and comparing the performance of dozens of fitness prediction models, highlighting factors influencing model accuracy and respective advantages of different models. In addition, we release 155 combinatorially complete empirical fitness landscapes, encompassing over 2.2 million sequences across various modalities. All the codes and datasets are available at https://github.com/COLA-Laboratory/GraphFLA.
comment: 56 apges, 18 figures, 8 tables, accepted as a conference paper at NeurIPS 2025
♻ ☆ Physics-Informed Latent Neural Operator for Real-time Predictions of time-dependent parametric PDEs
Deep operator network (DeepONet) has shown significant promise as surrogate models for systems governed by partial differential equations (PDEs), enabling accurate mappings between infinite-dimensional function spaces. However, when applied to systems with high-dimensional input-output mappings arising from large numbers of spatial and temporal collocation points, these models often require heavily overparameterized networks, leading to long training times. Latent DeepONet addresses some of these challenges by introducing a two-step approach: first learning a reduced latent space using a separate model, followed by operator learning within this latent space. While efficient, this method is inherently data-driven and lacks mechanisms for incorporating physical laws, limiting its robustness and generalizability in data-scarce settings. In this work, we propose PI-Latent-NO, a physics-informed latent neural operator framework that integrates governing physics directly into the learning process. Our architecture features two coupled DeepONets trained end-to-end: a Latent-DeepONet that learns a low-dimensional representation of the solution, and a Reconstruction-DeepONet that maps this latent representation back to the physical space. By embedding PDE constraints into the training via automatic differentiation, our method eliminates the need for labeled training data and ensures physics-consistent predictions. The proposed framework is both memory and compute-efficient, exhibiting near-constant scaling with problem size and demonstrating significant speedups over traditional physics-informed operator models. We validate our approach on a range of parametric PDEs, showcasing its accuracy, scalability, and suitability for real-time prediction in complex physical systems.
♻ ☆ DeltaPhi: Physical States Residual Learning for Neural Operators in Data-Limited PDE Solving
The limited availability of high-quality training data poses a major obstacle in data-driven PDE solving, where expensive data collection and resolution constraints severely impact the ability of neural operator networks to learn and generalize the underlying physical system. To address this challenge, we propose DeltaPhi, a novel learning framework that transforms the PDE solving task from learning direct input-output mappings to learning the residuals between similar physical states, a fundamentally different approach to neural operator learning. This reformulation provides implicit data augmentation by exploiting the inherent stability of physical systems where closer initial states lead to closer evolution trajectories. DeltaPhi is architecture-agnostic and can be seamlessly integrated with existing neural operators to enhance their performance. Extensive experiments demonstrate consistent and significant improvements across diverse physical systems including regular and irregular domains, different neural architectures, multiple training data amount, and cross-resolution scenarios, confirming its effectiveness as a general enhancement for neural operators in data-limited PDE solving.
comment: Neurips 2025
♻ ☆ Datasheets for Machine Learning Sensors
Machine learning (ML) is becoming prevalent in embedded AI sensing systems. These "ML sensors" enable context-sensitive, real-time data collection and decision-making across diverse applications ranging from anomaly detection in industrial settings to wildlife tracking for conservation efforts. As such, there is a need to provide transparency in the operation of such ML-enabled sensing systems through comprehensive documentation. This is needed to enable their reproducibility, to address new compliance and auditing regimes mandated in regulation and industry-specific policy, and to verify and validate the responsible nature of their operation. To address this gap, we introduce the datasheet for ML sensors framework. We provide a comprehensive template, collaboratively developed in academia-industry partnerships, that captures the distinct attributes of ML sensors, including hardware specifications, ML model and dataset characteristics, end-to-end performance metrics, and environmental impacts. Our framework addresses the continuous streaming nature of sensor data, real-time processing requirements, and embeds benchmarking methodologies that reflect real-world deployment conditions, ensuring practical viability. Aligned with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability), our approach enhances the transparency and reusability of ML sensor documentation across academic, industrial, and regulatory domains. To show the application of our approach, we present two datasheets: the first for an open-source ML sensor designed in-house and the second for a commercial ML sensor developed by industry collaborators, both performing computer vision-based person detection.
♻ ☆ ADMN: A Layer-Wise Adaptive Multimodal Network for Dynamic Input Noise and Compute Resources
Multimodal deep learning systems are deployed in dynamic scenarios due to the robustness afforded by multiple sensing modalities. Nevertheless, they struggle with varying compute resource availability (due to multi-tenancy, device heterogeneity, etc.) and fluctuating quality of inputs (from sensor feed corruption, environmental noise, etc.). Statically provisioned multimodal systems cannot adapt when compute resources change over time, while existing dynamic networks struggle with strict compute budgets. Additionally, both systems often neglect the impact of variations in modality quality. Consequently, modalities suffering substantial corruption may needlessly consume resources better allocated towards other modalities. We propose ADMN, a layer-wise Adaptive Depth Multimodal Network capable of tackling both challenges: it adjusts the total number of active layers across all modalities to meet strict compute resource constraints and continually reallocates layers across input modalities according to their modality quality. Our evaluations showcase ADMN can match the accuracy of state-of-the-art networks while reducing up to 75% of their floating-point operations.
comment: Accepted to Neurips 2025
♻ ☆ SGFusion: Stochastic Geographic Gradient Fusion in Federated Learning
This paper proposes Stochastic Geographic Gradient Fusion (SGFusion), a novel training algorithm to leverage the geographic information of mobile users in Federated Learning (FL). SGFusion maps the data collected by mobile devices onto geographical zones and trains one FL model per zone, which adapts well to the data and behaviors of users in that zone. SGFusion models the local data-based correlation among geographical zones as a hierarchical random graph (HRG) optimized by Markov Chain Monte Carlo sampling. At each training step, every zone fuses its local gradient with gradients derived from a small set of other zones sampled from the HRG. This approach enables knowledge fusion and sharing among geographical zones in a probabilistic and stochastic gradient fusion process with self-attention weights, such that "more similar" zones have "higher probabilities" of sharing gradients with "larger attention weights." SGFusion remarkably improves model utility without introducing undue computational cost. Extensive theoretical and empirical results using a heart-rate prediction dataset collected across 6 countries show that models trained with SGFusion converge with upper-bounded expected errors and significantly improve utility in all countries compared to existing approaches without notable cost in system scalability.
♻ ☆ Hybrid Deep Learning Model to Estimate Cognitive Effort from fNIRS Signals
This study estimates cognitive effort based on functional near-infrared spectroscopy data and performance scores using a hybrid DeepNet model. The estimation of cognitive effort enables educators to modify material to enhance learning effectiveness and student engagement. In this study, we collected oxygenated hemoglobin using functional near-infrared spectroscopy during an educational quiz game. Participants (n=16) responded to 16 questions in a Unity-based educational game, each within a 30-second response time limit. We used DeepNet models to predict the performance score from the oxygenated hemoglobin, and compared traditional machine learning and DeepNet models to determine which approach provides better accuracy in predicting performance scores. The result shows that the proposed CNN-GRU gives better performance with 73% than other models. After the prediction, we used the predicted score and the oxygenated hemoglobin to observe cognitive effort by calculating relative neural efficiency and involvement in our test cases. Our result shows that even with moderate accuracy, the predicted cognitive effort closely follow the actual trends. This findings can be helpful in designing and improving learning environments and provide valuable insights into learning materials.
♻ ☆ Global Optimization of Gaussian Process Acquisition Functions Using a Piecewise-Linear Kernel Approximation
Bayesian optimization relies on iteratively constructing and optimizing an acquisition function. The latter turns out to be a challenging, non-convex optimization problem itself. Despite the relative importance of this step, most algorithms employ sampling- or gradient-based methods, which do not provably converge to global optima. This work investigates mixed-integer programming (MIP) as a paradigm for global acquisition function optimization. Specifically, our Piecewise-linear Kernel Mixed Integer Quadratic Programming (PK-MIQP) formulation introduces a piecewise-linear approximation for Gaussian process kernels and admits a corresponding MIQP representation for acquisition functions. The proposed method is applicable to uncertainty-based acquisition functions for any stationary or dot-product kernel. We analyze the theoretical regret bounds of the proposed approximation, and empirically demonstrate the framework on synthetic functions, constrained benchmarks, and a hyperparameter tuning task.
comment: 18 pages, 4 figures, 5 tables
♻ ☆ Says Who? Effective Zero-Shot Annotation of Focalization
Focalization describes the way in which access to narrative information is restricted or controlled based on the knowledge available to knowledge of the narrator. It is encoded via a wide range of lexico-grammatical features and is subject to reader interpretation. Even trained annotators frequently disagree on correct labels, suggesting this task is both qualitatively and computationally challenging. In this work, we test how well five contemporary large language model (LLM) families and two baselines perform when annotating short literary excerpts for focalization. Despite the challenging nature of the task, we find that LLMs show comparable performance to trained human annotators, with GPT-4o achieving an average F1 of 84.79%. Further, we demonstrate that the log probabilities output by GPT-family models frequently reflect the difficulty of annotating particular excerpts. Finally, we provide a case study analyzing sixteen Stephen King novels, demonstrating the usefulness of this approach for computational literary studies and the insights gleaned from examining focalization at scale.
comment: Accepted at CHR 2025
♻ ☆ TableTime: Reformulating Time Series Classification as Training-Free Table Understanding with Large Language Models
Large language models (LLMs) have demonstrated their effectiveness in multivariate time series classification (MTSC). Effective adaptation of LLMs for MTSC necessitates informative data representations. Existing LLM-based methods directly encode embeddings for time series within the latent space of LLMs from scratch to align with semantic space of LLMs. Despite their effectiveness, we reveal that these methods conceal three inherent bottlenecks: (1) they struggle to encode temporal and channel-specific information in a lossless manner, both of which are critical components of multivariate time series; (2) it is much difficult to align the learned representation space with the semantic space of the LLMs; (3) they require task-specific retraining, which is both computationally expensive and labor-intensive. To bridge these gaps, we propose TableTime, which reformulates MTSC as a table understanding task. Specifically, TableTime introduces the following strategies: (1) convert multivariate time series into a tabular form, thus minimizing information loss to the greatest extent; (2) represent tabular time series in text format to achieve natural alignment with the semantic space of LLMs; (3) design a reasoning framework that integrates contextual text information, neighborhood assistance, multi-path inference and problem decomposition to enhance the reasoning ability of LLMs and realize zero-shot classification. Extensive experiments performed on 10 publicly representative datasets from UEA archive verify the superiorities of the TableTime.
♻ ☆ FedMAP: Personalised Federated Learning for Real Large-Scale Healthcare Systems
Federated learning (FL) promises to enable collaborative machine learning across healthcare sites whilst preserving data privacy. Practical deployment remains limited by statistical heterogeneity arising from differences in patient demographics, treatments, and outcomes, and infrastructure constraints. We introduce FedMAP, a personalised FL (PFL) framework that addresses heterogeneity through local Maximum a Posteriori (MAP) estimation with Input Convex Neural Network priors. These priors represent global knowledge gathered from other sites that guides the model while adapting to local data, and we provide a formal proof of convergence. Unlike many PFL methods that rely on fixed regularisation, FedMAP's prior adaptively learns patterns that capture complex inter-site relationships. We demonstrate improved performance compared to local training, FedAvg, and several PFL methods across three large-scale clinical datasets: 10-year cardiovascular risk prediction (CPRD, 387 general practitioner practices, 258,688 patients), iron deficiency detection (INTERVAL, 4 donor centres, 31,949 blood donors), and mortality prediction (eICU, 150 hospitals, 44,842 patients). FedMAP incorporates a three-tier design that enables participation across healthcare sites with varying infrastructure and technical capabilities, from full federated training to inference-only deployment. Geographical analysis reveals substantial equity improvements, with underperforming regions achieving up to 14.3% performance gains. This framework provides the first practical pathway for large-scale healthcare FL deployment, which ensures clinical sites at all scales can benefit, equity is enhanced, and privacy is retained.
♻ ☆ GST-UNet: A Neural Framework for Spatiotemporal Causal Inference with Time-Varying Confounding NeurIPS 2025
Estimating causal effects from spatiotemporal observational data is essential in public health, environmental science, and policy evaluation, where randomized experiments are often infeasible. Existing approaches, however, either rely on strong structural assumptions or fail to handle key challenges such as interference, spatial confounding, temporal carryover, and time-varying confounding -- where covariates are influenced by past treatments and, in turn, affect future ones. We introduce GST-UNet (G-computation Spatio-Temporal UNet), a theoretically grounded neural framework that combines a U-Net-based spatiotemporal encoder with regression-based iterative G-computation to estimate location-specific potential outcomes under complex intervention sequences. GST-UNet explicitly adjusts for time-varying confounders and captures non-linear spatial and temporal dependencies, enabling valid causal inference from a single observed trajectory in data-scarce settings. We validate its effectiveness in synthetic experiments and in a real-world analysis of wildfire smoke exposure and respiratory hospitalizations during the 2018 California Camp Fire. Together, these results position GST-UNet as a principled and ready-to-use framework for spatiotemporal causal inference, advancing reliable estimation in policy-relevant and scientific domains.
comment: 29 pages, 6 figures, 6 tables, NeurIPS 2025
♻ ☆ Adaptive Anomaly Detection in Network Flows with Low-Rank Tensor Decompositions and Deep Unrolling
Anomaly detection (AD) is increasingly recognized as a key component for ensuring the resilience of future communication systems. While deep learning has shown state-of-the-art AD performance, its application in critical systems is hindered by concerns regarding training data efficiency, domain adaptation and interpretability. This work considers AD in network flows using incomplete measurements, leveraging a robust tensor decomposition approach and deep unrolling techniques to address these challenges. We first propose a novel block-successive convex approximation algorithm based on a regularized model-fitting objective where the normal flows are modeled as low-rank tensors and anomalies as sparse. An augmentation of the objective is introduced to decrease the computational cost. We apply deep unrolling to derive a novel deep network architecture based on our proposed algorithm, treating the regularization parameters as learnable weights. Inspired by Bayesian approaches, we extend the model architecture to perform online adaptation to per-flow and per-time-step statistics, improving AD performance while maintaining a low parameter count and preserving the problem's permutation equivariances. To optimize the deep network weights for detection performance, we employ a homotopy optimization approach based on an efficient approximation of the area under the receiver operating characteristic curve. Extensive experiments on synthetic and real-world data demonstrate that our proposed deep network architecture exhibits a high training data efficiency, outperforms reference methods, and adapts seamlessly to varying network topologies.
comment: 18 pages, 7 figures
♻ ☆ Robust Uncertainty Quantification for Self-Evolving Large Language Models via Continual Domain Pretraining
Continual Learning (CL) is essential for enabling self-evolving large language models (LLMs) to adapt and remain effective amid rapid knowledge growth. Yet, despite its importance, little attention has been given to establishing statistical reliability guarantees for LLMs under CL, particularly in the setting of continual domain pretraining (CDP). Conformal Prediction (CP) has shown promise in offering correctness guarantees for LLMs, but it faces major challenges in CDP: testing data often stems from unknown or shifting domain distributions, under which CP may no longer provide valid guarantees. Moreover, when high coverage is required, CP can yield excessively large prediction sets for unanswerable queries, reducing informativeness. To address these challenges, we introduce an adaptive rejection and non-exchangeable CP framework. Our method first estimates the distribution of questions across domains in the test set using transformer-based clustering, then reweights or resamples the calibration data accordingly. Building on this, adaptive rejection CP allows the LLM to selectively abstain from answering when its confidence or competence shifts significantly. Extensive experiments demonstrate that our framework enhances both the effectiveness and reliability of CP under CDP scenarios. Our code is available at: https://anonymous.4open.science/r/CPCL-8C12/
♻ ☆ GraSS: Scalable Data Attribution with Gradient Sparsification and Sparse Projection NeurIPS 2025
Gradient-based data attribution methods, such as influence functions, are critical for understanding the impact of individual training samples without requiring repeated model retraining. However, their scalability is often limited by the high computational and memory costs associated with per-sample gradient computation. In this work, we propose GraSS, a novel gradient compression algorithm and its variants FactGraSS for linear layers specifically, that explicitly leverage the inherent sparsity of per-sample gradients to achieve sub-linear space and time complexity. Extensive experiments demonstrate the effectiveness of our approach, achieving substantial speedups while preserving data influence fidelity. In particular, FactGraSS achieves up to 165% faster throughput on billion-scale models compared to the previous state-of-the-art baselines. Our code is publicly available at https://github.com/TRAIS-Lab/GraSS.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ Online (Non-)Convex Learning via Tempered Optimism
Optimistic Online Learning aims to exploit experts conveying reliable information to predict the future. However, such implicit optimism may be challenged when it comes to practical crafting of such experts. A fundamental example consists in approximating a minimiser of the current problem and use it as expert. In the context of dynamic environments, such an expert only conveys partially relevant information as it may lead to overfitting. To tackle this issue, we introduce in this work the \emph{optimistically tempered} (OT) online learning framework designed to handle such imperfect experts. As a first contribution, we show that tempered optimism is a fruitful paradigm for Online Non-Convex Learning by proposing simple, yet powerful modification of Online Gradient and Mirror Descent. Second, we derive a second OT algorithm for convex losses and third, evaluate the practical efficiency of tempered optimism on real-life datasets and a toy experiment.
♻ ☆ Towards Real Unsupervised Anomaly Detection Via Confident Meta-Learning ICCV2025
So-called unsupervised anomaly detection is better described as semi-supervised, as it assumes all training data are nominal. This assumption simplifies training but requires manual data curation, introducing bias and limiting adaptability. We propose Confident Meta-learning (CoMet), a novel training strategy that enables deep anomaly detection models to learn from uncurated datasets where nominal and anomalous samples coexist, eliminating the need for explicit filtering. Our approach integrates Soft Confident Learning, which assigns lower weights to low-confidence samples, and Meta-Learning, which stabilizes training by regularizing updates based on training validation loss covariance. This prevents overfitting and enhances robustness to noisy data. CoMet is model-agnostic and can be applied to any anomaly detection method trainable via gradient descent. Experiments on MVTec-AD, VIADUCT, and KSDD2 with two state-of-the-art models demonstrate the effectiveness of our approach, consistently improving over the baseline methods, remaining insensitive to anomalies in the training set, and setting a new state-of-the-art across all datasets. Code is available at https://github.com/aqeeelmirza/CoMet
comment: Accepted to IEEE/CVF International Conference on Computer Vision (ICCV2025)
♻ ☆ $β$-DQN: Improving Deep Q-Learning By Evolving the Behavior
While many sophisticated exploration methods have been proposed, their lack of generality and high computational cost often lead researchers to favor simpler methods like $\epsilon$-greedy. Motivated by this, we introduce $\beta$-DQN, a simple and efficient exploration method that augments the standard DQN with a behavior function $\beta$. This function estimates the probability that each action has been taken at each state. By leveraging $\beta$, we generate a population of diverse policies that balance exploration between state-action coverage and overestimation bias correction. An adaptive meta-controller is designed to select an effective policy for each episode, enabling flexible and explainable exploration. $\beta$-DQN is straightforward to implement and adds minimal computational overhead to the standard DQN. Experiments on both simple and challenging exploration domains show that $\beta$-DQN outperforms existing baseline methods across a wide range of tasks, providing an effective solution for improving exploration in deep reinforcement learning.
comment: aamas 2025
♻ ☆ Uni-LoRA: One Vector is All You Need NeurIPS 2025
Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient fine-tuning (PEFT) method for large language models (LLMs) by constraining weight updates to low-rank matrices. Recent works such as Tied-LoRA, VeRA, and VB-LoRA push efficiency further by introducing additional constraints to reduce the trainable parameter space. In this paper, we show that the parameter space reduction strategies employed by these LoRA variants can be formulated within a unified framework, Uni-LoRA, where the LoRA parameter space, flattened as a high-dimensional vector space $R^D$, can be reconstructed through a projection from a subspace R^d, with $d \ll D$. We demonstrate that the fundamental difference among various LoRA methods lies in the choice of the projection matrix, $P \in R^{D \times d}$.Most existing LoRA variants rely on layer-wise or structure-specific projections that limit cross-layer parameter sharing, thereby compromising parameter efficiency. In light of this, we introduce an efficient and theoretically grounded projection matrix that is isometric, enabling global parameter sharing and reducing computation overhead. Furthermore, under the unified view of Uni-LoRA, this design requires only a single trainable vector to reconstruct LoRA parameters for the entire LLM - making Uni-LoRA both a unified framework and a "one-vector-only" solution. Extensive experiments on GLUE, mathematical reasoning, and instruction tuning benchmarks demonstrate that Uni-LoRA achieves state-of-the-art parameter efficiency while outperforming or matching prior approaches in predictive performance. Our code is available at https://github.com/KaiyangLi1992/Uni-LoRA.
comment: NeurIPS 2025 Spotlight
♻ ☆ Group-in-Group Policy Optimization for LLM Agent Training NeurIPS 2025
Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to multi-turn LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) At the episode-level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step-level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on challenging agent benchmarks, including ALFWorld and WebShop, as well as tool-integrated reasoning on search-augmented QA tasks, using Qwen2.5-1.5B/3B/7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals, achieves performance gains of > 12% on ALFWorld and > 9% on WebShop over GRPO, and obtains superior performance on QA tasks (42.1% on 3B and 47.2% on 7B): all while maintaining the same GPU memory overhead, identical LLM rollout, and incurring little to no additional time cost.
comment: NeurIPS 2025
♻ ☆ TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising NeurIPS 2025
Dark matter makes up approximately 85% of total matter in our universe, yet it has never been directly observed in any laboratory on Earth. The origin of dark matter is one of the most important questions in contemporary physics, and a convincing detection of dark matter would be a Nobel-Prize-level breakthrough in fundamental science. The ABRACADABRA experiment was specifically designed to search for dark matter. Although it has not yet made a discovery, ABRACADABRA has produced several dark matter search results widely endorsed by the physics community. The experiment generates ultra-long time-series data at a rate of 10 million samples per second, where the dark matter signal would manifest itself as a sinusoidal oscillation mode within the ultra-long time series. In this paper, we present the TIDMAD -- a comprehensive data release from the ABRACADABRA experiment including three key components: an ultra-long time series dataset divided into training, validation, and science subsets; a carefully-designed denoising score for direct model benchmarking; and a complete analysis framework which produces a community-standard dark matter search result suitable for publication as a physics paper. This data release enables core AI algorithms to extract the dark matter signal and produce real physics results thereby advancing fundamental science. The data downloading and associated analysis scripts are available at https://github.com/jessicafry/TIDMAD
comment: Accepted by NeurIPS 2025 (Spotlight)
♻ ☆ Mirror Descent and Novel Exponentiated Gradient Algorithms Using Trace-Form Entropies and Deformed Logarithms
This paper introduces a broad class of Mirror Descent (MD) and Generalized Exponentiated Gradient (GEG) algorithms derived from trace-form entropies defined via deformed logarithms. Leveraging these generalized entropies yields MD \& GEG algorithms with improved convergence behavior, robustness to vanishing and exploding gradients, and inherent adaptability to non-Euclidean geometries through mirror maps. We establish deep connections between these methods and Amari's natural gradient, revealing a unified geometric foundation for additive, multiplicative, and natural gradient updates. Focusing on the Tsallis, Kaniadakis, Sharma--Taneja--Mittal, and Kaniadakis--Lissia--Scarfone entropy families, we show that each entropy induces a distinct Riemannian metric on the parameter space, leading to GEG algorithms that preserve the natural statistical geometry. The tunable parameters of deformed logarithms enable adaptive geometric selection, providing enhanced robustness and convergence over classical Euclidean optimization. Overall, our framework unifies key first-order MD optimization methods under a single information-geometric perspective based on generalized Bregman divergences, where the choice of entropy determines the underlying metric and dual geometric structure.
comment: 22 pages, 9 figures
♻ ☆ Multimodal Dreaming: A Global Workspace Approach to World Model-Based Reinforcement Learning
Humans leverage rich internal models of the world to reason about the future, imagine counterfactuals, and adapt flexibly to new situations. In Reinforcement Learning (RL), world models aim to capture how the environment evolves in response to the agent's actions, facilitating planning and generalization. However, typical world models directly operate on the environment variables (e.g. pixels, physical attributes), which can make their training slow and cumbersome; instead, it may be advantageous to rely on high-level latent dimensions that capture relevant multimodal variables. Global Workspace (GW) Theory offers a cognitive framework for multimodal integration and information broadcasting in the brain, and recent studies have begun to introduce efficient deep learning implementations of GW. Here, we evaluate the capabilities of an RL system combining GW with a world model. We compare our GW-Dreamer with various versions of the standard PPO and the original Dreamer algorithms. We show that performing the dreaming process (i.e., mental simulation) inside the GW latent space allows for training with fewer environment steps. As an additional emergent property, the resulting model (but not its comparison baselines) displays strong robustness to the absence of one of its observation modalities (images or simulation attributes). We conclude that the combination of GW with World Models holds great potential for improving decision-making in RL agents.
comment: Under review
♻ ☆ Long-Term Mapping of the Douro River Plume with Multi-Agent Reinforcement Learning
We study the problem of long-term (multiple days) mapping of a river plume using multiple autonomous underwater vehicles (AUVs), focusing on the Douro river representative use-case. We propose an energy - and communication - efficient multi-agent reinforcement learning approach in which a central coordinator intermittently communicates with the AUVs, collecting measurements and issuing commands. Our approach integrates spatiotemporal Gaussian process regression (GPR) with a multi-head Q-network controller that regulates direction and speed for each AUV. Simulations using the Delft3D ocean model demonstrate that our method consistently outperforms both single- and multi-agent benchmarks, with scaling the number of agents both improving mean squared error (MSE) and operational endurance. In some instances, our algorithm demonstrates that doubling the number of AUVs can more than double endurance while maintaining or improving accuracy, underscoring the benefits of multi-agent coordination. Our learned policies generalize across unseen seasonal regimes over different months and years, demonstrating promise for future developments of data-driven long-term monitoring of dynamic plume environments.
♻ ☆ Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
BERT and its variants are extensively explored for automated scoring. However, a limit of 512 tokens for these encoder-based models showed the deficiency in automated scoring of long essays. Thus, this research explores generative language models for automated scoring of long essays via summarization and prompting. The results revealed great improvement of scoring accuracy with QWK increased from 0.822 to 0.8878 for the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
comment: 19 pages, 5 Tables 7 Figures, Presentation at Artificial Intelligence in Measurement and Education Conference (AIME-Con)
♻ ☆ AutoJudge: Judge Decoding Without Manual Annotation
We introduce AutoJudge, a method that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify which of the generated tokens affect the downstream quality of the response, relaxing the distribution match guarantee so that the "unimportant" tokens can be generated faster. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft models should be corrected to preserve quality and which ones may be skipped. We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without compromising the final answer quality. We evaluate the effectiveness of AutoJudge with multiple draft/target model pairs on mathematical reasoning and programming benchmarks, achieving significant speedups at the cost of a minor accuracy reduction. Notably, on GSM8k with the Llama 3.1 70B target model, our approach achieves up to $\approx2\times$ speedup over speculative decoding at the cost of $\le 1\%$ drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge automatically detects programming-specific important tokens, accepting $\ge 25$ tokens per speculation cycle at $2\%$ drop in Pass@1. Our approach requires no human annotation and is easy to integrate with modern LLM inference frameworks.
♻ ☆ Towards Personalized Treatment Plan: Geometrical Model-Agnostic Approach to Counterfactual Explanations
In our article, we describe a method for generating counterfactual explanations in high-dimensional spaces using four steps that involve fitting our dataset to a model, finding the decision boundary, determining constraints on the problem, and computing the closest point (counterfactual explanation) from that boundary. We propose a discretized approach where we find many discrete points on the boundary and then identify the closest feasible counterfactual explanation. This method, which we later call $\textit{Segmented Sampling for Boundary Approximation}$ (SSBA), applies binary search to find decision boundary points and then searches for the closest boundary point. Across four datasets of varying dimensionality, we show that our method can outperform current methods for counterfactual generation with reductions in distance between $5\%$ to $50\%$ in terms of the $L_2$ norm. Our method can also handle real-world constraints by restricting changes to immutable and categorical features, such as age, gender, sex, height, and other related characteristics such as the case for a health-based dataset. In terms of runtime, the SSBA algorithm generates decision boundary points on multiple orders of magnitude in the same given time when we compare to a grid-based approach. In general, our method provides a simple and effective model-agnostic method that can compute nearest feasible (i.e. realistic with constraints) counterfactual explanations. All of our results and code are available at: https://github.com/dsin85691/SSBA_For_Counterfactuals
comment: This paper is 15 pages long consisting of multiple sections including an abstract, introduction, related works, methodology, results, ablation studies, conclusion, future works, and an appendix section. There are 10 figures and 5 tables in total
♻ ☆ Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions auto-regressively in a fixed left-to-right order or attach separate MLP or diffusion heads outside the backbone, leading to fragmented information pathways and specialized training requirements that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a unified-transformer policy that models discretized action chunks with discrete diffusion. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary re-masking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pre-trained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves 96.3% avg. success rates on LIBERO, 71.2% visual matching on SimplerEnv-Fractal and 54.2% overall on SimplerEnv-Bridge, improving over autoregressive, MLP decoder and continuous diffusion baselines. These findings indicate that discrete-diffusion VLA supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets. Our project page is https://github.com/Liang-ZX/DiscreteDiffusionVLA
comment: 16 pages
♻ ☆ The Importance of Being Discrete: Measuring the Impact of Discretization in End-to-End Differentially Private Synthetic Data
Differentially Private (DP) generative marginal models are often used in the wild to release synthetic tabular datasets in lieu of sensitive data while providing formal privacy guarantees. These models approximate low-dimensional marginals or query workloads; crucially, they require the training data to be pre-discretized, i.e., continuous values need to first be partitioned into bins. However, as the range of values (or their domain) is often inferred directly from the training data, with the number of bins and bin edges typically defined arbitrarily, this approach can ultimately break end-to-end DP guarantees and may not always yield optimal utility. In this paper, we present an extensive measurement study of four discretization strategies in the context of DP marginal generative models. More precisely, we design DP versions of three discretizers (uniform, quantile, and k-means) and reimplement the PrivTree algorithm. We find that optimizing both the choice of discretizer and bin count can improve utility, on average, by almost 30% across six DP marginal models, compared to the default strategy and number of bins, with PrivTree being the best-performing discretizer in the majority of cases. We demonstrate that, while DP generative models with non-private discretization remain vulnerable to membership inference attacks, applying DP during discretization effectively mitigates this risk. Finally, we improve on an existing approach for automatically selecting the optimal number of bins, and achieve high utility while reducing both privacy budget consumption and computational overhead.
♻ ☆ The Formalism-Implementation Gap in Reinforcement Learning Research
The last decade has seen an upswing in interest and adoption of reinforcement learning (RL) techniques, in large part due to its demonstrated capabilities at performing certain tasks at "super-human levels". This has incentivized the community to prioritize research that demonstrates RL agent performance, often at the expense of research aimed at understanding their learning dynamics. Performance-focused research runs the risk of overfitting on academic benchmarks -- thereby rendering them less useful -- which can make it difficult to transfer proposed techniques to novel problems. Further, it implicitly diminishes work that does not push the performance-frontier, but aims at improving our understanding of these techniques. This paper argues two points: (i) RL research should stop focusing solely on demonstrating agent capabilities, and focus more on advancing the science and understanding of reinforcement learning; and (ii) we need to be more precise on how our benchmarks map to the underlying mathematical formalisms. We use the popular Arcade Learning Environment (ALE; Bellemare et al., 2013) as an example of a benchmark that, despite being increasingly considered "saturated", can be effectively used for developing this understanding, and facilitating the deployment of RL techniques in impactful real-world problems.
♻ ☆ Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training
Diffusion models have achieved remarkable success across a wide range of generative tasks. A key challenge is understanding the mechanisms that prevent their memorization of training data and allow generalization. In this work, we investigate the role of the training dynamics in the transition from generalization to memorization. Through extensive experiments and theoretical analysis, we identify two distinct timescales: an early time $\tau_\mathrm{gen}$ at which models begin to generate high-quality samples, and a later time $\tau_\mathrm{mem}$ beyond which memorization emerges. Crucially, we find that $\tau_\mathrm{mem}$ increases linearly with the training set size $n$, while $\tau_\mathrm{gen}$ remains constant. This creates a growing window of training times with $n$ where models generalize effectively, despite showing strong memorization if training continues beyond it. It is only when $n$ becomes larger than a model-dependent threshold that overfitting disappears at infinite training times. These findings reveal a form of implicit dynamical regularization in the training dynamics, which allow to avoid memorization even in highly overparameterized settings. Our results are supported by numerical experiments with standard U-Net architectures on realistic and synthetic datasets, and by a theoretical analysis using a tractable random features model studied in the high-dimensional limit.
comment: Accepted as an oral at Neurips 2025. 40 pages, 15 figures
♻ ☆ RWKV-edge: Deeply Compressed RWKV for Resource-Constrained Devices
To deploy LLMs on resource-contained platforms such as mobile robots and smartphones, non-transformers LLMs have achieved major breakthroughs. Recently, a novel RNN-based LLM family, Repentance Weighted Key Value (RWKV) has shown strong computational efficiency; nevertheless, RWKV models still have high parameter counts which limited their deployment. In this paper, we propose a suite of compression techniques, ranging from model architecture optimizations to post-training compression, tailored to the RWKV architecture. Combined, our techniques reduce the memory footprint of RWKV models by 3.4x -- 5x with only negligible degradation in accuracy; compared to transformer LLMs with similar accuracy, our models require 4x less memory footprint.
♻ ☆ Data Fusion of Deep Learned Molecular Embeddings for Property Prediction
Data-driven approaches such as deep learning can result in predictive models for material properties with exceptional accuracy and efficiency. However, in many applications, data is sparse, severely limiting their accuracy and applicability. To improve predictions, techniques such as transfer learning and multitask learning have been used. The performance of multitask learning models depends on the strength of the underlying correlations between tasks and the completeness of the data set. Standard multitask models tend to underperform when trained on sparse data sets with weakly correlated properties. To address this gap, we fuse deep-learned embeddings generated by independent pretrained single-task models, resulting in a multitask model that inherits rich, property-specific representations. By reusing (rather than retraining) these embeddings, the resulting fused model outperforms standard multitask models and can be extended with fewer trainable parameters. We demonstrate this technique on a widely used benchmark data set of quantum chemistry data for small molecules as well as a newly compiled sparse data set of experimental data collected from literature and our own quantum chemistry and thermochemical calculations.
♻ ☆ UniCrossFi: A Unified Framework For Cross-Domain Wi-Fi-based Gesture Recognition
Wi-Fi sensing systems are severely hindered by cross domain problem when deployed in unseen real-world environments. Existing methods typically design separate frameworks for either domain adaptation or domain generalization, often relying on extensive labeled data. Existing methods that designed for domain generalization is often relying on extensive labeled data. However, real-world scenarios are far more complex, where the deployed model must be capable of handling generalization under limited labeled source data. To this end, we propose UniCrossFi, a unified framework designed to mitigate performance drop in CSI-based sensing across diverse deployment settings. Our framework not only extends conventional Domain Generalization (DG) to a more practical Semi-Supervised Domain Generalization (SSDG) setting, where only partially labeled source data are available, but also introduces a physics-informed data augmentation strategy, Antenna Response Consistency (ARC). ARC mitigates the risk of learning superficial shortcuts by exploiting the intrinsic spatial diversity of multi-antenna systems, treating signals from different antennas as naturally augmented views of the same event. In addition, we design a Unified Contrastive Objective to prevent conventional contrastive learning from pushing apart samples from different domains that share the same class. We conduct extensive experiments on the public Widar and CSIDA datasets. The results demonstrate that UniCrossFi consistently establishes a new state-of-the-art, significantly outperforming existing methods across all unsupervised domain adaptation, DG, and SSDG benchmarks. UniCrossFi provides a principled and practical solution to the domain shift challenge, advancing the feasibility of robust, real-world Wi-Fi sensing systems that can operate effectively with limited labeled data.
♻ ☆ Telegrapher's Generative Model via Kac Flows
We break the mold in flow-based generative modeling by proposing a new model based on the damped wave equation, also known as telegrapher's equation. Similar to the diffusion equation and Brownian motion, there is a Feynman-Kac type relation between the telegrapher's equation and the stochastic Kac process in 1D. The Kac flow evolves stepwise linearly in time, so that the probability flow is Lipschitz continuous in the Wasserstein distance and, in contrast to diffusion flows, the norm of the velocity is globally bounded. Furthermore, the Kac model has the diffusion model as its asymptotic limit. We extend these considerations to a multi-dimensional stochastic process which consists of independent 1D Kac processes in each spatial component. We show that this process gives rise to an absolutely continuous curve in the Wasserstein space and compute the conditional velocity field starting in a Dirac point analytically. Using the framework of flow matching, we train a neural network that approximates the velocity field and use it for sample generation. Our numerical experiments demonstrate the scalability of our approach, and show its advantages over diffusion models.
comment: Update V2: We added CIFAR experiments. Update V3: The old FID scores & CIFAR images of the Kac model corresponded to the schedule g(t) = t. We now updated them with both schedules t and t^2. Update V4: We corrected a minor implementation error and updated the CIFAR images/table
♻ ☆ Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm
IIn this paper we propose and investigate a new class of Generalized Exponentiated Gradient (GEG) algorithms using Mirror Descent (MD) updates, and applying the Bregman divergence with a two--parameter deformation of the logarithm as a link function. This link function (referred here to as the Euler logarithm) is associated with a relatively wide class of trace--form entropies. In order to derive novel GEG/MD updates, we estimate a deformed exponential function, which closely approximates the inverse of the Euler two--parameter deformed logarithm. The characteristic shape and properties of the Euler logarithm and its inverse--deformed exponential functions, are tuned by two hyperparameters. By learning these hyperparameters, we can adapt to the distribution of training data and adjust them to achieve desired properties of gradient descent algorithms. In the literature, there exist nowadays more than fifty mathematically well-established entropic functionals and associated deformed logarithms, so it is impossible to investigate all of them in one research paper. Therefore, we focus here on a class of trace-form entropies and the associated deformed two--parameters logarithms.
comment: 10 pages, preprint of Journal paper
♻ ☆ Linear regression with overparameterized linear neural networks: Tight upper and lower bounds for implicit $\ell^1$-regularization
Modern machine learning models are often trained in a setting where the number of parameters exceeds the number of training samples. To understand the implicit bias of gradient descent in such overparameterized models, prior work has studied diagonal linear neural networks in the regression setting. These studies have shown that, when initialized with small weights, gradient descent tends to favor solutions with minimal $\ell^1$-norm - an effect known as implicit regularization. In this paper, we investigate implicit regularization in diagonal linear neural networks of depth $D\ge 2$ for overparameterized linear regression problems. We focus on analyzing the approximation error between the limit point of gradient flow trajectories and the solution to the $\ell^1$-minimization problem. By deriving tight upper and lower bounds on the approximation error, we precisely characterize how the approximation error depends on the scale of initialization $\alpha$. Our results reveal a qualitative difference between depths: for $D \ge 3$, the error decreases linearly with $\alpha$, whereas for $D=2$, it decreases at rate $\alpha^{1-\varrho}$, where the parameter $\varrho \in [0,1)$ can be explicitly characterized. Interestingly, this parameter is closely linked to so-called null space property constants studied in the sparse recovery literature. We demonstrate the asymptotic tightness of our bounds through explicit examples. Numerical experiments corroborate our theoretical findings and suggest that deeper networks, i.e., $D \ge 3$, may lead to better generalization, particularly for realistic initialization scales.
♻ ☆ Assessing the robustness of heterogeneous treatment effects in survival analysis under informative censoring
Dropout is common in clinical studies, with up to half of patients leaving early due to side effects or other reasons. When dropout is informative (i.e., dependent on survival time), it introduces censoring bias, because of which treatment effect estimates are also biased. In this paper, we propose an assumption-lean framework to assess the robustness of conditional average treatment effect (CATE) estimates in survival analysis when facing censoring bias. Unlike existing works that rely on strong assumptions, such as non-informative censoring, to obtain point estimation, we use partial identification to derive informative bounds on the CATE. Thereby, our framework helps to identify patient subgroups where treatment is effective despite informative censoring. We further develop a novel meta-learner that estimates the bounds using arbitrary machine learning models and with favorable theoretical properties, including double robustness and quasi-oracle efficiency. We demonstrate the practical value of our meta-learner through numerical experiments and in an application to a cancer drug trial. Together, our framework offers a practical tool for assessing the robustness of estimated treatment effects in the presence of censoring and thus promotes the reliable use of survival data for evidence generation in medicine and epidemiology.
♻ ☆ Diffusion Models Meet Contextual Bandits
Efficient online decision-making in contextual bandits is challenging, as methods without informative priors often suffer from computational or statistical inefficiencies. In this work, we leverage pre-trained diffusion models as expressive priors to capture complex action dependencies and develop a practical algorithm that efficiently approximates posteriors under such priors, enabling both fast updates and sampling. Empirical results demonstrate the effectiveness and versatility of our approach across diverse contextual bandit settings.
comment: Neurips 2025
♻ ☆ Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning NeurIPS 2025
Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities in classification and retrieval. However, these models often struggle with compositional reasoning - the ability to understand the relationships between concepts. A recent benchmark, SugarCrepe++, reveals that previous works on improving compositionality have mainly improved lexical sensitivity but neglected semantic understanding. In addition, downstream retrieval performance often deteriorates, although one would expect that improving compositionality should enhance retrieval. In this work, we introduce CLIC (Compositionally-aware Learning in CLIP), a fine-tuning method based on a novel training technique combining multiple images and their associated captions. CLIC improves compositionality across architectures as well as differently pre-trained CLIP models, both in terms of lexical and semantic understanding, and achieves consistent gains in retrieval performance. This even applies to the recent CLIPS, which achieves SOTA retrieval performance. Nevertheless, the short fine-tuning with CLIC leads to an improvement in retrieval and to the best compositional CLIP model on SugarCrepe++. All our models and code are available at https://clic-compositional-clip.github.io
comment: Accepted at NeurIPS 2025
♻ ☆ Provable Scaling Laws for the Test-Time Compute of Large Language Models NeurIPS 2025
We propose two simple, principled and practical algorithms that enjoy provable scaling laws for the test-time compute of large language models (LLMs). The first one is a two-stage knockout-style algorithm: given an input problem, it first generates multiple candidate solutions, and then aggregate them via a knockout tournament for the final output. Assuming that the LLM can generate a correct solution with non-zero probability and do better than a random guess in comparing a pair of correct and incorrect solutions, we prove theoretically that the failure probability of this algorithm decays to zero exponentially or by a power law (depending on the specific way of scaling) as its test-time compute grows. The second one is a two-stage league-style algorithm, where each candidate is evaluated by its average win rate against multiple opponents, rather than eliminated upon loss to a single opponent. Under analogous but more robust assumptions, we prove that its failure probability also decays to zero exponentially with more test-time compute. Both algorithms require a black-box LLM and nothing else (e.g., no verifier or reward model) for a minimalistic implementation, which makes them appealing for practical applications and easy to adapt for different tasks. Through extensive experiments with diverse models and datasets, we validate the proposed theories and demonstrate the outstanding scaling properties of both algorithms.
comment: NeurIPS 2025 camera-ready version
♻ ☆ Robust Point Cloud Reinforcement Learning via PCA-Based Canonicalization
Reinforcement Learning (RL) from raw visual input has achieved impressive successes in recent years, yet it remains fragile to out-of-distribution variations such as changes in lighting, color, and viewpoint. Point Cloud Reinforcement Learning (PC-RL) offers a promising alternative by mitigating appearance-based brittleness, but its sensitivity to camera pose mismatches continues to undermine reliability in realistic settings. To address this challenge, we propose PCA Point Cloud (PPC), a canonicalization framework specifically tailored for downstream robotic control. PPC maps point clouds under arbitrary rigid-body transformations to a unique canonical pose, aligning observations to a consistent frame, thereby substantially decreasing viewpoint-induced inconsistencies. In our experiments, we show that PPC improves robustness to unseen camera poses across challenging robotic tasks, providing a principled alternative to domain randomization.
♻ ☆ Offline Learning and Forgetting for Reasoning with Large Language Models
Leveraging inference-time search in large language models has proven effective in further enhancing a trained model's capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it on unpaired successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. A key challenge we identify is that naive fine-tuning can degrade the model's search capability; we show this can be mitigated with a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown arithmetic puzzles show that, replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180$\times$. On top of this, our learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.
comment: Published in Transactions on Machine Learning Research (TMLR), 2025. Code: https://github.com/twni2016/llm-reasoning-uft
♻ ☆ FraudTransformer: Time-Aware GPT for Transaction Fraud Detection
Detecting payment fraud in real-world banking streams requires models that can exploit both the order of events and the irregular time gaps between them. We introduce FraudTransformer, a sequence model that augments a vanilla GPT-style architecture with (i) a dedicated time encoder that embeds either absolute timestamps or inter-event values, and (ii) a learned positional encoder that preserves relative order. Experiments on a large industrial dataset -- tens of millions of transactions and auxiliary events -- show that FraudTransformer surpasses four strong classical baselines (Logistic Regression, XGBoost and LightGBM) as well as transformer ablations that omit either the time or positional component. On the held-out test set it delivers the highest AUROC and PRAUC.
comment: Accepted in AI-FIND ICAIF'25 (https://sites.google.com/view/icaif-fraud-detection-workshop/home)
♻ ☆ ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression NeurIPS 2025
The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
comment: Accepted at NeurIPS 2025 (oral)
♻ ☆ Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning
We introduce a simple, yet novel entropy-based framework to drive token efficiency in large language models during reasoning tasks. Our approach uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, achieving 25-50% computational savings while maintaining task accuracy. Crucially, we demonstrate that entropy-based confidence calibration represents an emergent property of advanced post-training optimization present in modern reasoning models but notably absent in standard instruction-tuned and pre-trained models (Llama 3.3 70B). We show that the entropy threshold to stop reasoning varies from model to model but can be calculated easily in one shot using only a few examples from existing reasoning datasets. Our results indicate that advanced reasoning models often know that they've gotten a correct answer early on, and that this emergent confidence awareness can be exploited to save tokens and reduce latency. The framework demonstrates consistent performance across reasoning-optimized model families with 25-50% computational cost reduction while preserving accuracy, revealing that confidence mechanisms represent a distinguishing characteristic of modern post-trained reasoning systems versus their predecessors.
♻ ☆ LittleBit: Ultra Low-Bit Quantization via Latent Factorization NeurIPS 2025
Deploying large language models (LLMs) often faces challenges from substantial memory and computational costs. Quantization offers a solution, yet performance degradation in the sub-1-bit regime remains particularly difficult. This paper introduces LittleBit, a novel method for extreme LLM compression. It targets levels like 0.1 bits per weight (BPW), achieving nearly 31$\times$ memory reduction, e.g., Llama2-13B to under 0.9 GB. LittleBit represents weights in a low-rank form using latent matrix factorization, subsequently binarizing these factors. To counteract information loss from this extreme precision, it integrates a multi-scale compensation mechanism. This includes row, column, and an additional latent dimension that learns per-rank importance. Two key contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and integrated Residual Compensation to mitigate errors. Extensive experiments confirm LittleBit's superiority in sub-1-bit quantization: e.g., its 0.1 BPW performance on Llama2-7B surpasses the leading method's 0.7 BPW. LittleBit establishes a new, viable size-performance trade-off--unlocking a potential 11.6$\times$ speedup over FP16 at the kernel level--and makes powerful LLMs practical for resource-constrained environments.
comment: Accepted to NeurIPS 2025. Banseok Lee and Dongkyu Kim contributed equally
♻ ☆ Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation NeurIPS 2025
Recent progress in Sign Language Translation (SLT) has focussed primarily on improving the representational capacity of large language models to incorporate Sign Language features. This work explores an alternative direction: enhancing the geometric properties of skeletal representations themselves. We propose Geo-Sign, a method that leverages the properties of hyperbolic geometry to model the hierarchical structure inherent in sign language kinematics. By projecting skeletal features derived from Spatio-Temporal Graph Convolutional Networks (ST-GCNs) into the Poincar\'e ball model, we aim to create more discriminative embeddings, particularly for fine-grained motions like finger articulations. We introduce a hyperbolic projection layer, a weighted Fr\'echet mean aggregation scheme, and a geometric contrastive loss operating directly in hyperbolic space. These components are integrated into an end-to-end translation framework as a regularisation function, to enhance the representations within the language model. This work demonstrates the potential of hyperbolic geometry to improve skeletal representations for Sign Language Translation, improving on SOTA RGB methods while preserving privacy and improving computational efficiency. Code available here: https://github.com/ed-fish/geo-sign.
comment: Accepted to NeurIPS 2025
♻ ☆ Trade-offs in Data Memorization via Strong Data Processing Inequalities COLT 2025
Recent research demonstrated that training large language models involves memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sensitive user data and thus motivates the study of data memorization's role in learning. In this work, we develop a general approach for proving lower bounds on excess data memorization, that relies on a new connection between strong data processing inequalities and data memorization. We then demonstrate that several simple and natural binary classification problems exhibit a trade-off between the number of samples available to a learning algorithm, and the amount of information about the training data that a learning algorithm needs to memorize to be accurate. In particular, $\Omega(d)$ bits of information about the training data need to be memorized when $O(1)$ $d$-dimensional examples are available, which then decays as the number of examples grows at a problem-specific rate. Further, our lower bounds are generally matched (up to logarithmic factors) by simple learning algorithms. We also extend our lower bounds to more general mixture-of-clusters models. Our definitions and results build on the work of Brown et al. (2021) and address several limitations of the lower bounds in their work.
comment: Appeared in COLT 2025; this revision includes an improved upper bound in Theorem 3.1, as well as several minor clarifications and modifications
♻ ☆ Acoustic and Machine Learning Methods for Speech-Based Suicide Risk Assessment: A Systematic Review
Suicide remains a public health challenge, necessitating improved detection methods to facilitate timely intervention and treatment. This systematic review evaluates the role of Artificial Intelligence (AI) and Machine Learning (ML) in assessing suicide risk through acoustic analysis of speech. Following PRISMA guidelines, we analyzed 33 articles selected from PubMed, Cochrane, Scopus, and Web of Science databases. The last search was conducted in February 2025. Risk of bias was assessed using the PROBAST tool. Studies analyzing acoustic features between individuals at risk of suicide (RS) and those not at risk (NRS) were included, while studies lacking acoustic data, a suicide-related focus, or sufficient methodological details were excluded. Sample sizes varied widely and were reported in terms of participants or speech segments, depending on the study. Results were synthesized narratively based on acoustic features and classifier performance. Findings consistently showed significant acoustic feature variations between RS and NRS populations, particularly involving jitter, fundamental frequency (F0), Mel-frequency cepstral coefficients (MFCC), and power spectral density (PSD). Classifier performance varied based on algorithms, modalities, and speech elicitation methods, with multimodal approaches integrating acoustic, linguistic, and metadata features demonstrating superior performance. Among the 29 classifier-based studies, reported AUC values ranged from 0.62 to 0.985 and accuracies from 60% to 99.85%. Most datasets were imbalanced in favor of NRS, and performance metrics were rarely reported separately by group, limiting clear identification of direction of effect.
comment: Preprint version of a manuscript submitted to the Journal of Affective Disorders
♻ ☆ MH-GIN: Multi-scale Heterogeneous Graph-based Imputation Network for AIS Data (Extended Version) VLDB 2026
Location-tracking data from the Automatic Identification System, much of which is publicly available, plays a key role in a range of maritime safety and monitoring applications. However, the data suffers from missing values that hamper downstream applications. Imputing the missing values is challenging because the values of different heterogeneous attributes are updated at diverse rates, resulting in the occurrence of multi-scale dependencies among attributes. Existing imputation methods that assume similar update rates across attributes are unable to capture and exploit such dependencies, limiting their imputation accuracy. We propose MH-GIN, a Multi-scale Heterogeneous Graph-based Imputation Network that aims improve imputation accuracy by capturing multi-scale dependencies. Specifically, MH-GIN first extracts multi-scale temporal features for each attribute while preserving their intrinsic heterogeneous characteristics. Then, it constructs a multi-scale heterogeneous graph to explicitly model dependencies between heterogeneous attributes to enable more accurate imputation of missing values through graph propagation. Experimental results on two real-world datasets find that MH-GIN is capable of an average 57% reduction in imputation errors compared to state-of-the-art methods, while maintaining computational efficiency. The source code and implementation details of MH-GIN are publicly available https://github.com/hyLiu1994/MH-GIN.
comment: 18 pages, 4 figures; This paper is accepted by PVLDB 2026
♻ ☆ Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach
While several high profile video games have served as testbeds for Deep Reinforcement Learning (DRL), this technique has rarely been employed by the game industry for crafting authentic AI behaviors. Previous research focuses on training super-human agents with large models, which is impractical for game studios with limited resources aiming for human-like agents. This paper proposes a sample-efficient DRL method tailored for training and fine-tuning agents in industrial settings such as the video game industry. Our method improves sample efficiency of value-based DRL by leveraging pre-collected data and increasing network plasticity. We evaluate our method training a goalkeeper agent in EA SPORTS FC 25, one of the best-selling football simulations today. Our agent outperforms the game's built-in AI by 10% in ball saving rate. Ablation studies show that our method trains agents 50% faster compared to standard DRL methods. Finally, qualitative evaluation from domain experts indicates that our approach creates more human-like gameplay compared to hand-crafted agents. As a testimony of the impact of the approach, the method is intended to replace the hand-crafted counterpart in next iterations of the series.
♻ ☆ The Logical Expressiveness of Temporal GNNs via Two-Dimensional Product Logics
In recent years, the expressive power of various neural architectures -- including graph neural networks (GNNs), transformers, and recurrent neural networks -- has been characterised using tools from logic and formal language theory. As the capabilities of basic architectures are becoming well understood, increasing attention is turning to models that combine multiple architectural paradigms. Among them particularly important, and challenging to analyse, are temporal extensions of GNNs, which integrate both spatial (graph-structure) and temporal (evolution over time) dimensions. In this paper, we initiate the study of logical characterisation of temporal GNNs by connecting them to two-dimensional product logics. We show that the expressive power of temporal GNNs depends on how graph and temporal components are combined. In particular, temporal GNNs that apply static GNNs recursively over time can capture all properties definable in the product logic of (past) propositional temporal logic PTL and the modal logic K. In contrast, architectures such as graph-and-time TGNNs and global TGNNs can only express restricted fragments of this logic, where the interaction between temporal and spatial operators is syntactically constrained. These provide us with the first results on the logical expressiveness of temporal GNNs.
♻ ☆ MixAT: Combining Continuous and Discrete Adversarial Training for LLMs NeurIPS 2025
Despite recent efforts in Large Language Model (LLM) safety and alignment, current adversarial attacks on frontier LLMs can still consistently force harmful generations. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. At the same time, despite their effectiveness and generalization capabilities, training with continuous perturbations does not always capture the full spectrum of vulnerabilities exploited by discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at https://github.com/insait-institute/MixAT.
comment: Published at 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ SeeDNorm: Self-Rescaled Dynamic Normalization
Normalization layer constitutes an essential component in neural networks. In transformers, the predominantly used RMSNorm constrains vectors to a unit hypersphere, followed by dimension-wise rescaling through a learnable scaling coefficient $\gamma$ to maintain the representational capacity of the model. However, RMSNorm discards the input norm information in forward pass and a static scaling factor $\gamma$ may be insufficient to accommodate the wide variability of input data and distributional shifts, thereby limiting further performance improvements, particularly in zero-shot scenarios that large language models routinely encounter. To address this limitation, we propose SeeDNorm, which enhances the representational capability of the model by dynamically adjusting the scaling coefficient based on the current input, thereby preserving the input norm information and enabling data-dependent, self-rescaled dynamic normalization. During backpropagation, SeeDNorm retains the ability of RMSNorm to dynamically adjust gradient according to the input norm. We provide a detailed analysis of the training optimization for SeedNorm and proposed corresponding solutions to address potential instability issues that may arise when applying SeeDNorm. We validate the effectiveness of SeeDNorm across models of varying sizes in large language model pre-training as well as supervised and unsupervised computer vision tasks. By introducing a minimal number of parameters and with neglligible impact on model efficiency, SeeDNorm achieves consistently superior performance compared to previously commonly used normalization layers such as RMSNorm and LayerNorm, as well as element-wise activation alternatives to normalization layers like DyT.
comment: 31 pages, 14 figures, 18 tables
♻ ☆ DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment NeurIPS 2025
How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through the overlaying of multiple model variants quantized to different bitwidths. Meanwhile, an important question still remains open-ended: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer dynamically changes across decoding steps. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.
comment: NeurIPS 2025
♻ ☆ Minimax Optimal Transfer Learning for Kernel-based Nonparametric Regression
In recent years, transfer learning has garnered significant attention in the machine learning community. Its ability to leverage knowledge from related studies to improve generalization performance in a target study has made it highly appealing. This paper focuses on investigating the transfer learning problem within the context of nonparametric regression over a reproducing kernel Hilbert space. The aim is to bridge the gap between practical effectiveness and theoretical guarantees. We specifically consider two scenarios: one where the transferable sources are known and another where they are unknown. For the known transferable source case, we propose a two-step kernel-based estimator by solely using kernel ridge regression. For the unknown case, we develop a novel method based on an efficient aggregation algorithm, which can automatically detect and alleviate the effects of negative sources. This paper provides the statistical properties of the desired estimators and establishes the minimax optimal rate. Through extensive numerical experiments on synthetic data and real examples, we validate our theoretical findings and demonstrate the effectiveness of our proposed method.
♻ ☆ Is It Certainly a Deepfake? Reliability Analysis in Detection & Generation Ecosystem ICCV 2025
As generative models are advancing in quality and quantity for creating synthetic content, deepfakes begin to cause online mistrust. Deepfake detectors are proposed to counter this effect, however, misuse of detectors claiming fake content as real or vice versa further fuels this misinformation problem. We present the first comprehensive uncertainty analysis of deepfake detectors, systematically investigating how generative artifacts influence prediction confidence. As reflected in detectors' responses, deepfake generators also contribute to this uncertainty as their generative residues vary, so we cross the uncertainty analysis of deepfake detectors and generators. Based on our observations, the uncertainty manifold holds enough consistent information to leverage uncertainty for deepfake source detection. Our approach leverages Bayesian Neural Networks and Monte Carlo dropout to quantify both aleatoric and epistemic uncertainties across diverse detector architectures. We evaluate uncertainty on two datasets with nine generators, with four blind and two biological detectors, compare different uncertainty methods, explore region- and pixel-based uncertainty, and conduct ablation studies. We conduct and analyze binary real/fake, multi-class real/fake, source detection, and leave-one-out experiments between the generator/detector combinations to share their generalization capability, model calibration, uncertainty, and robustness against adversarial attacks. We further introduce uncertainty maps that localize prediction confidence at the pixel level, revealing distinct patterns correlated with generator-specific artifacts. Our analysis provides critical insights for deploying reliable deepfake detection systems and establishes uncertainty quantification as a fundamental requirement for trustworthy synthetic media detection.
comment: Accepted for publication at the ICCV 2025 workshop - STREAM
♻ ☆ Clustering-Based Low-Rank Matrix Approximation for Medical Image Compression
Medical images are inherently high-resolution and contain locally varying structures crucial for diagnosis. Efficient compression must preserve diagnostic fidelity while minimizing redundancy. Low-rank matrix approximation (LoRMA) techniques have shown strong potential for image compression by capturing global correlations; however, they often fail to adapt to local structural variations across regions of interest. To address this, we introduce an adaptive LoRMA, which partitions a medical image into overlapping patches, groups structurally similar patches into clusters using k-means, and performs SVD within each cluster. We derive the overall compression factor accounting for patch overlap and analyze how patch size influences compression efficiency and computational cost. While applicable to any data with high local variation, we focus on medical imaging due to its pronounced local variability. We evaluate and compare our adaptive LoRMA against global SVD across four imaging modalities: MRI, ultrasound, CT scan, and chest X-ray. Results demonstrate that adaptive LoRMA effectively preserves structural integrity, edge details, and diagnostic relevance, measured by PSNR, SSIM, MSE, IoU, and EPI. Adaptive LoRMA minimizes block artifacts and residual errors, particularly in pathological regions, consistently outperforming global SVD in PSNR, SSIM, IoU, EPI, and achieving lower MSE. It prioritizes clinically salient regions while allowing aggressive compression in non-critical regions, optimizing storage efficiency. Although adaptive LoRMA requires higher processing time, its diagnostic fidelity justifies the overhead for high-compression applications.
♻ ☆ MARS-M: When Variance Reduction Meets Matrices
Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). On the other hand, recent benchmarks on optimizers for LLM pre-training have demonstrated that variance-reduction techniques such as MARS can achieve substantial speedups over standard optimizers that do not employ variance reduction. In this paper, to achieve the best of both worlds, we introduce MARS-M, a new optimizer that integrates the variance reduction technique in MARS with Muon. Under standard regularity conditions, we prove that Muon-M converges to a first-order stationary point at a rate of $\tilde{\mathcal{O}}(T^{-1/3})$, which improves upon $\tilde{\mathcal{O}}(T^{-1/4})$ rate attained by Muon. Our empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at https://github.com/AGI-Arena/MARS/tree/main/MARS_M.
♻ ☆ Geometric Mixture Models for Electrolyte Conductivity Prediction
Accurate prediction of ionic conductivity in electrolyte systems is crucial for advancing numerous scientific and technological applications. While significant progress has been made, current research faces two fundamental challenges: (1) the lack of high-quality standardized benchmarks, and (2) inadequate modeling of geometric structure and intermolecular interactions in mixture systems. To address these limitations, we first reorganize and enhance the CALiSol and DiffMix electrolyte datasets by incorporating geometric graph representations of molecules. We then propose GeoMix, a novel geometry-aware framework that preserves Set-SE(3) equivariance-an essential but challenging property for mixture systems. At the heart of GeoMix lies the Geometric Interaction Network (GIN), an equivariant module specifically designed for intermolecular geometric message passing. Comprehensive experiments demonstrate that GeoMix consistently outperforms diverse baselines (including MLPs, GNNs, and geometric GNNs) across both datasets, validating the importance of cross-molecular geometric interactions and equivariant message passing for accurate property prediction. This work not only establishes new benchmarks for electrolyte research but also provides a general geometric learning framework that advances modeling of mixture systems in energy materials, pharmaceutical development, and beyond.
♻ ☆ Taxonomy and Trends in Reinforcement Learning for Robotics and Control Systems: A Structured Review
Reinforcement learning (RL) has become a foundational approach for enabling intelligent robotic behavior in dynamic and uncertain environments. This work presents an in-depth review of RL principles, advanced deep reinforcement learning (DRL) algorithms, and their integration into robotic and control systems. Beginning with the formalism of Markov Decision Processes (MDPs), the study outlines essential elements of the agent-environment interaction and explores core algorithmic strategies including actor-critic methods, value-based learning, and policy gradients. Emphasis is placed on modern DRL techniques such as DDPG, TD3, PPO, and SAC, which have shown promise in solving high-dimensional, continuous control tasks. A structured taxonomy is introduced to categorize RL applications across domains such as locomotion, manipulation, multi-agent coordination, and human-robot interaction, along with training methodologies and deployment readiness levels. The review synthesizes recent research efforts, highlighting technical trends, design patterns, and the growing maturity of RL in real-world robotics. Overall, this work aims to bridge theoretical advances with practical implementations, providing a consolidated perspective on the evolving role of RL in autonomous robotic systems.
♻ ☆ URB - Urban Routing Benchmark for RL-equipped Connected Autonomous Vehicles NeurIPS 2025
Connected Autonomous Vehicles (CAVs) promise to reduce congestion in future urban networks, potentially by optimizing their routing decisions. Unlike for human drivers, these decisions can be made with collective, data-driven policies, developed using machine learning algorithms. Reinforcement learning (RL) can facilitate the development of such collective routing strategies, yet standardized and realistic benchmarks are missing. To that end, we present URB: Urban Routing Benchmark for RL-equipped Connected Autonomous Vehicles. URB is a comprehensive benchmarking environment that unifies evaluation across 29 real-world traffic networks paired with realistic demand patterns. URB comes with a catalog of predefined tasks, multi-agent RL (MARL) algorithm implementations, three baseline methods, domain-specific performance metrics, and a modular configuration scheme. Our results show that, despite the lengthy and costly training, state-of-the-art MARL algorithms rarely outperformed humans. The experimental results reported in this paper initiate the first leaderboard for MARL in large-scale urban routing optimization. They reveal that current approaches struggle to scale, emphasizing the urgent need for advancements in this domain.
comment: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), Datasets and Benchmarks Track
♻ ☆ Do Language Models Use Their Depth Efficiently? NeurIPS 2025
Modern LLMs are increasingly deep, and depth correlates with performance, albeit with diminishing returns. However, do these models use their depth efficiently? Do they compose more features to create higher-order computations that are impossible in shallow models, or do they merely spread the same kinds of computation out over more layers? To address these questions, we analyze the residual stream of the Llama 3.1, Qwen 3, and OLMo 2 family of models. We find: First, comparing the output of the sublayers to the residual stream reveals that layers in the second half contribute much less than those in the first half, with a clear phase transition between the two halves. Second, skipping layers in the second half has a much smaller effect on future computations and output predictions. Third, for multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults in examples involving many hops. Fourth, we seek to directly address whether deeper models are using their additional layers to perform new kinds of computation. To do this, we train linear maps from the residual stream of a shallow model to a deeper one. We find that layers with the same relative depth map best to each other, suggesting that the larger model simply spreads the same computations out over its many layers. All this evidence suggests that deeper models are not using their depth to learn new kinds of computation, but only using the greater depth to perform more fine-grained adjustments to the residual. This may help explain why increasing scale leads to diminishing returns for stacked Transformer architectures.
comment: Accepted to NeurIPS 2025
♻ ☆ Unlearning Comparator: A Visual Analytics System for Comparative Evaluation of Machine Unlearning Methods
Machine Unlearning (MU) aims to remove target training data from a trained model so that the removed data no longer influences the model's behavior, fulfilling "right to be forgotten" obligations under data privacy laws. Yet, we observe that researchers in this rapidly emerging field face challenges in analyzing and understanding the behavior of different MU methods, especially in terms of three fundamental principles in MU: accuracy, efficiency, and privacy. Consequently, they often rely on aggregate metrics and ad-hoc evaluations, making it difficult to accurately assess the trade-offs between methods. To fill this gap, we introduce a visual analytics system, Unlearning Comparator, designed to facilitate the systematic evaluation of MU methods. Our system supports two important tasks in the evaluation process: model comparison and attack simulation. First, it allows the user to compare the behaviors of two models, such as a model generated by a certain method and a retrained baseline, at class-, instance-, and layer-levels to better understand the changes made after unlearning. Second, our system simulates membership inference attacks (MIAs) to evaluate the privacy of a method, where an attacker attempts to determine whether specific data samples were part of the original training set. We evaluate our system through a case study visually analyzing prominent MU methods and demonstrate that it helps the user not only understand model behaviors but also gain insights that can inform the improvement of MU methods. The source code is publicly available at https://github.com/gnueaj/Machine-Unlearning-Comparator.
comment: Submitted to IEEE Transactions on Visualization and Computer Graphics (TVCG), under review. 15 pages. This work has been submitted to the IEEE for possible publication
♻ ☆ Two-Stage Learning of Stabilizing Neural Controllers via Zubov Sampling and Iterative Domain Expansion NeurIPS 2025
Learning-based neural network (NN) control policies have shown impressive empirical performance. However, obtaining stability guarantees and estimates of the region of attraction of these learned neural controllers is challenging due to the lack of stable and scalable training and verification algorithms. Although previous works in this area have achieved great success, much conservatism remains in their frameworks. In this work, we propose a novel two-stage training framework to jointly synthesize a controller and a Lyapunov function for continuous-time systems. By leveraging a Zubov-inspired region of attraction characterization to directly estimate stability boundaries, we propose a novel training-data sampling strategy and a domain-updating mechanism that significantly reduces the conservatism in training. Moreover, unlike existing works on continuous-time systems that rely on an SMT solver to formally verify the Lyapunov condition, we extend state-of-the-art neural network verifier $\alpha,\!\beta$-CROWN with the capability of performing automatic bound propagation through the Jacobian of dynamical systems and a novel verification scheme that avoids expensive bisection. To demonstrate the effectiveness of our approach, we conduct numerical experiments by synthesizing and verifying controllers on several challenging nonlinear systems across multiple dimensions. We show that our training can yield region of attractions with volume $5 - 1.5\cdot 10^{5}$ times larger compared to the baselines, and our verification on continuous systems can be up to $40-10{,}000$ times faster compared to the traditional SMT solver dReal. Our code is available at https://github.com/Verified-Intelligence/Two-Stage_Neural_Controller_Training.
comment: NeurIPS 2025
♻ ☆ Pairwise Optimal Transports for Training All-to-All Flow-Based Condition Transfer Model NeurIPS 2025
In this paper, we propose a flow-based method for learning all-to-all transfer maps among conditional distributions that approximates pairwise optimal transport. The proposed method addresses the challenge of handling the case of continuous conditions, which often involve a large set of conditions with sparse empirical observations per condition. We introduce a novel cost function that enables simultaneous learning of optimal transports for all pairs of conditional distributions. Our method is supported by a theoretical guarantee that, in the limit, it converges to the pairwise optimal transports among infinite pairs of conditional distributions. The learned transport maps are subsequently used to couple data points in conditional flow matching. We demonstrate the effectiveness of this method on synthetic and benchmark datasets, as well as on chemical datasets in which continuous physical properties are defined as conditions. The code for this project can be found at https://github.com/kotatumuri-room/A2A-FM
comment: Accepted at NeurIPS 2025, 32 pages, 18 figures
♻ ☆ Scalable Exploration via Ensemble++ NeurIPS 2025
Thompson Sampling is a principled method for balancing exploration and exploitation, but its real-world adoption faces computational challenges in large-scale or non-conjugate settings. While ensemble-based approaches offer partial remedies, they typically require prohibitively large ensemble sizes. We propose Ensemble++, a scalable exploration framework using a novel shared-factor ensemble architecture with random linear combinations. For linear bandits, we provide theoretical guarantees showing that Ensemble++ achieves regret comparable to exact Thompson Sampling with only $\Theta(d \log T)$ ensemble sizes--significantly outperforming prior methods. Crucially, this efficiency holds across both compact and finite action sets with either time-invariant or time-varying contexts without configuration changes. We extend this theoretical foundation to nonlinear rewards by replacing fixed features with learnable neural representations while preserving the same incremental update principle, effectively bridging theory and practice for real-world tasks. Comprehensive experiments across linear, quadratic, neural, and GPT-based contextual bandits validate our theoretical findings and demonstrate Ensemble++'s superior regret-computation tradeoff versus state-of-the-art methods.
comment: NeurIPS 2025
♻ ☆ FragFM: Hierarchical Framework for Efficient Molecule Generation via Fragment-Level Discrete Flow Matching
We introduce FragFM, a novel hierarchical framework via fragment-level discrete flow matching for efficient molecular graph generation. FragFM generates molecules at the fragment level, leveraging a coarse-to-fine autoencoder to reconstruct details at the atom level. Together with a stochastic fragment bag strategy to effectively handle an extensive fragment space, our framework enables more efficient and scalable molecular generation. We demonstrate that our fragment-based approach achieves better property control than the atom-based method and additional flexibility through conditioning the fragment bag. We also propose a Natural Product Generation benchmark (NPGen) to evaluate modern molecular graph generative models' ability to generate natural product-like molecules. Since natural products are biologically prevalidated and differ from typical drug-like molecules, our benchmark provides a more challenging yet meaningful evaluation relevant to drug discovery. We conduct a FragFM comparative study against various models on diverse molecular generation benchmarks, including NPGen, demonstrating superior performance. The results highlight the potential of fragment-based generative modeling for large-scale, property-aware molecular design, paving the way for more efficient exploration of chemical space.
comment: 49 pages, 29 figures, under review
♻ ☆ PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning NeurIPS 2025
In recent years, unlearning techniques, which are methods for inducing a model to "forget" previously learned information, have attracted attention as a way to address privacy and copyright concerns in large language models (LLMs) and large multimodal models (LMMs). While several unlearning benchmarks have been established for LLMs, a practical evaluation framework for unlearning in LMMs has been less explored. Specifically, existing unlearning benchmark for LMMs considers only scenarios in which the model is required to unlearn fine-tuned knowledge through a single unlearning operation. In this study, we introduce PULSE protocol for realistic unlearning scenarios for LMMs by introducing two critical perspectives: (i) Pre-trained knowledge Unlearning for analyzing the effect across different knowledge acquisition phases and (ii) Long-term Sustainability Evaluation to address sequential requests. We then evaluate existing unlearning methods along these dimensions. Our results reveal that, although some techniques can successfully unlearn knowledge acquired through fine-tuning, they struggle to eliminate information learned during pre-training. Moreover, methods that effectively unlearn a batch of target data in a single operation exhibit substantial performance degradation when the same data are split and unlearned sequentially.
comment: Accepted at NeurIPS 2025 Workshop: Evaluating the Evolving LLM Lifecycle
♻ ☆ Causal Spatio-Temporal Prediction: An Effective and Efficient Multi-Modal Approach
Spatio-temporal prediction plays a crucial role in intelligent transportation, weather forecasting, and urban planning. While integrating multi-modal data has shown potential for enhancing prediction accuracy, key challenges persist: (i) inadequate fusion of multi-modal information, (ii) confounding factors that obscure causal relations, and (iii) high computational complexity of prediction models. To address these challenges, we propose E^2-CSTP, an Effective and Efficient Causal multi-modal Spatio-Temporal Prediction framework. E^2-CSTP leverages cross-modal attention and gating mechanisms to effectively integrate multi-modal data. Building on this, we design a dual-branch causal inference approach: the primary branch focuses on spatio-temporal prediction, while the auxiliary branch mitigates bias by modeling additional modalities and applying causal interventions to uncover true causal dependencies. To improve model efficiency, we integrate GCN with the Mamba architecture for accelerated spatio-temporal encoding. Extensive experiments on 4 real-world datasets show that E^2-CSTP significantly outperforms 9 state-of-the-art methods, achieving up to 9.66% improvements in accuracy as well as 17.37%-56.11% reductions in computational overhead.
♻ ☆ Rademacher Meets Colors: More Expressivity, but at What Cost ?
The expressive power of graph neural networks (GNNs) is typically understood through their correspondence with graph isomorphism tests such as the Weisfeiler-Leman (WL) hierarchy. While more expressive GNNs can distinguish a richer set of graphs, they are also observed to suffer from higher generalization error. This work provides a theoretical explanation for this trade-off by linking expressivity and generalization through the lens of coloring algorithms. Specifically, we show that the number of equivalence classes induced by WL colorings directly bounds the GNNs Rademacher complexity -- a key data-dependent measure of generalization. Our analysis reveals that greater expressivity leads to higher complexity and thus weaker generalization guarantees. Furthermore, we prove that the Rademacher complexity is stable under perturbations in the color counts across different samples, ensuring robustness to sampling variability across datasets. Importantly, our framework is not restricted to message-passing GNNs or 1-WL, but extends to arbitrary GNN architectures and expressivity measures that partition graphs into equivalence classes. These results unify the study of expressivity and generalization in GNNs, providing a principled understanding of why increasing expressive power often comes at the cost of generalization.
♻ ☆ Interpretable Clustering with Adaptive Heterogeneous Causal Structure Learning in Mixed Observational Data
Understanding causal heterogeneity is essential for scientific discovery in domains such as biology and medicine. However, existing methods lack causal awareness, with insufficient modeling of heterogeneity, confounding, and observational constraints, leading to poor interpretability and difficulty distinguishing true causal heterogeneity from spurious associations. We propose an unsupervised framework, HCL (Interpretable Causal Mechanism-Aware Clustering with Adaptive Heterogeneous Causal Structure Learning), that jointly infers latent clusters and their associated causal structures from mixed-type observational data without requiring temporal ordering, environment labels, interventions or other prior knowledge. HCL relaxes the homogeneity and sufficiency assumptions by introducing an equivalent representation that encodes both structural heterogeneity and confounding. It further develops a bi-directional iterative strategy to alternately refine causal clustering and structure learning, along with a self-supervised regularization that balance cross-cluster universality and specificity. Together, these components enable convergence toward interpretable, heterogeneous causal patterns. Theoretically, we show identifiability of heterogeneous causal structures under mild conditions. Empirically, HCL achieves superior performance in both clustering and structure learning tasks, and recovers biologically meaningful mechanisms in real-world single-cell perturbation data, demonstrating its utility for discovering interpretable, mechanism-level causal heterogeneity.
♻ ☆ CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
The exponential growth in demand for GPU computing resources has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization that employs a novel contrastive RL algorithm. CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on A100, it delivers an average speedup of x3.12 with a median speedup of x1.42 against default baselines over across all 250 CUDA kernels of KernelBench, with peak speedups reaching x120. In addition to the default baseline provided by KernelBench, CUDA-L1 demonstrates x2.77 over Torch Compile, x2.88 over Torch Compile with reduce overhead, x2.81 over CUDA Graph implementations, and remarkably x7.72 over cuDNN libraries. Furthermore, the model also demonstrates portability across different GPU architectures. Beyond these benchmark results, CUDA-L1 demonstrates several properties: it 1) discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) uncovers fundamental principles of CUDA optimization, such as the multiplicative nature of optimizations; 3) identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that actually harm performance. The capabilities demonstrate that, RL can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources.
comment: Project Page: https://deepreinforce-ai.github.io/cudal1_blog/
♻ ☆ Mixture-of-Experts Meets In-Context Reinforcement Learning
In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in fully harnessing in-context learning within RL domains: the intrinsic multi-modality of the state-action-reward data and the diverse, heterogeneous nature of decision tasks. To tackle these challenges, we propose T2MIR (Token- and Task-wise MoE for In-context RL), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models. T2MIR substitutes the feedforward layer with two parallel layers: a token-wise MoE that captures distinct semantics of input tokens across multiple modalities, and a task-wise MoE that routes diverse tasks to specialized experts for managing a broad task distribution with alleviated gradient conflicts. To enhance task-wise routing, we introduce a contrastive learning method that maximizes the mutual information between the task and its router representation, enabling more precise capture of task-relevant information. The outputs of two MoE components are concatenated and fed into the next layer. Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines. We bring the potential and promise of MoE to ICRL, offering a simple and scalable architectural enhancement to advance ICRL one step closer toward achievements in language and vision communities. Our code is available at https://github.com/NJU-RL/T2MIR.
comment: 28 pages, 13 figures, 17 tables
♻ ☆ An unsupervised tour through the hidden pathways of deep neural networks
The goal of this thesis is to improve our understanding of the internal mechanisms by which deep artificial neural networks create meaningful representations and are able to generalize. We focus on the challenge of characterizing the semantic content of the hidden representations with unsupervised learning tools, partially developed by us and described in this thesis, which allow harnessing the low-dimensional structure of the data. Chapter 2. introduces Gride, a method that allows estimating the intrinsic dimension of the data as an explicit function of the scale without performing any decimation of the data set. Our approach is based on rigorous distributional results that enable the quantification of uncertainty of the estimates. Moreover, our method is simple and computationally efficient since it relies only on the distances among nearest data points. In Chapter 3, we study the evolution of the probability density across the hidden layers in some state-of-the-art deep neural networks. We find that the initial layers generate a unimodal probability density getting rid of any structure irrelevant to classification. In subsequent layers, density peaks arise in a hierarchical fashion that mirrors the semantic hierarchy of the concepts. This process leaves a footprint in the probability density of the output layer, where the topography of the peaks allows reconstructing the semantic relationships of the categories. In Chapter 4, we study the problem of generalization in deep neural networks: adding parameters to a network that interpolates its training data will typically improve its generalization performance, at odds with the classical bias-variance trade-off. We show that wide neural networks learn redundant representations instead of overfitting to spurious correlation and that redundant neurons appear only if the network is regularized and the training error is zero.
comment: PhD thesis
♻ ☆ Global urban visual perception varies across demographics and personalities
Understanding people's preferences is crucial for urban planning, yet current approaches often combine responses from multi-cultural populations, obscuring demographic differences and risking amplifying biases. We conducted a largescale urban visual perception survey of streetscapes worldwide using street view imagery, examining how demographics -- including gender, age, income, education, race and ethnicity, and personality traits -- shape perceptions among 1,000 participants with balanced demographics from five countries and 45 nationalities. This dataset, Street Perception Evaluation Considering Socioeconomics (SPECS), reveals demographic- and personality-based differences across six traditional indicators -- safe, lively, wealthy, beautiful, boring, depressing -- and four new ones -- live nearby, walk, cycle, green. Location-based sentiments further shape these preferences. Machine learning models trained on existing global datasets tend to overestimate positive indicators and underestimate negative ones compared to human responses, underscoring the need for local context. Our study aspires to rectify the myopic treatment of street perception, which rarely considers demographics or personality traits.
♻ ☆ A High-Dimensional Statistical Method for Optimizing Transfer Quantities in Multi-Source Transfer Learning NeurIPS 2025
Multi-source transfer learning provides an effective solution to data scarcity in real- world supervised learning scenarios by leveraging multiple source tasks. In this field, existing works typically use all available samples from sources in training, which constrains their training efficiency and may lead to suboptimal results. To address this, we propose a theoretical framework that answers the question: what is the optimal quantity of source samples needed from each source task to jointly train the target model? Specifically, we introduce a generalization error measure based on K-L divergence, and minimize it based on high-dimensional statistical analysis to determine the optimal transfer quantity for each source task. Additionally, we develop an architecture-agnostic and data-efficient algorithm OTQMS to implement our theoretical results for target model training in multi- source transfer learning. Experimental studies on diverse architectures and two real-world benchmark datasets show that our proposed algorithm significantly outperforms state-of-the-art approaches in both accuracy and data efficiency. The code and supplementary materials are available in https://github.com/zqy0126/OTQMS.
comment: NeurIPS 2025 Poster
♻ ☆ PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models
Post-training quantization (PTQ) of large language models (LLMs) to extremely low bit-widths remains challenging due to the fundamental trade-off between computational efficiency and model expressiveness. While existing ultra-low-bit PTQ methods rely on binary approximations or complex compensation mechanisms, they suffer from either limited representational capacity or computational overhead that undermines their efficiency gains. We introduce PTQ to Trit-Planes (PTQTP), the first ternary-weight PTQ framework that decomposes weight matrices into structured ternary {-1, 0, 1} trit-planes using 2x1.58-bit representation. PTQTP achieves multiplication-free inference, identical to 1-bit quantization, while maintaining superior expressiveness through its novel structured decomposition. Our approach provides: (1) a theoretically grounded progressive approximation algorithm ensuring global weight consistency; (2) model-agnostic deployment across diverse modern LLMs without architectural modifications; and (3) uniform ternary operations that eliminate the need for mixed-precision or compensation schemes. Comprehensive experiments across LLaMA3.x and Qwen3 model families (0.6B-70B parameters) demonstrate that PTQTP significantly outperforms existing low-bit PTQ methods, achieving 82.4% mathematical reasoning retention versus 0% for competing approaches. PTQTP approaches and sometimes surpasses 1.58-bit quantization-aware training performance while requiring only single-hour quantization compared to 10-14 GPU days for training-based methods. These results establish PTQTP as a practical solution for efficient LLM deployment in resource-constrained environments. The code will be available at https://github.com/HeXiao-55/PTQTP.
comment: under review
♻ ☆ Selecting Critical Scenarios of DER Adoption in Distribution Grids Using Bayesian Optimization
We develop a new methodology to select scenarios of DER adoption most critical for distribution grids. Anticipating risks of future voltage and line flow violations due to additional PV adopters is central for utility investment planning but continues to rely on deterministic or ad hoc scenario selection. We propose a highly efficient search framework based on multi-objective Bayesian Optimization. We treat underlying grid stress metrics as computationally expensive black-box functions, approximated via Gaussian Process surrogates and design an acquisition function based on probability of scenarios being Pareto-critical across a collection of line- and bus-based violation objectives. Our approach provides a statistical guarantee and offers an order of magnitude speed-up relative to a conservative exhaustive search. Case studies on realistic feeders with 200-400 buses demonstrate the effectiveness and accuracy of our approach.
comment: 12 pages, 4 tables, 12 figures
♻ ☆ MathBode: Understanding LLM Reasoning with Dynamical Systems
This paper presents MathBode, a dynamic diagnostic for mathematical reasoning in large language models (LLMs). Instead of one-shot accuracy, MathBode treats each parametric problem as a system: we drive a single parameter sinusoidally and fit first-harmonic responses of model outputs and exact solutions. This yields interpretable, frequency-resolved metrics -- gain (amplitude tracking) and phase (lag) -- that form Bode-style fingerprints. Across five closed-form families (linear solve, ratio/saturation, compound interest, 2x2 linear systems, similar triangles), the diagnostic surfaces systematic low-pass behavior and growing phase lag that accuracy alone obscures. We compare several models against a symbolic baseline that calibrates the instrument ($G \approx 1$, $\phi \approx 0$). Results separate frontier from mid-tier models on dynamics, providing a compact, reproducible protocol that complements standard benchmarks with actionable measurements of reasoning fidelity and consistency. We open-source the dataset and code to enable further research and adoption.
♻ ☆ DeepRTE: Pre-trained Attention-based Neural Network for Radiative Transfer
In this paper, we propose a novel neural network approach, termed DeepRTE, to address the steady-state Radiative Transfer Equation (RTE). The RTE is a differential-integral equation that governs the propagation of radiation through a participating medium, with applications spanning diverse domains such as neutron transport, atmospheric radiative transfer, heat transfer, and optical imaging. Our DeepRTE framework demonstrates superior computational efficiency for solving the steady-state RTE, surpassing traditional methods and existing neural network approaches. This efficiency is achieved by embedding physical information through derivation of the RTE and mathematically-informed network architecture. Concurrently, DeepRTE achieves high accuracy with significantly fewer parameters, largely due to its incorporation of mechanisms such as multi-head attention. Furthermore, DeepRTE is a mesh-free neural operator framework with inherent zero-shot capability. This is achieved by incorporating Green's function theory and pre-training with delta-function inflow boundary conditions into both its architecture design and training data construction. The efficacy of the proposed approach is substantiated through comprehensive numerical experiments.
♻ ☆ RDB2G-Bench: A Comprehensive Benchmark for Automatic Graph Modeling of Relational Databases NeurIPS 2025
Recent advances have demonstrated the effectiveness of graph-based learning on relational databases (RDBs) for predictive tasks. Such approaches require transforming RDBs into graphs, a process we refer to as RDB-to-graph modeling, where rows of tables are represented as nodes and foreign-key relationships as edges. Yet, effective modeling of RDBs into graphs remains challenging. Specifically, there exist numerous ways to model RDBs into graphs, and performance on predictive tasks varies significantly depending on the chosen graph model of RDBs. In our analysis, we find that the best-performing graph model can yield up to a 10% higher performance compared to the common heuristic rule for graph modeling, which remains non-trivial to identify. To foster research on intelligent RDB-to-graph modeling, we introduce RDB2G-Bench, the first benchmark framework for evaluating such methods. We construct extensive datasets covering 5 real-world RDBs and 12 predictive tasks, resulting in around 50k graph model-performance pairs for efficient and reproducible evaluations. Thanks to our precomputed datasets, we were able to benchmark 10 automatic RDB-to-graph modeling methods on the 12 tasks about 380x faster than on-the-fly evaluation, which requires repeated GNN training. Our analysis of the datasets and benchmark results reveals key structural patterns affecting graph model effectiveness, along with practical implications for effective graph modeling. Our datasets and code are available at https://github.com/chlehdwon/RDB2G-Bench.
comment: Accepted at NeurIPS 2025 Datasets and Benchmarks Track
♻ ☆ Learning Wireless Interference Patterns: Decoupled GNN for Throughput Prediction in Heterogeneous Multi-Hop p-CSMA Networks
The p-persistent CSMA protocol is central to random-access MAC analysis, but predicting saturation throughput in heterogeneous multi-hop wireless networks remains a hard problem. Simplified models that assume a single, shared interference domain can underestimate throughput by 48-62% in sparse topologies. Exact Markov-chain analyses are accurate but scale exponentially in computation time, making them impractical for large networks. These computational barriers motivate structural machine learning approaches like GNNs for scalable throughput prediction in general network topologies. Yet off-the-shelf GNNs struggle here: a standard GCN yields 63.94% normalized mean absolute error (NMAE) on heterogeneous networks because symmetric normalization conflates a node's direct interference with higher-order, cascading effects that pertain to how interference propagates over the network graph. Building on these insights, we propose the Decoupled Graph Convolutional Network (D-GCN), a novel architecture that explicitly separates processing of a node's own transmission probability from neighbor interference effects. D-GCN replaces mean aggregation with learnable attention, yielding interpretable, per-neighbor contribution weights while capturing complex multihop interference patterns. D-GCN attains 3.3% NMAE, outperforms strong baselines, remains tractable even when exact analytical methods become computationally infeasible, and enables gradient-based network optimization that achieves within 1% of theoretical optima.
♻ ☆ Multimodal 3D Genome Pre-training NeurIPS 2025
Deep learning techniques have driven significant progress in various analytical tasks within 3D genomics in computational biology. However, a holistic understanding of 3D genomics knowledge remains underexplored. Here, we propose MIX-HIC, the first multimodal foundation model of 3D genome that integrates both 3D genome structure and epigenomic tracks, which obtains unified and comprehensive semantics. For accurate heterogeneous semantic fusion, we design the cross-modal interaction and mapping blocks for robust unified representation, yielding the accurate aggregation of 3D genome knowledge. Besides, we introduce the first large-scale dataset comprising over 1 million pairwise samples of Hi-C contact maps and epigenomic tracks for high-quality pre-training, enabling the exploration of functional implications in 3D genomics. Extensive experiments show that MIX-HIC can significantly surpass existing state-of-the-art methods in diverse downstream tasks. This work provides a valuable resource for advancing 3D genomics research.
comment: NeurIPS 2025
♻ ☆ Federated Structured Sparse PCA for Anomaly Detection in IoT Networks
Although federated learning has gained prominence as a privacy-preserving framework tailored for distributed Internet of Things (IoT) environments, current federated principal component analysis (PCA) methods lack integration of sparsity, a critical feature for robust anomaly detection. To address this limitation, we propose a novel federated structured sparse PCA (FedSSP) approach for anomaly detection in IoT networks. The proposed model uniquely integrates double sparsity regularization: (1) row-wise sparsity governed by $\ell_{2,p}$-norm with $p\in [0,1)$ to eliminate redundant feature dimensions, and (2) element-wise sparsity via $\ell_{q}$-norm with $q\in [0,1)$ to suppress noise-sensitive components. To solve this nonconvex problem in a distributed setting, we devise an efficient optimization algorithm based on the proximal alternating minimization (PAM). Numerical experiments validate that incorporating structured sparsity enhances both model interpretability and detection accuracy. Our code is available at https://github.com/xianchaoxiu/FedSSP.
♻ ☆ Forecasting Outside the Box: Application-Driven Optimal Pointwise Forecasts for Stochastic Optimization
We study a class of two-stage stochastic programs, namely, those with fixed recourse matrix and fixed costs, and linear second stage. We show that, under mild assumptions, the problem can be solved with just one scenario, which we call an ``optimal scenario.'' Such a scenario does not have to be unique and may fall outside the support of the underlying distribution. Although finding an optimal scenario in general might be hard, we show that the result can be particularly useful in the case of stochastic optimization problems with contextual information, where the goal is to optimize the expected value of a certain function given some contextual information (e.g., previous demand, customer type, etc.) that accompany the main data of interest. The contextual information allows for a better estimation of the quantity of interest via machine learning methods. We focus on a class of learning methods -- sometimes called in the literature decision-focused learning -- that integrate the learning and optimization procedures by means of a bilevel optimization formulation, which determines the parameters for pointwise forecasts. By using the optimal scenario result, we prove that when such models are applied to the class of contextual two-stage problems considered in this paper, the pointwise forecasts computed from the bilevel optimization formulation actually yield asymptotically the best approximation of an optimal scenario within the modeler's pre-specified set of parameterized forecast functions. Numerical results conducted with inventory problems from the literature (with synthetic data) as well as a bike-sharing problem with real data demonstrate that the proposed approach performs well when compared to benchmark methods from the literature.
♻ ☆ PEARL: Peer-Enhanced Adaptive Radio via On-Device LLM
We present PEARL (Peer-Enhanced Adaptive Radio via On-Device LLM), a framework for cooperative cross-layer optimization in device-to-device (D2D) communication. Building on our previous work on single-device on-device LLMs, PEARL extends the paradigm by leveraging both publisher and subscriber states to guide Wi-Fi Aware (WA) parameter selection. A context-aware reward, which normalizes latency by application tolerances and modulates energy by device battery states, provides richer supervision for KL-based finetuning. We study two lightweight variants: PEARL (Head + Low-Rank Adaptation (LoRA)) achieves the best overall performance, while PEARL-Lite (Head-only) delivers sub-20 ms inference at near-identical objective scores. Across synthetic scenarios grounded in real measurements, PEARL improves objective scores over heuristic and compact model baselines and reduces energy by up to 16% in cooperative low-battery cases. These results demonstrate that peer-aware context, reward-aligned training, and head-based efficiency make LLMs practical for always-on, on-device cross-layer control. Code, real-world demo, and dataset are available at https://github.com/abman23/pearl
♻ ☆ MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning
Remote sensing images contain complex spatial patterns and semantic structures, which makes the captioning model difficult to accurately describe. Encoder-decoder architectures have become the widely used approach for RSIC by translating visual content into descriptive text. However, many existing methods rely on a single-stream architecture, which weakens the model to accurately describe the image. Such single-stream architectures typically struggle to extract diverse spatial features or capture complex semantic relationships, limiting their effectiveness in scenes with high intraclass similarity or contextual ambiguity. In this work, we propose a novel Multi-stream Encoder-decoder Framework (MsEdF) which improves the performance of RSIC by optimizing both the spatial representation and language generation of encoder-decoder architecture. The encoder fuses information from two complementary image encoders, thereby promoting feature diversity through the integration of multiscale and structurally distinct cues. To improve the capture of context-aware descriptions, we refine the input sequence's semantic modeling on the decoder side using a stacked GRU architecture with an element-wise aggregation scheme. Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.
♻ ☆ MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs
Video large language models (Video-LLMs) have made significant progress in understanding videos. However, processing multiple frames leads to lengthy visual token sequences, presenting challenges such as the limited context length cannot accommodate the entire video, and the inclusion of irrelevant frames hinders visual perception. Hence, effective frame selection is crucial. This paper emphasizes that frame selection should follow three key principles: query relevance, list-wise diversity, and sequentiality. Existing methods, such as uniform frame sampling and query-frame matching, do not capture all of these principles. Thus, we propose Markov decision determinantal point process with dynamic programming (MDP3) for frame selection, a training-free and model-agnostic method that can be seamlessly integrated into existing Video-LLMs. Our method first estimates frame similarities conditioned on the query using a conditional Gaussian kernel within the reproducing kernel Hilbert space~(RKHS). We then apply the determinantal point process~(DPP) to the similarity matrix to capture both query relevance and list-wise diversity. To incorporate sequentiality, we segment the video and apply DPP within each segment, conditioned on the preceding segment selection, modeled as a Markov decision process~(MDP) for allocating selection sizes across segments. Theoretically, MDP3 provides a \((1 - 1/e)\)-approximate solution to the NP-hard list-wise frame selection problem with pseudo-polynomial time complexity, demonstrating its efficiency. Empirically, MDP3 significantly outperforms existing methods, verifying its effectiveness and robustness.
comment: 26 pages, 14 figures
♻ ☆ TrajAgent: An LLM-Agent Framework for Trajectory Modeling via Large-and-Small Model Collaboration NeurIPS 2025
Trajectory modeling, which includes research on trajectory data pattern mining and future prediction, has widespread applications in areas such as life services, urban transportation, and public administration. Numerous methods have been proposed to address specific problems within trajectory modeling. However, the heterogeneity of data and the diversity of trajectory tasks make effective and reliable trajectory modeling an important yet highly challenging endeavor, even for domain experts. In this paper, we propose TrajAgent, an agent framework powered by large language models, designed to facilitate robust and efficient trajectory modeling through automation modeling. This framework leverages and optimizes diverse specialized models to address various trajectory modeling tasks across different datasets effectively. In TrajAgent, we first develop UniEnv, an execution environment with a unified data and model interface, to support the execution and training of various models. Building on UniEnv, we introduce an agentic workflow designed for automatic trajectory modeling across various trajectory tasks and data. Furthermore, we introduce collaborative learning schema between LLM-based agents and small speciallized models, to enhance the performance of the whole framework effectively. Extensive experiments on five tasks using four real-world datasets demonstrate the effectiveness of TrajAgent in automated trajectory modeling, achieving a performance improvement of 2.38%-69.91% over baseline methods. The codes and data can be accessed via https://github.com/tsinghua-fib-lab/TrajAgent.
comment: Accepted by NeurIPS 2025, https://github.com/tsinghua-fib-lab/TrajAgent
♻ ☆ MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems
Scaling up data, parameters, and test-time computation has been the mainstream methods to improve LLM systems (LLMsys), but their upper bounds are almost reached due to the gradual depletion of high-quality data and marginal gains obtained from larger computational resource consumption. Inspired by the abilities of human and traditional AI systems in learning from practice, constructing memory and continual learning frameworks for LLMsys has become an important and popular research direction in recent literature. Yet, existing benchmarks for LLM memory often focus on evaluating the system on homogeneous reading comprehension tasks with long-form inputs rather than testing their abilities to learn from accumulated user feedback in service time. Therefore, we propose a user feedback simulation framework and a comprehensive benchmark covering multiple domains, languages, and types of tasks to evaluate the continual learning abilities of LLMsys. Experiments show that the effectiveness and efficiency of state-of-the-art baselines are far from satisfying, and we hope this benchmark could pave the way for future studies on LLM memory and optimization algorithms.
♻ ☆ FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
While large language models (LLMs) excel at handling long-context sequences, they require substantial prefill computation and key-value (KV) cache, which can heavily burden computational efficiency and memory usage in both prefill and decoding stages. Recent works that compress KV caches with prefill acceleration reduce this cost but inadvertently tie the prefill compute reduction to the decoding KV budget. This coupling arises from overlooking the layer-dependent variation of critical context, often leading to accuracy degradation. To address this issue, we introduce FastKV, a KV cache compression framework designed to reduce latency in both prefill and decoding by leveraging the stabilization of token importance in later layers. FastKV performs full-context computation until a Token-Selective Propagation (TSP) layer, which forwards only the most informative tokens to subsequent layers. From these propagated tokens, FastKV independently selects salient KV entries for caching, thereby decoupling KV budget from the prefill compute reduction based on the TSP decision. This independent control of the TSP rate and KV retention rate enables flexible optimization of efficiency and accuracy. Experimental results show that FastKV achieves speedups of up to 1.82$\times$ in prefill and 2.87$\times$ in decoding compared to the full-context baseline, while matching the accuracy of the baselines that only accelerate the decoding stage. Our code is available at https://github.com/dongwonjo/FastKV.
♻ ☆ Schrödinger bridge for generative AI: Soft-constrained formulation and convergence analysis
Generative AI can be framed as the problem of learning a model that maps simple reference measures into complex data distributions, and it has recently found a strong connection to the classical theory of the Schr\"odinger bridge problems (SBPs) due partly to their common nature of interpolating between prescribed marginals via entropy-regularized stochastic dynamics. However, the classical SBP enforces hard terminal constraints, which often leads to instability in practical implementations, especially in high-dimensional or data-scarce regimes. To address this challenge, we follow the idea of the so-called soft-constrained Schr\"odinger bridge problem (SCSBP), in which the terminal constraint is replaced by a general penalty function. This relaxation leads to a more flexible stochastic control formulation of McKean-Vlasov type. We establish the existence of optimal solutions for all penalty levels and prove that, as the penalty grows, both the controls and value functions converge to those of the classical SBP at a linear rate. Our analysis builds on Doob's h-transform representations, the stability results of Schr\"odinger potentials, Gamma-convergence, and a novel fixed-point argument that couples an optimization problem over the space of measures with an auxiliary entropic optimal transport problem. These results not only provide the first quantitative convergence guarantees for soft-constrained bridges but also shed light on how penalty regularization enables robust generative modeling, fine-tuning, and transfer learning.
comment: 31 pages
♻ ☆ Riemannian-Geometric Fingerprints of Generative Models ICCV 2025
Recent breakthroughs and rapid integration of generative models (GMs) have sparked interest in the problem of model attribution and their fingerprints. For instance, service providers need reliable methods of authenticating their models to protect their IP, while users and law enforcement seek to verify the source of generated content for accountability and trust. In addition, a growing threat of model collapse is arising, as more model-generated data are being fed back into sources (e.g., YouTube) that are often harvested for training ("regurgitative training"), heightening the need to differentiate synthetic from human data. Yet, a gap still exists in understanding generative models' fingerprints, we believe, stemming from the lack of a formal framework that can define, represent, and analyze the fingerprints in a principled way. To address this gap, we take a geometric approach and propose a new definition of artifact and fingerprint of GMs using Riemannian geometry, which allows us to leverage the rich theory of differential geometry. Our new definition generalizes previous work (Song et al., 2024) to non-Euclidean manifolds by learning Riemannian metrics from data and replacing the Euclidean distances and nearest-neighbor search with geodesic distances and kNN-based Riemannian center of mass. We apply our theory to a new gradient-based algorithm for computing the fingerprints in practice. Results show that it is more effective in distinguishing a large array of GMs, spanning across 4 different datasets in 2 different resolutions (64 by 64, 256 by 256), 27 model architectures, and 2 modalities (Vision, Vision-Language). Using our proposed definition significantly improves the performance on model attribution, as well as a generalization to unseen datasets, model types, and modalities, suggesting its practical efficacy.
comment: ICCV 2025 Highlight paper
♻ ☆ JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model NeurIPS 2025
Large language models (LLMs) have revolutionized natural language processing and are increasingly applied to other sequential data types, including genetic sequences. However, adapting LLMs to genomics presents significant challenges. Capturing complex genomic interactions requires modeling long-range dependencies within DNA sequences, where interactions often span over 10,000 base pairs, even within a single gene, posing substantial computational burdens under conventional model architectures and training paradigms. Moreover, standard LLM training approaches are suboptimal for DNA: autoregressive training, while efficient, supports only unidirectional understanding. However, DNA is inherently bidirectional, e.g., bidirectional promoters regulate transcription in both directions and account for nearly 11% of human gene expression. Masked language models (MLMs) allow bidirectional understanding but are inefficient, as only masked tokens contribute to the loss per step. To address these limitations, we introduce JanusDNA, the first bidirectional DNA foundation model built upon a novel pretraining paradigm that combines the optimization efficiency of autoregressive modeling with the bidirectional comprehension of masked modeling. JanusDNA adopts a hybrid Mamba, Attention and Mixture of Experts (MoE) architecture, combining long-range modeling of Attention with efficient sequential learning of Mamba. MoE layers further scale model capacity via sparse activation while keeping computational cost low. Notably, JanusDNA processes up to 1 million base pairs at single nucleotide resolution on a single 80GB GPU. Extensive experiments and ablations show JanusDNA achieves new SOTA results on three genomic representation benchmarks, outperforming models with 250x more activated parameters. Code: https://github.com/Qihao-Duan/JanusDNA
comment: NeurIPS 2025
♻ ☆ Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay NeurIPS 2025
Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism inspired by experience replay in traditional RL. This technique reuses recent rollouts, lowering per-step computation while maintaining stable updates. Experiments across 6 LLM-dataset combinations show that our method reduces RL fine-tuning time by 23% to 62% while reaching the same level of performance as the original GRPO algorithm. Our code is available at https://github.com/ASTRAL-Group/data-efficient-llm-rl.
comment: Accepted at NeurIPS 2025
♻ ☆ Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?
Reinforcement learning has been central to recent advances in large language model reasoning, but most algorithms rely on on-policy training that demands fresh rollouts at every update, limiting efficiency and scalability. Asynchronous RL systems alleviate this by decoupling rollout generation from training, yet their effectiveness hinges on tolerating large staleness in rollout data, a setting where existing methods either degrade in performance or collapse. We revisit this challenge and uncover a prosperity-before-collapse phenomenon: stale data can be as informative as on-policy data if exploited properly. Building on this insight, we introduce M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of importance weights to suppress only extreme outliers while preserving informative updates. Notably, M2PO sharply reduces the fraction of clipped tokens under high staleness (from 1.22% to 0.06% over training), precisely masking high-variance tokens while maintaining stable optimization. Extensive evaluation across six models (from 1.7B to 32B) and eight benchmarks shows that M2PO delivers stable off-policy training even with data stale by at least 256 model updates and matches on-policy performance.
♻ ☆ ReCode: Unify Plan and Action for Universal Granularity Control
Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-level form of action. However, current Large Language Model (LLM)-based agents lack this crucial capability to operate fluidly across decision granularities. This limitation stems from existing paradigms that enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization. We propose ReCode (Recursive Code Generation), a novel paradigm that addresses this limitation by unifying planning and action within a single code representation. In this representation, ReCode treats high-level plans as abstract placeholder functions, which the agent then recursively decomposes into finer-grained sub-functions until reaching primitive actions. This recursive approach dissolves the rigid boundary between plan and action, enabling the agent to dynamically control its decision granularity. Furthermore, the recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes. Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training, validating our core insight that unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control. The code is available at https://github.com/FoundationAgents/ReCode.
♻ ☆ Unveiling Concept Attribution in Diffusion Models
Diffusion models have shown remarkable abilities in generating realistic and high-quality images from text prompts. However, a trained model remains largely black-box; little do we know about the roles of its components in exhibiting a concept such as objects or styles. Recent works employ causal tracing to localize knowledge-storing layers in generative models without showing how other layers contribute to the target concept. In this work, we approach diffusion models' interpretability problem from a more general perspective and pose a question: \textit{``How do model components work jointly to demonstrate knowledge?''}. To answer this question, we decompose diffusion models using component attribution, systematically unveiling the importance of each component (specifically the model parameter) in generating a concept. The proposed framework, called \textbf{C}omponent \textbf{A}ttribution for \textbf{D}iffusion Model (CAD), discovers the localization of concept-inducing (positive) components, while interestingly uncovers another type of components that contribute negatively to generating a concept, which is missing in the previous knowledge localization work. Based on this holistic understanding of diffusion models, we introduce two fast, inference-time model editing algorithms, CAD-Erase and CAD-Amplify; in particular, CAD-Erase enables erasure and CAD-Amplify allows amplification of a generated concept by ablating the positive and negative components, respectively, while retaining knowledge of other concepts. Extensive experimental results validate the significance of both positive and negative components pinpointed by our framework, demonstrating the potential of providing a complete view of interpreting generative models. Our code is available \href{https://github.com/mail-research/CAD-attribution4diffusion}{here}.
♻ ☆ TraceTrans: Translation and Spatial Tracing for Surgical Prediction
Image-to-image translation models have achieved notable success in converting images across visual domains and are increasingly used for medical tasks such as predicting post-operative outcomes and modeling disease progression. However, most existing methods primarily aim to match the target distribution and often neglect spatial correspondences between the source and translated images. This limitation can lead to structural inconsistencies and hallucinations, undermining the reliability and interpretability of the predictions. These challenges are accentuated in clinical applications by the stringent requirement for anatomical accuracy. In this work, we present TraceTrans, a novel deformable image translation model designed for post-operative prediction that generates images aligned with the target distribution while explicitly revealing spatial correspondences with the pre-operative input. The framework employs an encoder for feature extraction and dual decoders for predicting spatial deformations and synthesizing the translated image. The predicted deformation field imposes spatial constraints on the generated output, ensuring anatomical consistency with the source. Extensive experiments on medical cosmetology and brain MRI datasets demonstrate that TraceTrans delivers accurate and interpretable post-operative predictions, highlighting its potential for reliable clinical deployment.
♻ ☆ CFM-GP: Unified Conditional Flow Matching to Learn Gene Perturbation Across Cell Types
Understanding gene perturbation effects across diverse cellular contexts is a central challenge in functional genomics, with important implications for therapeutic discovery and precision medicine. Single-cell technologies enable high-resolution measurement of transcriptional responses, but collecting such data is costly and time-consuming, especially when repeated for each cell type. Existing computational methods often require separate models per cell type, limiting scalability and generalization. We present CFM-GP, a method for cell type-agnostic gene perturbation prediction. CFM-GP learns a continuous, time-dependent transformation between unperturbed and perturbed gene expression distributions, conditioned on cell type, allowing a single model to predict across all cell types. Unlike prior approaches that use discrete modeling, CFM-GP employs a flow matching objective to capture perturbation dynamics in a scalable manner. We evaluate on five datasets: SARS-CoV-2 infection, IFN-beta stimulated PBMCs, glioblastoma treated with Panobinostat, lupus under IFN-beta stimulation, and Statefate progenitor fate mapping. CFM-GP consistently outperforms state-of-the-art baselines in R-squared and Spearman correlation, and pathway enrichment analysis confirms recovery of key biological pathways. These results demonstrate the robustness and biological fidelity of CFM-GP as a scalable solution for cross-cell type gene perturbation prediction.
comment: 28 Pages, 19 Tables, 8 Figures. The first two authors contributed equally
♻ ☆ Robustness is Important: Limitations of LLMs for Data Fitting
Large Language Models (LLMs) are being applied in a wide array of settings, well beyond the typical language-oriented use cases. In particular, LLMs are increasingly used as a plug-and-play method for fitting data and generating predictions. Prior work has shown that LLMs, via in-context learning or supervised fine-tuning, can perform competitively with many tabular supervised learning techniques in terms of predictive performance. However, we identify a critical vulnerability of using LLMs for data fitting -- making changes to data representation that are completely irrelevant to the underlying learning task can drastically alter LLMs' predictions on the same data. For example, simply changing variable names can sway the size of prediction error by as much as 82% in certain settings. Such prediction sensitivity with respect to task-irrelevant variations manifests under both in-context learning and supervised fine-tuning, for both close-weight and open-weight general-purpose LLMs. Moreover, by examining the attention scores of an open-weight LLM, we discover a non-uniform attention pattern: training examples and variable names/values which happen to occupy certain positions in the prompt receive more attention when output tokens are generated, even though different positions are expected to receive roughly the same attention. This partially explains the sensitivity in the presence of task-irrelevant variations. We also consider a state-of-the-art tabular foundation model (TabPFN) trained specifically for data fitting. Despite being explicitly designed to achieve prediction robustness, TabPFN is still not immune to task-irrelevant variations. Overall, despite LLMs' impressive predictive capabilities, currently they lack even the basic level of robustness to be used as a principled data-fitting tool.
♻ ☆ RL-AUX: Reinforcement Learning for Auxiliary Task Generation
Auxiliary Learning (AL) is a special case of Multi-task Learning (MTL) in which a network trains on auxiliary tasks to improve performance on its main task. This technique is used to improve generalization and, ultimately, performance on the network's main task. AL has been demonstrated to improve performance across multiple domains, including navigation, image classification, and natural language processing. One weakness of AL is the need for labeled auxiliary tasks, which can require human effort and domain expertise to generate. Meta Learning techniques have been used to solve this issue by learning an additional auxiliary task generation network that can create helpful tasks for the primary network. The most prominent techniques rely on Bi-Level Optimization, which incurs computational cost and increased code complexity. To avoid the need for Bi-Level Optimization, we present an RL-based approach to dynamically create auxiliary tasks. In this framework, an RL agent is tasked with selecting auxiliary labels for every data point in a training set. The agent is rewarded when their selection improves the performance on the primary task. We also experiment with learning optimal strategies for weighing the auxiliary loss per data point. On the 20-Superclass CIFAR100 problem, our RL approach outperforms human-labeled auxiliary tasks and performs as well as a prominent Bi-Level Optimization technique. Our weight learning approaches significantly outperform all of these benchmarks. For example, a Weight-Aware RL-based approach helps the VGG16 architecture achieve 80.9% test accuracy while the human-labeled auxiliary task setup achieved 75.53%. The goal of this work is to (1) prove that RL is a viable approach to dynamically generate auxiliary tasks and (2) demonstrate that per-sample auxiliary task weights can be learned alongside the auxiliary task labels and can achieve strong results.
♻ ☆ Eigen-Value: Efficient Domain-Robust Data Valuation via Eigenvalue-Based Approach
Data valuation has become central in the era of data-centric AI. It drives efficient training pipelines and enables objective pricing in data markets by assigning a numeric value to each data point. Most existing data valuation methods estimate the effect of removing individual data points by evaluating changes in model validation performance under in-distribution (ID) settings, as opposed to out-of-distribution (OOD) scenarios where data follow different patterns. Since ID and OOD data behave differently, data valuation methods based on ID loss often fail to generalize to OOD settings, particularly when the validation set contains no OOD data. Furthermore, although OOD-aware methods exist, they involve heavy computational costs, which hinder practical deployment. To address these challenges, we introduce \emph{Eigen-Value} (EV), a plug-and-play data valuation framework for OOD robustness that uses only an ID data subset, including during validation. EV provides a new spectral approximation of domain discrepancy, which is the gap of loss between ID and OOD using ratios of eigenvalues of ID data's covariance matrix. EV then estimates the marginal contribution of each data point to this discrepancy via perturbation theory, alleviating the computational burden. Subsequently, EV plugs into ID loss-based methods by adding an EV term without any additional training loop. We demonstrate that EV achieves improved OOD robustness and stable value rankings across real-world datasets, while remaining computationally lightweight. These results indicate that EV is practical for large-scale settings with domain shift, offering an efficient path to OOD-robust data valuation.
♻ ☆ Your Dense Retriever is Secretly an Expeditious Reasoner
Dense retrievers enhance retrieval by encoding queries and documents into continuous vectors, but they often struggle with reasoning-intensive queries. Although Large Language Models (LLMs) can reformulate queries to capture complex reasoning, applying them universally incurs significant computational cost. In this work, we propose Adaptive Query Reasoning (AdaQR), a hybrid query rewriting framework. Within this framework, a Reasoner Router dynamically directs each query to either fast dense reasoning or deep LLM reasoning. The dense reasoning is achieved by the Dense Reasoner, which performs LLM-style reasoning directly in the embedding space, enabling a controllable trade-off between efficiency and accuracy. Experiments on large-scale retrieval benchmarks BRIGHT show that AdaQR reduces reasoning cost by 28% while preserving-or even improving-retrieval performance by 7%.
comment: 16 pages, 11 figures
♻ ☆ Learning Provably Improves the Convergence of Gradient Descent NeurIPS 2025
Learn to Optimize (L2O) trains deep neural network-based solvers for optimization, achieving success in accelerating convex problems and improving non-convex solutions. However, L2O lacks rigorous theoretical backing for its own training convergence, as existing analyses often use unrealistic assumptions -- a gap this work highlights empirically. We bridge this gap by proving the training convergence of L2O models that learn Gradient Descent (GD) hyperparameters for quadratic programming, leveraging the Neural Tangent Kernel (NTK) theory. We propose a deterministic initialization strategy to support our theoretical results and promote stable training over extended optimization horizons by mitigating gradient explosion. Our L2O framework demonstrates over 50% better optimality than GD and superior robustness over state-of-the-art L2O methods on synthetic datasets. The code of our method can be found from https://github.com/NetX-lab/MathL2OProof-Official.
comment: 48 pages, 11 figures, NeurIPS 2025
♻ ☆ Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation
Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction, a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50\% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
♻ ☆ CT-OT Flow: Estimating Continuous-Time Dynamics from Discrete Temporal Snapshots
In many real-world settings--e.g., single-cell RNA sequencing, mobility sensing, and environmental monitoring--data are observed only as temporally aggregated snapshots collected over finite time windows, often with noisy or uncertain timestamps, and without access to continuous trajectories. We study the problem of estimating continuous-time dynamics from such snapshots. We present Continuous-Time Optimal Transport Flow (CT-OT Flow), a two-stage framework that (i) infers high-resolution time labels by aligning neighboring intervals via partial optimal transport (POT) and (ii) reconstructs a continuous-time data distribution through temporal kernel smoothing, from which we sample pairs of nearby times to train standard ODE/SDE models. Our formulation explicitly accounts for snapshot aggregation and time-label uncertainty and uses practical accelerations (screening and mini-batch POT), making it applicable to large datasets. Across synthetic benchmarks and two real datasets (scRNA-seq and typhoon tracks), CT-OT Flow reduces distributional and trajectory errors compared with OT-CFM, [SF]\(^{2}\)M, TrajectoryNet, MFM, and ENOT.
♻ ☆ Turbocharging Gaussian Process Inference with Approximate Sketch-and-Project NeurIPS 2025
Gaussian processes (GPs) play an essential role in biostatistics, scientific machine learning, and Bayesian optimization for their ability to provide probabilistic predictions and model uncertainty. However, GP inference struggles to scale to large datasets (which are common in modern applications), since it requires the solution of a linear system whose size scales quadratically with the number of samples in the dataset. We propose an approximate, distributed, accelerated sketch-and-project algorithm ($\texttt{ADASAP}$) for solving these linear systems, which improves scalability. We use the theory of determinantal point processes to show that the posterior mean induced by sketch-and-project rapidly converges to the true posterior mean. In particular, this yields the first efficient, condition number-free algorithm for estimating the posterior mean along the top spectral basis functions, showing that our approach is principled for GP inference. $\texttt{ADASAP}$ outperforms state-of-the-art solvers based on conjugate gradient and coordinate descent across several benchmark datasets and a large-scale Bayesian optimization task. Moreover, $\texttt{ADASAP}$ scales to a dataset with $> 3 \cdot 10^8$ samples, a feat which has not been accomplished in the literature.
comment: NeurIPS 2025; 31 pages, 7 figures, 2 tables
♻ ☆ Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model IROS 2025
Omnidirectional depth perception is essential for mobile robotics applications that require scene understanding across a full 360{\deg} field of view. Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps without relying on expensive active sensing. However, existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, due to the scarcity of real-world data. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture. We introduce a dedicated two-stage training strategy to utilize the relative monocular depth features for our omnidirectional stereo matching before scale-invariant fine-tuning. DFI-OmniStereo achieves state-of-the-art results on the real-world Helvipad dataset, reducing disparity MAE by approximately 16% compared to the previous best omnidirectional stereo method.
comment: Accepted at IROS 2025. Project page: https://vita-epfl.github.io/DFI-OmniStereo-website/
♻ ☆ High-Energy Concentration for Federated Learning in Frequency Domain
Federated Learning (FL) presents significant potential for collaborative optimization without data sharing. Since synthetic data is sent to the server, leveraging the popular concept of dataset distillation, this FL framework protects real data privacy while alleviating data heterogeneity. However, such methods are still challenged by the redundant information and noise in entire spatial-domain designs, which inevitably increases the communication burden. In this paper, we propose a novel Frequency-Domain aware FL method with high-energy concentration (FedFD) to address this problem. Our FedFD is inspired by the discovery that the discrete cosine transform predominantly distributes energy to specific regions, referred to as high-energy concentration. The principle behind FedFD is that low-energy like high-frequency components usually contain redundant information and noise, thus filtering them helps reduce communication costs and optimize performance. Our FedFD is mathematically formulated to preserve the low-frequency components using a binary mask, facilitating an optimal solution through frequency-domain distribution alignment. In particular, real data-driven synthetic classification is imposed into the loss to enhance the quality of the low-frequency components. On five image and speech datasets, FedFD achieves superior performance than state-of-the-art methods while reducing communication costs. For example, on the CIFAR-10 dataset with Dirichlet coefficient $\alpha = 0.01$, FedFD achieves a minimum reduction of 37.78\% in the communication cost, while attaining a 10.88\% performance gain.
♻ ☆ CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data
True intelligence hinges on the ability to uncover and leverage hidden causal relations. Despite significant progress in AI and computer vision (CV), there remains a lack of benchmarks for assessing models' abilities to infer latent causality from complex visual data. In this paper, we introduce \textsc{\textbf{Causal3D}}, a novel and comprehensive benchmark that integrates structured data (tables) with corresponding visual representations (images) to evaluate causal reasoning. Designed within a systematic framework, Causal3D comprises 19 3D-scene datasets capturing diverse causal relations, views, and backgrounds, enabling evaluations across scenes of varying complexity. We assess multiple state-of-the-art methods, including classical causal discovery, causal representation learning, and large/vision-language models (LLMs/VLMs). Our experiments show that as causal structures grow more complex without prior knowledge, performance declines significantly, highlighting the challenges even advanced methods face in complex causal scenarios. Causal3D serves as a vital resource for advancing causal reasoning in CV and fostering trustworthy AI in critical domains.
♻ ☆ Equivariance Everywhere All At Once: A Recipe for Graph Foundation Models
Graph machine learning architectures are typically tailored to specific tasks on specific datasets, which hinders their broader applicability. This has led to a new quest in graph machine learning: how to build graph foundation models capable of generalizing across arbitrary graphs and features? In this work, we present a recipe for designing graph foundation models for node-level tasks from first principles. The key ingredient underpinning our study is a systematic investigation of the symmetries that a graph foundation model must respect. In a nutshell, we argue that label permutation-equivariance alongside feature permutation-invariance are necessary in addition to the common node permutation-equivariance on each local neighborhood of the graph. To this end, we first characterize the space of linear transformations that are equivariant to permutations of nodes and labels, and invariant to permutations of features. We then prove that the resulting network is a universal approximator on multisets that respect the aforementioned symmetries. Our recipe uses such layers on the multiset of features induced by the local neighborhood of the graph to obtain a class of graph foundation models for node property prediction. We validate our approach through extensive experiments on 29 real-world node classification datasets, demonstrating both strong zero-shot empirical performance and consistent improvement as the number of training graphs increases.
♻ ☆ Seeding neural network quantum states with tensor network states
We find an efficient approach to approximately convert matrix product states (MPSs) into restricted Boltzmann machine wave functions consisting of a multinomial hidden unit through a canonical polyadic (CP) decomposition of the MPSs. This method allows us to generate well-behaved initial neural network quantum states for quantum many-body ground-state calculations in polynomial time of the number of variational parameters and systematically shorten the distance between the initial states and the ground states while increasing the rank of the CP decomposition. We demonstrate the efficiency of our method by taking the transverse-field Ising model as an example and discuss possible applications of our method to more general quantum many-body systems in which the ground-state wave functions possess complex nodal structures.
comment: 15 pages, 15 figures, All codes and data used in this manuscript are available at https://github.com/ryuikaneko/mps2rbm
♻ ☆ Tractable Shapley Values and Interactions via Tensor Networks
We show how to replace the O(2^n) coalition enumeration over n features behind Shapley values and Shapley-style interaction indices with a few-evaluation scheme on a tensor-network (TN) surrogate: TN-SHAP. The key idea is to represent a predictor's local behavior as a factorized multilinear map, so that coalitional quantities become linear probes of a coefficient tensor. TN-SHAP replaces exhaustive coalition sweeps with just a small number of targeted evaluations to extract order-k Shapley interactions. In particular, both order-1 (single-feature) and order-2 (pairwise) computations have cost O(n*poly(chi) + n^2), where chi is the TN's maximal cut rank. We provide theoretical guarantees on the approximation error and tractability of TN-SHAP. On UCI datasets, our method matches enumeration on the fitted surrogate while reducing evaluation by orders of magnitude and achieves 25-1000x wall-clock speedups over KernelSHAP-IQ at comparable accuracy, while amortizing training across local cohorts.
♻ ☆ Partner Modelling Emerges in Recurrent Agents (But Only When It Matters)
Humans are remarkably adept at collaboration, able to infer the strengths and weaknesses of new partners in order to work successfully towards shared goals. To build AI systems with this capability, we must first understand its building blocks: does such flexibility require explicit, dedicated mechanisms for modelling others -- or can it emerge spontaneously from the pressures of open-ended cooperative interaction? To investigate this question, we train simple model-free RNN agents to collaborate with a population of diverse partners. Using the `Overcooked-AI' environment, we collect data from thousands of collaborative teams, and analyse agents' internal hidden states. Despite a lack of additional architectural features, inductive biases, or auxiliary objectives, the agents nevertheless develop structured internal representations of their partners' task abilities, enabling rapid adaptation and generalisation to novel collaborators. We investigated these internal models through probing techniques, and large-scale behavioural analysis. Notably, we find that structured partner modelling emerges when agents can influence partner behaviour by controlling task allocation. Our results show that partner modelling can arise spontaneously in model-free agents -- but only under environmental conditions that impose the right kind of social pressure.
♻ ☆ NeuroPilot: A Realtime Brain-Computer Interface system to enhance concentration of students in online learning
The prevalence of online learning poses a vital challenge in real-time monitoring of students' concentration. Traditional methods such as questionnaire assessments require manual intervention, and webcam-based monitoring fails to provide accurate insights about learners' mental focus as it is deceived by mere screen fixation without cognitive engagement. Existing BCI-based approaches lack real-time validation and evaluation procedures. To address these limitations, a Brain-Computer Interface (BCI) system is developed using a non-invasive Electroencephalogram (EEG) headband, FocusCalm, to record brainwave activity under attentive and non-attentive states. 20 minutes of data were collected from each of 20 participants watching a pre-recorded educational video. The data validation employed a novel intra-video questionnaire assessment. Subsequently, collected signals were segmented (sliding window), filtered (Butterworth bandpass), and cleaned (removal of high-amplitude and EOG artifacts such as eye blinks). Time, frequency, wavelet, and statistical features were extracted, followed by recursive feature elimination (RFE) with support vector machines (SVMs) to classify attention and non-attention states. The leave-one-subject-out (LOSO) cross-validation accuracy was found to be 88.77%. The system provides feedback alerts upon detection of a non-attention state and maintains focus profile logs. A pilot study was conducted to evaluate the effectiveness of real-time feedback. Five participants underwent a 10-minute session comprising a 5-minute baseline phase devoid of feedback, succeeded by a 5-minute feedback phase, during which alerts were activated if participants exhibited inattention for approximately 8 consecutive seconds. A paired t-test (t = 5.73, p = 0.007) indicated a statistically significant improvement in concentration during the feedback phase.
comment: V2: Author list updated with consent from all authors. Minor grammatical errors fixed. This work has been submitted to the IEEE for possible publication
♻ ☆ Approximating the universal thermal climate index using sparse regression with orthogonal polynomials
This article explores novel data-driven modeling approaches for analyzing and approximating the Universal Thermal Climate Index (UTCI), a physiologically-based metric integrating multiple atmospheric variables to assess thermal comfort. Given the nonlinear, multivariate structure of UTCI, we investigate symbolic and sparse regression techniques as tools for interpretable and efficient function approximation. In particular, we highlight the benefits of using orthogonal polynomial bases-such as Legendre polynomials-in sparse regression frameworks, demonstrating their advantages in stability, convergence, and hierarchical interpretability compared to standard polynomial expansions. We demonstrate that our models achieve significantly lower root-mean squared losses than the widely used sixth-degree polynomial benchmark-while using the same or fewer parameters. By leveraging Legendre polynomial bases, we construct models that efficiently populate a Pareto front of accuracy versus complexity and exhibit stable, hierarchical coefficient structures across varying model capacities. Training on just 20% of the data, our models generalize robustly to the remaining 80%, with consistent performance under bootstrapping. The decomposition effectively approximates the UTCI as a Fourier-like expansion in an orthogonal basis, yielding results near the theoretical optimum in the L2 (least squares) sense. We also connect these findings to the broader context of equation discovery in environmental modeling, referencing probabilistic grammar-based methods that enforce domain consistency and compactness in symbolic expressions. Taken together, these results illustrate how combining sparsity, orthogonality, and symbolic structure enables robust, interpretable modeling of complex environmental indices like UTCI - and significantly outperforms the state-of-the-art approximation in both accuracy and efficiency.
♻ ☆ Meta-Learning Objectives for Preference Optimization
Evaluating preference optimization (PO) algorithms on LLM alignment is a challenging task that presents prohibitive costs, noise, and several variables like model size and hyper-parameters. In this work, we show that it is possible to gain insights on the efficacy of PO algorithm on simpler benchmarks. We design a diagnostic suite of MuJoCo tasks and datasets, which we use to systematically evaluate PO algorithms, establishing a more controlled and cheaper benchmark. We then propose a novel family of PO algorithms based on mirror descent, which we call Mirror Preference Optimization (MPO). Through evolutionary strategies, we search this class to discover algorithms specialized to specific properties of preference datasets, such as mixed-quality or noisy data. We demonstrate that our discovered PO algorithms outperform all known algorithms in the targeted MuJoCo settings. Finally, based on the insights gained from our MuJoCo experiments, we design a PO algorithm that significantly outperform existing baselines in an LLM alignment task.
♻ ☆ Online Adaptation for Flying Quadrotors in Tight Formations
The task of flying in tight formations is challenging for teams of quadrotors because the complex aerodynamic wake interactions can destabilize individual team members as well as the team. Furthermore, these aerodynamic effects are highly nonlinear and fast-paced, making them difficult to model and predict. To overcome these challenges, we present L1 KNODE-DW MPC, an adaptive, mixed expert learning based control framework that allows individual quadrotors to accurately track trajectories while adapting to time-varying aerodynamic interactions during formation flights. We evaluate L1 KNODE-DW MPC in two different three-quadrotor formations and show that it outperforms several MPC baselines. Our results show that the proposed framework is capable of enabling the three-quadrotor team to remain vertically aligned in close proximity throughout the flight. These findings show that the L1 adaptive module compensates for unmodeled disturbances most effectively when paired with an accurate dynamics model. A video showcasing our framework and the physical experiments is available here: https://youtu.be/9QX1Q5Ut9Rs
comment: 10 pages, 4 figures
♻ ☆ AL-CoLe: Augmented Lagrangian for Constrained Learning
Despite the non-convexity of most modern machine learning parameterizations, Lagrangian duality has become a popular tool for addressing constrained learning problems. We revisit Augmented Lagrangian methods, which aim to mitigate the duality gap in non-convex settings while requiring only minimal modifications, and have remained comparably unexplored in constrained learning settings. We establish strong duality results under mild conditions, prove convergence of dual ascent algorithms to feasible and optimal primal solutions, and provide PAC-style generalization guarantees. Finally, we demonstrate its effectiveness on fairness constrained classification tasks.
♻ ☆ Robust Fitted-Q-Evaluation and Iteration under Sequentially Exogenous Unobserved Confounders
Offline reinforcement learning is important in domains such as medicine, economics, and e-commerce where online experimentation is costly, dangerous or unethical, and where the true model is unknown. However, most methods assume all covariates used in the behavior policy's action decisions are observed. Though this assumption, sequential ignorability/unconfoundedness, likely does not hold in observational data, most of the data that accounts for selection into treatment may be observed, motivating sensitivity analysis. We study robust policy evaluation and policy optimization in the presence of sequentially-exogenous unobserved confounders under a sensitivity model. We propose and analyze orthogonalized robust fitted-Q-iteration that uses closed-form solutions of the robust Bellman operator to derive a loss minimization problem for the robust Q function, and adds a bias-correction to quantile estimation. Our algorithm enjoys the computational ease of fitted-Q-iteration and statistical improvements (reduced dependence on quantile estimation error) from orthogonalization. We provide sample complexity bounds, insights, and show effectiveness both in simulations and on real-world longitudinal healthcare data of treating sepsis. In particular, our model of sequential unobserved confounders yields an online Markov decision process, rather than partially observed Markov decision process: we illustrate how this can enable warm-starting optimistic reinforcement learning algorithms with valid robust bounds from observational data.
comment: updated with new warmstarting, complex healthcare data case study
♻ ☆ Q-learning with Posterior Sampling
Bayesian posterior sampling techniques have demonstrated superior empirical performance in many exploration-exploitation settings. However, their theoretical analysis remains a challenge, especially in complex settings like reinforcement learning. In this paper, we introduce Q-Learning with Posterior Sampling (PSQL), a simple Q-learning-based algorithm that uses Gaussian posteriors on Q-values for exploration, akin to the popular Thompson Sampling algorithm in the multi-armed bandit setting. We show that in the tabular episodic MDP setting, PSQL achieves a regret bound of $\tilde O(H^2\sqrt{SAT})$, closely matching the known lower bound of $\Omega(H\sqrt{SAT})$. Here, S, A denote the number of states and actions in the underlying Markov Decision Process (MDP), and $T=KH$ with $K$ being the number of episodes and $H$ being the planning horizon. Our work provides several new technical insights into the core challenges in combining posterior sampling with dynamic programming and TD-learning-based RL algorithms, along with novel ideas for resolving those difficulties. We hope this will form a starting point for analyzing this efficient and important algorithmic technique in even more complex RL settings.
comment: Updated version
♻ ☆ Hyperparameters in Continual Learning: A Reality Check
Continual learning (CL) aims to train a model on a sequence of tasks (i.e., a CL scenario) while balancing the trade-off between plasticity (learning new tasks) and stability (retaining prior knowledge). The dominantly adopted conventional evaluation protocol for CL algorithms selects the best hyperparameters (e.g., learning rate, mini-batch size, regularization strengths, etc.) within a given scenario and then evaluates the algorithms using these hyperparameters in the same scenario. However, this protocol has significant shortcomings: it overestimates the CL capacity of algorithms and relies on unrealistic hyperparameter tuning, which is not feasible for real-world applications. From the fundamental principles of evaluation in machine learning, we argue that the evaluation of CL algorithms should focus on assessing the generalizability of their CL capacity to unseen scenarios. Based on this, we propose the Generalizable Two-phase Evaluation Protocol (GTEP) consisting of hyperparameter tuning and evaluation phases. Both phases share the same scenario configuration (e.g., number of tasks) but are generated from different datasets. Hyperparameters of CL algorithms are tuned in the first phase and applied in the second phase to evaluate the algorithms. We apply this protocol to class-incremental learning, both with and without pretrained models. Across more than 8,000 experiments, our results show that most state-of-the-art algorithms fail to replicate their reported performance, highlighting that their CL capacity has been significantly overestimated in the conventional evaluation protocol. Our implementation can be found in https://github.com/csm9493/GTEP.
comment: TMLR 2025 camera ready version
♻ ☆ Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation NeurIPS 2025
Knowledge distillation (KD) is a core component in the training and deployment of modern generative models, particularly large language models (LLMs). While its empirical benefits are well documented -- enabling smaller student models to emulate the performance of much larger teachers -- the underlying mechanisms by which KD improves generative quality remain poorly understood. In this work, we present a minimal working explanation of KD in generative modeling. Using a controlled simulation with mixtures of Gaussians, we demonstrate that distillation induces a trade-off between precision and recall in the student model. As the teacher distribution becomes more selective, the student concentrates more probability mass on high-likelihood regions at the expense of coverage -- a behavior modulated by a single entropy-controlling parameter. We then validate this effect in a large-scale language modeling setup using the SmolLM2 family of models. Empirical results reveal the same precision-recall dynamics observed in simulation, where precision corresponds to sample quality and recall to distributional coverage. This precision-recall trade-off in LLMs is found to be especially beneficial in scenarios where sample quality is more important than diversity, such as instruction tuning or downstream generation. Our analysis provides a simple and general explanation for the effectiveness of KD in generative modeling.
comment: NeurIPS 2025 camera ready version
♻ ☆ Partially Observable Multi-Agent Reinforcement Learning with Information Sharing ICML 2023
We study provable multi-agent reinforcement learning (RL) in the general framework of partially observable stochastic games (POSGs). To circumvent the known hardness results and the use of computationally intractable oracles, we advocate leveraging the potential \emph{information-sharing} among agents, a common practice in empirical multi-agent RL, and a standard model for multi-agent control systems with communication. We first establish several computational complexity results to justify the necessity of information-sharing, as well as the observability assumption that has enabled quasi-polynomial time and sample single-agent RL with partial observations, for tractably solving POSGs. Inspired by the inefficiency of planning in the ground-truth model, we then propose to further \emph{approximate} the shared common information to construct an approximate model of the POSG, in which an approximate \emph{equilibrium} (of the original POSG) can be found in quasi-polynomial-time, under the aforementioned assumptions. Furthermore, we develop a partially observable multi-agent RL algorithm whose time and sample complexities are \emph{both} quasi-polynomial. Finally, beyond equilibrium learning, we extend our algorithmic framework to finding the \emph{team-optimal solution} in cooperative POSGs, i.e., decentralized partially observable Markov decision processes, a more challenging goal. We establish concrete computational and sample complexities under several structural assumptions of the model. We hope our study could open up the possibilities of leveraging and even designing different \emph{information structures}, a well-studied notion in control theory, for developing both sample- and computation-efficient partially observable multi-agent RL.
comment: Journal extension of the conference version at ICML 2023 accepted to SIAM Journal on Control and Optimization (SICON)
♻ ☆ An Adversarial-Driven Experimental Study on Deep Learning for RF Fingerprinting
Radio frequency (RF) fingerprinting, which extracts unique hardware imperfections of radio devices, has emerged as a promising physical-layer device identification mechanism in zero trust architectures and beyond 5G networks. In particular, deep learning (DL) methods have demonstrated state-of-the-art performance in this domain. However, existing approaches have primarily focused on enhancing system robustness against temporal and spatial variations in wireless environments, while the security vulnerabilities of these DL-based approaches have often been overlooked. In this work, we systematically investigate the security risks of DL-based RF fingerprinting systems through an adversarial-driven experimental analysis. We observe a consistent misclassification behavior for DL models under domain shifts, where a device is frequently misclassified as another specific one. Our analysis based on extensive real-world experiments demonstrates that this behavior can be exploited as an effective backdoor to enable external attackers to intrude into the system. Furthermore, we show that training DL models on raw received signals causes the models to entangle RF fingerprints with environmental and signal-pattern features, creating additional attack vectors that cannot be mitigated solely through post-processing security methods such as confidence thresholds.
♻ ☆ Mesh-Informed Neural Operator : A Transformer Generative Approach
Generative models in function spaces, situated at the intersection of generative modeling and operator learning, are attracting increasing attention due to their immense potential in diverse scientific and engineering applications. While functional generative models are theoretically domain- and discretization-agnostic, current implementations heavily rely on the Fourier Neural Operator (FNO), limiting their applicability to regular grids and rectangular domains. To overcome these critical limitations, we introduce the Mesh-Informed Neural Operator (MINO). By leveraging graph neural operators and cross-attention mechanisms, MINO offers a principled, domain- and discretization-agnostic backbone for generative modeling in function spaces. This advancement significantly expands the scope of such models to more diverse applications in generative, inverse, and regression tasks. Furthermore, MINO provides a unified perspective on integrating neural operators with general advanced deep learning architectures. Finally, we introduce a suite of standardized evaluation metrics that enable objective comparison of functional generative models, addressing another critical gap in the field.
♻ ☆ Doubly Robust Alignment for Large Language Models NeurIPS 2025
This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM
comment: Accepted to NeurIPS 2025
♻ ☆ Jailbreak Transferability Emerges from Shared Representations
Jailbreak transferability is the surprising phenomenon when an adversarial attack compromising one model also elicits harmful responses from other models. Despite widespread demonstrations, there is little consensus on why transfer is possible: is it a quirk of safety training, an artifact of model families, or a more fundamental property of representation learning? We present evidence that transferability emerges from shared representations rather than incidental flaws. Across 20 open-weight models and 33 jailbreak attacks, we find two factors that systematically shape transfer: (1) representational similarity under benign prompts, and (2) the strength of the jailbreak on the source model. To move beyond correlation, we show that deliberately increasing similarity through benign only distillation causally increases transfer. Our qualitative analyses reveal systematic transferability patterns across different types of jailbreaks. For example, persona-style jailbreaks transfer far more often than cipher-based prompts, consistent with the idea that natural-language attacks exploit models' shared representation space, whereas cipher-based attacks rely on idiosyncratic quirks that do not generalize. Together, these results reframe jailbreak transfer as a consequence of representation alignment rather than a fragile byproduct of safety training.
♻ ☆ Adaptive Frontier Exploration on Graphs with Applications to Network-Based Disease Testing NeurIPS 2025
We study a sequential decision-making problem on a $n$-node graph $\mathcal{G}$ where each node has an unknown label from a finite set $\mathbf{\Omega}$, drawn from a joint distribution $\mathcal{P}$ that is Markov with respect to $\mathcal{G}$. At each step, selecting a node reveals its label and yields a label-dependent reward. The goal is to adaptively choose nodes to maximize expected accumulated discounted rewards. We impose a frontier exploration constraint, where actions are limited to neighbors of previously selected nodes, reflecting practical constraints in settings such as contact tracing and robotic exploration. We design a Gittins index-based policy that applies to general graphs and is provably optimal when $\mathcal{G}$ is a forest. Our implementation runs in $\mathcal{O}(n^2 \cdot |\mathbf{\Omega}|^2)$ time while using $\mathcal{O}(n \cdot |\mathbf{\Omega}|^2)$ oracle calls to $\mathcal{P}$ and $\mathcal{O}(n^2 \cdot |\mathbf{\Omega}|)$ space. Experiments on synthetic and real-world graphs show that our method consistently outperforms natural baselines, including in non-tree, budget-limited, and undiscounted settings. For example, in HIV testing simulations on real-world sexual interaction networks, our policy detects nearly all positive cases with only half the population tested, substantially outperforming other baselines.
comment: Accepted into NeurIPS 2025
♻ ☆ CANDI: Hybrid Discrete-Continuous Diffusion Models
While continuous diffusion has shown remarkable success in continuous domains such as image generation, its direct application to discrete data has underperformed compared to purely discrete formulations. This gap is counterintuitive, given that continuous diffusion learns score functions that enable joint evolution across multiple positions. To understand this gap, we introduce token identifiability as an analytical framework for understanding how Gaussian noise corrupts discrete data through two mechanisms: discrete identity corruption and continuous rank degradation. We reveal that these mechanisms scale differently with vocabulary size, creating a temporal dissonance: at noise levels where discrete corruption preserves enough structure for conditional learning, continuous denoising is trivial; at noise levels where continuous denoising is meaningful, discrete corruption destroys nearly all conditional structure. To solve this, we propose CANDI (Continuous ANd DIscrete diffusion), a hybrid framework that decouples discrete and continuous corruption, enabling simultaneous learning of both conditional structure and continuous geometry. We empirically validate the temporal dissonance phenomenon and demonstrate that CANDI successfully avoids it. This unlocks the benefits of continuous diffusion for discrete spaces: on controlled generation, CANDI enables classifier-based guidance with off-the-shelf classifiers through simple gradient addition; on text generation, CANDI outperforms masked diffusion at low NFE, demonstrating the value of learning continuous gradients for discrete spaces. We include the code on the project page available here: https://patrickpynadath1.github.io/candi-lander
♻ ☆ What Causes Postoperative Aspiration?
Background: Aspiration, the inhalation of foreign material into the lungs, significantly impacts surgical patient morbidity and mortality. This study develops a machine learning (ML) model to predict postoperative aspiration, enabling timely preventative interventions. Methods: From the MIMIC-IV database of over 400,000 hospital admissions, we identified 826 surgical patients (mean age: 62, 55.7\% male) who experienced aspiration within seven days post-surgery, along with a matched non-aspiration cohort. Three ML models: XGBoost, Multilayer Perceptron, and Random Forest were trained using pre-surgical hospitalization data to predict postoperative aspiration. To investigate causation, we estimated Average Treatment Effects (ATE) using Augmented Inverse Probability Weighting. Results: Our ML model achieved an AUROC of 0.86 and 77.3\% sensitivity on a held-out test set. Maximum daily opioid dose, length of stay, and patient age emerged as the most important predictors. ATE analysis identified significant causative factors: opioids (0.25 +/- 0.06) and operative site (neck: 0.20 +/- 0.13, head: 0.19 +/- 0.13). Despite equal surgery rates across genders, men were 1.5 times more likely to aspirate and received 27\% higher maximum daily opioid dosages compared to women. Conclusion: ML models can effectively predict postoperative aspiration risk, enabling targeted preventative measures. Maximum daily opioid dosage and operative site significantly influence aspiration risk. The gender disparity in both opioid administration and aspiration rates warrants further investigation. These findings have important implications for improving postoperative care protocols and aspiration prevention strategies.
♻ ☆ Non-Markovian Discrete Diffusion with Causal Language Models NeurIPS 2025
Discrete diffusion models offer a flexible, controllable approach to structured sequence generation, yet they still lag behind causal language models in expressive power. A key limitation lies in their reliance on the Markovian assumption, which restricts each step to condition only on the current state, leading to potential uncorrectable error accumulation. In this paper, we introduce CaDDi (Causal Discrete Diffusion Model), a discrete diffusion model that conditions on the entire generative trajectory, thereby lifting the Markov constraint and allowing the model to revisit and improve past states. By unifying sequential (causal) and temporal (diffusion) reasoning in a single non-Markovian transformer, CaDDi also treats standard causal language models as a special case and permits the direct reuse of pretrained LLM weights with no architectural changes. Empirically, CaDDi outperforms state-of-the-art discrete diffusion baselines on natural-language benchmarks, substantially narrowing the remaining gap to large autoregressive transformers.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
♻ ☆ How Many Ratings per Item are Necessary for Reliable Significance Testing?
A cornerstone of machine learning evaluation is the (often hidden) assumption that model and human responses are reliable enough to evaluate models against unitary, authoritative, ``gold standard'' data, via simple metrics such as accuracy, precision, and recall. The generative AI revolution would seem to explode this assumption, given the critical role stochastic inference plays. Yet, in spite of public demand for more transparency in AI -- along with strong evidence that humans are unreliable judges -- estimates of model reliability are conventionally based on, at most, a few output responses per input item. We adapt a method, previously used to evaluate the reliability of various metrics and estimators for machine learning evaluation, to determine whether an (existing or planned) dataset has enough responses per item to assure reliable null hypothesis statistical testing. We show that, for many common metrics, collecting even 5-10 responses per item (from each model and team of human evaluators) is not sufficient. We apply our methods to several of the very few extant gold standard test sets with multiple disaggregated responses per item and show that even these datasets lack enough responses per item. We show how our methods can help AI researchers make better decisions about how to collect data for AI evaluation.
♻ ☆ TowerVision: Understanding and Improving Multilinguality in Vision-Language Models
Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image tasks) and ViMUL-Bench (video tasks). Alongside the models, we release VisionBlocks, a high-quality, curated vision-language dataset. Our findings highlight that multilingual vision-language training data substantially improves cross-lingual generalization -- both from high-resource to underrepresented languages and vice versa -- and that instruction-tuned LLMs are not always the optimal initialization point. To support further research, we publicly release all models, data, and training recipes.
comment: 15 pages, 7 figures, submitted to arXiv October 2025. All models, datasets, and training code will be released at https://huggingface.co/collections/utter-project/towervision
♻ ☆ Efficient Adaptive Experimentation with Noncompliance NeurIPS 2025
We study the problem of estimating the average treatment effect (ATE) in adaptive experiments where treatment can only be encouraged -- rather than directly assigned -- via a binary instrumental variable. Building on semiparametric efficiency theory, we derive the efficiency bound for ATE estimation under arbitrary, history-dependent instrument-assignment policies, and show it is minimized by a variance-aware allocation rule that balances outcome noise and compliance variability. Leveraging this insight, we introduce AMRIV -- an Adaptive, Multiply-Robust estimator for Instrumental-Variable settings with variance-optimal assignment. AMRIV pairs (i) an online policy that adaptively approximates the optimal allocation with (ii) a sequential, influence-function-based estimator that attains the semiparametric efficiency bound while retaining multiply-robust consistency. We establish asymptotic normality, explicit convergence rates, and anytime-valid asymptotic confidence sequences that enable sequential inference. Finally, we demonstrate the practical effectiveness of our approach through empirical studies, showing that adaptive instrument assignment, when combined with the AMRIV estimator, yields improved efficiency and robustness compared to existing baselines.
comment: 37 pages, 4 figures, 2 tables, NeurIPS 2025
♻ ☆ VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
Immediate damage assessment is essential after natural catastrophes; yet, conventional hand evaluation techniques are sluggish and perilous. Although satellite and unmanned aerial vehicle (UAV) photos offer extensive perspectives of impacted regions, current computer vision methodologies generally yield just classification labels or segmentation masks, so constraining their capacity to deliver a thorough situational comprehension. We introduce the Vision Language Caption Enhancer (VLCE), a multimodal system designed to produce comprehensive, contextually-informed explanations of disaster imagery. VLCE employs a dual-architecture approach: a CNN-LSTM model with a ResNet50 backbone pretrained on EuroSat satellite imagery for the xBD dataset, and a Vision Transformer (ViT) model pretrained on UAV pictures for the RescueNet dataset. Both systems utilize external semantic knowledge from ConceptNet and WordNet to expand vocabulary coverage and improve description accuracy. We assess VLCE in comparison to leading vision-language models (LLaVA and QwenVL) utilizing CLIPScore for semantic alignment and InfoMetIC for caption informativeness. Experimental findings indicate that VLCE markedly surpasses baseline models, attaining a maximum of 95.33% on InfoMetIC while preserving competitive semantic alignment. Our dual-architecture system demonstrates significant potential for improving disaster damage assessment by automating the production of actionable, information-dense descriptions from satellite and drone photos.
comment: 29 pages, 40 figures, 3 algorithms
♻ ☆ Privacy-Preserving Personalization in Education: A Federated Recommender System for Student Performance Prediction
The increasing digitalization of education presents unprecedented opportunities for data-driven personalization, but it also introduces significant challenges to student data privacy. Conventional recommender systems rely on centralized data, a paradigm often incompatible with modern data protection regulations. A novel privacy-preserving recommender system is proposed and evaluated to address this critical issue using Federated Learning (FL). The approach utilizes a Deep Neural Network (DNN) with rich, engineered features from the large-scale ASSISTments educational dataset. A rigorous comparative analysis of federated aggregation strategies was conducted, identifying FedProx as a significantly more stable and effective method for handling heterogeneous student data than the standard FedAvg baseline. The optimized federated model achieves a high-performance F1-Score of 76.28%, corresponding to 92% of the performance of a powerful, centralized XGBoost model. These findings validate that a federated approach can provide highly effective content recommendations without centralizing sensitive student data. Consequently, our work presents a viable and robust solution to the personalization-privacy dilemma in modern educational platforms.
comment: This paper has been prepared to be submitted to the Informatics in Education
♻ ☆ InsurTech innovation using natural language processing
With the rapid rise of InsurTech, traditional insurance companies are increasingly exploring alternative data sources and advanced technologies to sustain their competitive edge. This paper provides both a conceptual overview and practical case studies of natural language processing (NLP) and its emerging applications within insurance operations, focusing on transforming raw, unstructured text into structured data suitable for actuarial analysis and decision-making. Leveraging real-world alternative data provided by an InsurTech industry partner that enriches traditional insurance data sources, we apply various NLP techniques to demonstrate feature de-biasing, feature compression, and industry classification in the commercial insurance context. These enriched, text-derived insights not only add to and refine traditional rating factors for commercial insurance pricing but also offer novel perspectives for assessing underlying risk by introducing novel industry classification techniques. Through these demonstrations, we show that NLP is not merely a supplementary tool but a foundational element of modern, data-driven insurance analytics.
♻ ☆ Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models
Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during continual pretraining without test-time retrieval, by revising the training process. To study this, we construct CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents and probes both short-form (single-fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to index factual knowledge by binding it to persistent document identifiers; and (2) instruction tuning to elicit citation behavior. We introduce Active Indexing for the first stage, which creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restate each fact in diverse, compositional forms and (ii) enforce bidirectional training (source-to-fact and fact-to-source). This equips the model to both generate content from a cited source and attribute its own answers, improving robustness to paraphrase and composition. Experiments with Qwen-2.5-7B&3B show that Active Indexing consistently outperforms a Passive Indexing baseline, which simply appends an identifier to each document, achieving citation precision gains of up to 30.2% across all tasks and models. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16x the original token count. Finally, we show that internal citations complement external ones by making the model more robust to retrieval noise.
♻ ☆ Graph Mixing Additive Networks
We introduce GMAN, a flexible, interpretable, and expressive framework that extends Graph Neural Additive Networks (GNANs) to learn from sets of sparse time-series data. GMAN represents each time-dependent trajectory as a directed graph and applies an enriched, more expressive GNAN to each graph. It allows users to control the interpretability-expressivity trade-off by grouping features and graphs to encode priors, and it provides feature, node, and graph-level interpretability. On real-world datasets, including mortality prediction from blood tests and fake-news detection, GMAN outperforms strong non-interpretable black-box baselines while delivering actionable, domain-aligned explanations.
comment: arXiv admin note: substantial text overlap with arXiv:2505.19193
♻ ☆ The Neural Pruning Law Hypothesis
Network pruning is used to reduce inference latency and power consumption in large neural networks. However, most current pruning methods rely on ad-hoc heuristics that are poorly understood. We introduce Hyperflux, a conceptually-grounded pruning method, and use it to study the pruning process. Hyperflux models this process as an interaction between weight flux, the gradient's response to the weight's removal, and network pressure, a global regularization driving weights towards pruning. We postulate properties that arise naturally from our framework and find that the relationship between minimum flux among weights and density follows a power-law equation. Furthermore, we hypothesize the power-law relationship to hold for any effective saliency metric and call this idea the Neural Pruning Law Hypothesis. We validate our hypothesis on several families of pruning methods (magnitude, gradients, $L_0$), providing a potentially unifying property for neural pruning.
♻ ☆ Data Fusion of Deep Learned Molecular Embeddings for Property Prediction
Data-driven approaches such as deep learning can result in predictive models for material properties with exceptional accuracy and efficiency. However, in many applications, data is sparse, severely limiting their accuracy and applicability. To improve predictions, techniques such as transfer learning and multitask learning have been used. The performance of multitask learning models depends on the strength of the underlying correlations between tasks and the completeness of the data set. Standard multitask models tend to underperform when trained on sparse data sets with weakly correlated properties. To address this gap, we fuse deep-learned embeddings generated by independent pretrained single-task models, resulting in a multitask model that inherits rich, property-specific representations. By reusing (rather than retraining) these embeddings, the resulting fused model outperforms standard multitask models and can be extended with fewer trainable parameters. We demonstrate this technique on a widely used benchmark data set of quantum chemistry data for small molecules as well as a newly compiled sparse data set of experimental data collected from literature and our own quantum chemistry and thermochemical calculations.
comment: J. Chem. Inf. Model. 2025
Graphics 2
☆ Fast and accurate neural reflectance transformation imaging through knowledge distillation
Reflectance Transformation Imaging (RTI) is very popular for its ability to visually analyze surfaces by enhancing surface details through interactive relighting, starting from only a few tens of photographs taken with a fixed camera and variable illumination. Traditional methods like Polynomial Texture Maps (PTM) and Hemispherical Harmonics (HSH) are compact and fast, but struggle to accurately capture complex reflectance fields using few per-pixel coefficients and fixed bases, leading to artifacts, especially in highly reflective or shadowed areas. The NeuralRTI approach, which exploits a neural autoencoder to learn a compact function that better approximates the local reflectance as a function of light directions, has been shown to produce superior quality at comparable storage cost. However, as it performs interactive relighting with custom decoder networks with many parameters, the rendering step is computationally expensive and not feasible at full resolution for large images on limited hardware. Earlier attempts to reduce costs by directly training smaller networks have failed to produce valid results. For this reason, we propose to reduce its computational cost through a novel solution based on Knowledge Distillation (DisK-NeuralRTI). ...
comment: 18 pages
♻ ☆ Procedural Generation of Articulated Simulation-Ready Assets
We introduce Infinigen-Articulated, a toolkit for generating realistic, procedurally generated articulated assets for robotics simulation. We include procedural generators for 18 common articulated object categories along with high-level utilities for use creating custom articulated assets in Blender. We also provide an export pipeline to integrate the resulting assets along with their physical properties into common robotics simulators. Experiments demonstrate that assets sampled from these generators are effective for movable object segmentation, training generalizable reinforcement learning policies, and sim-to-real transfer of imitation learning policies.
comment: Updated to include information on newly implemented assets, new experimental results (both simulation and real world), and additional features including material and dynamics parameters