
Uncertainty-aware Observation Reinjection for Vision-Language-Action Models

arXiv 2026


TL;DR:

We propose UAOR, a training-free, plug-and-play module for VLA models. When the model exhibits high uncertainty (measured by Action Entropy), UAOR reinjects observation features into the next layer's FFN through attention retrieval, enabling more confident and faithful action generation.

Abstract

Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that the Feed-Forward Networks (FFNs) in language models can act as "key-value memories", we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free, and plug-and-play module for VLA models. Specifically, when the current language-model layer exhibits high uncertainty, as measured by Action Entropy, UAOR reinjects key observation information into the next layer's FFN through attention retrieval. This mechanism helps VLAs better attend to observations during inference, enabling more confident and faithful action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines.

Method

UAOR is a lightweight, training-free module designed to boost VLA models. It introduces two key components:

Uncertainty Measured by Action Entropy

We introduce Action Entropy, a VLA-specific metric that quantifies layer-wise uncertainty via the entropy of action-related output distributions. For each transformer layer, we project its FFN outputs through the LM head and compute the entropy of the resulting distributions at action/condition token positions. Higher Action Entropy indicates greater uncertainty, reflecting the model's gradual "forgetting" of observation information during forward inference.
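The computation above can be sketched in a few lines of numpy. This is an illustrative implementation under assumed shapes, not the authors' code: `hidden` stands for one layer's FFN output, `W_lm` for the LM-head projection, and `action_mask` for the positions of action/condition tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def action_entropy(hidden, W_lm, action_mask):
    """Layer-wise Action Entropy (sketch; names are illustrative).

    hidden:      (T, d) FFN outputs of one transformer layer
    W_lm:        (d, V) LM-head projection to vocabulary logits
    action_mask: (T,) boolean mask selecting action/condition tokens
    """
    logits = hidden[action_mask] @ W_lm                     # (T_a, V)
    probs = softmax(logits)
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=-1)     # (T_a,) per-token entropy
    return float(ent.mean())                                # scalar uncertainty for this layer
```

The returned scalar is bounded by log V, so a fixed threshold γ can be compared against it directly across layers.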

Layer-wise Uncertainty

Observation Reinjection with FFN

When uncertainty at layer ℓ exceeds a threshold γ, we reinject observation features into the FFN of the subsequent layer (ℓ+1). Concretely, we treat the encoded observation features as a key-value memory. We use the hidden states entering the FFN at layer ℓ+1 as queries to attend over this memory via an activation-weighted retrieval. The retrieved features are then blended with the original FFN output using a blending ratio α. This allows the model to dynamically "re-attend" to the observation when confusion arises, without halting or backtracking the inference.
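A minimal sketch of this gated retrieval, assuming scaled dot-product attention over the observation memory; the function and argument names (`reinject`, `obs_feats`, `entropy_prev`) are hypothetical, and γ, α are free hyperparameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the memory axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reinject(ffn_out, hidden_in, obs_feats, entropy_prev, gamma=2.0, alpha=0.3):
    """Uncertainty-gated observation reinjection at layer l+1 (sketch).

    ffn_out:      (T, d) original FFN output at layer l+1
    hidden_in:    (T, d) hidden states entering that FFN, used as queries
    obs_feats:    (N, d) encoded observation features (key-value memory)
    entropy_prev: Action Entropy measured at layer l
    """
    if entropy_prev <= gamma:
        return ffn_out                                   # confident: layer runs unmodified
    d = hidden_in.shape[-1]
    attn = softmax(hidden_in @ obs_feats.T / np.sqrt(d)) # (T, N) retrieval weights
    retrieved = attn @ obs_feats                         # (T, d) re-attended observation
    return (1.0 - alpha) * ffn_out + alpha * retrieved   # blend with ratio alpha
```

Because the gate only fires when the previous layer's entropy exceeds γ, confident forward passes pay essentially no extra cost, which matches the "minimal overhead" claim.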

UAOR Architecture

Figure 2: Detailed architecture of UAOR

Simulation Experiments

LIBERO

We evaluate UAOR on the LIBERO benchmark, which provides 4 task suites—Spatial, Object, Goal, and Long—each containing 10 tasks with 50 human-teleoperated demonstrations per task. We apply UAOR to two representative baselines: OpenVLA-OFT (7B) and π0 (3B). UAOR delivers consistent gains across all four suites: with OpenVLA-OFT, it reaches an average success rate of 98.0% (+0.9), comparable to 3D-CAVLA (98.1%) but without auxiliary depth inputs, CoT reasoning, or fine-tuning. It also boosts π0 by +1.5 points on average.

LIBERO Results

Table 1: Performance comparison on the LIBERO benchmark.

LIBERO Task Demonstrations (OpenVLA-OFT w/ UAOR)

Spatial: Pick up the black bowl from table center and place it on the plate

Spatial: Pick up the black bowl on the wooden cabinet and place it on the plate

Object: Pick up the alphabet soup and place it in the basket

Object: Pick up the orange juice and place it in the basket

Goal: Open the middle drawer of the cabinet

Goal: Put the wine bottle on the rack

Long: Put both the alphabet soup and the tomato sauce in the basket

Long: Put the yellow and white mug in the microwave and close it

SIMPLER

We evaluate UAOR on the SIMPLER benchmark using CogACT (7B) as the baseline. UAOR raises the average success rate by +2.6 points (73.1 → 75.7; ~3.6% relative). The improvements are most evident on Pick coke can (+2.7), Open top drawer and place apple (+3.7), and Move near (+3.4), with a smaller gain on Open/Close drawer (+0.9). These tasks demand precise localization and placement under visual clutter.

SIMPLER Results

Table 2: Performance comparison on the SIMPLER benchmark.

CALVIN

We evaluate on the CALVIN ABC→D benchmark using LLaVA-VLA (0.5B). UAOR improves success on every track and increases the average consecutive completion length by +0.12 (3.55 → 3.67; ~3.4% relative). The consistent gains across progressively longer task chains indicate better maintenance of observation fidelity, leading to reduced uncertainty in downstream action prediction.

CALVIN Results

Table 3: Performance comparison on the CALVIN benchmark.

Real-World Experiments

We perform real-robot experiments with a Franka Research 3 robot arm equipped with a parallel-jaw gripper and a ZED 2i camera. We evaluate on four tasks: 1) Close the upper drawer, 2) Put the redbull on the plate, 3) Put the lion on the top shelf, and 4) Stand the coke can up. We fine-tune both OpenVLA-OFT and CogACT on each task using 50 expert trajectories and evaluate each task with 20 test rollouts.

For OpenVLA-OFT, UAOR achieves consistent performance improvements across all four tasks, with the average success rate increasing from 55.0% to 72.5% (+31.8% relative). The largest relative gain appears on the most challenging task, Stand the coke can up (+44.4% relative). For CogACT, UAOR boosts the average success rate from 63.8% to 78.8% (+23.5% relative). Notably, in the Put the redbull on the plate task, UAOR increases the success rate by an absolute 20%.

Real-World Results

Figure 3: Real-world evaluation results on both OpenVLA-OFT and CogACT.

Real-World Task Demonstrations (OpenVLA-OFT w/ UAOR)

Close the upper drawer

Put the redbull on the plate

Put the lion on the top shelf

Stand the coke can up

Citation


@article{yang2026uaor,
  title={UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models},
  author={Yang, Jiabing and Chen, Yixiang and Xu, Yuan and Li, Peiyan and Wu, Xiangnan and Wen, Zichen and Fang, Bowen and Yu, Tao and Zhang, Zhengbo and Li, Yingda and others},
  journal={arXiv preprint arXiv:2602.18020},
  year={2026}
}