LIBERO-Plus:
In-depth Robustness Analysis of Vision-Language-Action Models

1Fudan University, 2Tongji University, 3Shanghai Innovation Institute, 4National University of Singapore

We introduce LIBERO-Plus, an in-depth robustness analysis of Vision-Language-Action models. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: (i) object layout, (ii) camera viewpoints, (iii) robot initial states, (iv) language instructions, (v) lighting conditions, (vi) background textures, and (vii) sensor noise. Our findings challenge the assumption that high original LIBERO scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.

Key Findings

Our systematic evaluation across seven perturbation dimensions reveals significant fragility in current VLA models. The table below summarizes model performance under different perturbations, where the first row for each model reports the task success rate (%) under each perturbation dimension (with "Original" indicating performance on unperturbed inputs), and the second row (denoted by ↓) shows the corresponding absolute performance drop. The results highlight substantial variations in robustness across models and perturbation types.

[Table: Model performance under each perturbation dimension — success rate (%) per model, with absolute drop (↓)]
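To make the table's construction concrete, here is a minimal Python sketch of how the two rows per model can be derived from raw episode outcomes. The `results` values are illustrative placeholders, not the paper's numbers.

```python
# Success rate and absolute drop per perturbation dimension.
# 1 = episode success, 0 = failure; values are illustrative only.

results = {
    "Original": [1, 1, 0, 1, 1, 1, 0, 1],
    "Camera":   [0, 1, 0, 0, 0, 1, 0, 0],
    "Language": [1, 1, 0, 1, 0, 1, 0, 1],
}

def success_rate(outcomes):
    """Task success rate in percent."""
    return 100.0 * sum(outcomes) / len(outcomes)

baseline = success_rate(results["Original"])
for dim, outcomes in results.items():
    rate = success_rate(outcomes)
    print(f"{dim:10s} {rate:5.1f}%  (drop: {baseline - rate:5.1f})")
```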

Finding 1: Language Instructions are Largely Ignored

Models show surprising insensitivity to language perturbations

Contrary to expectations, language perturbations produce the smallest average performance drop (25.3 points) across most models. This apparent robustness is counter-intuitive and merits deeper investigation.

[Figure: Language perturbation probes — (a) blank instruction, (b) goal replacement]

Blank Instruction Test (a)

Surprisingly, even without any valid language input, the performance of some models remains largely unchanged. In practice, they degenerate into a form that disregards language, behaving more like Vision-Action (VA) models.

Goal Replacement Test (b)

When the target object in an instruction was replaced with an alternative, models continued to execute the original task, so success rates on the modified instructions dropped to nearly zero. Both probes are summarized in the sketch below.
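The following is a hedged sketch of the two probes, not the paper's actual evaluation harness: `env` and `policy` are hypothetical stand-ins, where `env.step` is assumed to return `(obs, done, success)` and `policy.act` to map an observation plus an instruction string to an action.

```python
def rollout(env, policy, instruction, max_steps=300):
    """Run one episode under a (possibly perturbed) instruction."""
    obs = env.reset()
    success = False
    for _ in range(max_steps):
        action = policy.act(obs, instruction)
        obs, done, success = env.step(action)
        if done:
            break
    return success

def language_probes(env, policy, instruction, target, substitute):
    # (a) Blank instruction: unchanged success suggests language is ignored.
    blank_ok = rollout(env, policy, "")
    # (b) Goal replacement: if the policy still manipulates the original
    # target, it is following memorized vision-action mappings.
    replaced_ok = rollout(env, policy, instruction.replace(target, substitute))
    return blank_ok, replaced_ok

# Example call (hypothetical task):
# language_probes(env, policy,
#                 "put the cream cheese in the bowl", "cream cheese", "ketchup")
```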

Key Insight: VLA models do not possess strong cross-object instruction-following generalization. They appear to rely more on fixed vision–action mappings than on fully exploiting language signals in task decision-making.

Finding 2: Models are Surprisingly Robust to Background and Lighting Changes

But the reasons are not as promising as they might seem

We observed that models exhibit surprising resilience to background changes and limited sensitivity to light variations. This raised important questions about what representations the models are actually learning.

[Figure: Model performance under background and lighting perturbations]

Object Attention Analysis

Models demonstrate an ability to ignore distracting objects, but fail to generalize when target objects are displaced. This indicates they rely on memorized positional cues rather than learning invariant object semantics.

[Figure: Object attention analysis — distractor objects vs. displaced targets]

Illumination Robustness

The performance drop under light perturbations is limited because illumination changes primarily affect the third-person view and global appearance, whereas the wrist view remains relatively stable and continues to provide critical close-range geometric cues.

Conclusion: The relative stability under background and lighting changes is largely attributable to the wrist camera's close-range perspective rather than sophisticated visual understanding.

Finding 3: Extreme Sensitivity to Camera Viewpoints and Robot Initial States

Models fail dramatically with minor changes in viewpoint or initial configuration

Models are most vulnerable to changes in camera viewpoint and robot initial state, the two dimensions that demand a high-level understanding of spatial geometry and proprioception.

Camera Viewpoint Changes

Altering camera position, orientation, or field-of-view causes dramatic performance drops, revealing models' dependence on fixed visual perspectives rather than true 3D understanding.

Robot Initial State Variations

Changing the manipulator's initial pose significantly impacts success rates, indicating limited generalization across different configurations and a lack of deep kinematic understanding.
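For concreteness, below is a minimal sketch of how such viewpoint and initial-state perturbations could be sampled. The jitter ranges and dictionary fields are illustrative assumptions, not the exact parameterization used in LIBERO-Plus.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_camera_perturbation():
    """Jitter the third-person camera's extrinsics and field of view."""
    return {
        "position_offset_m": rng.uniform(-0.10, 0.10, size=3),    # x, y, z
        "rotation_offset_deg": rng.uniform(-15.0, 15.0, size=3),  # roll, pitch, yaw
        "fov_offset_deg": float(rng.uniform(-5.0, 5.0)),
    }

def sample_initial_state_perturbation(num_joints=7):
    """Jitter the manipulator's starting joint configuration."""
    return {"joint_offset_rad": rng.uniform(-0.2, 0.2, size=num_joints)}

# A perturbed episode would apply both offsets before the environment reset.
print(sample_camera_perturbation())
print(sample_initial_state_perturbation())
```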

Conclusion: Current VLA models exhibit extreme sensitivity to perturbations in camera viewpoint and robot initial state, revealing fundamental limitations in their spatial reasoning capabilities.

Finding 4: Generalization Collapses Under Compositional Perturbations

Models fail catastrophically when multiple perturbations occur simultaneously

While models retain some robustness under single-dimension perturbations, real-world scenarios often involve multiple simultaneous perturbations. We introduce the Compositional Generalization Gap to quantitatively measure model performance under combined perturbations.

Below is the heatmap of conditional probabilities under pairwise perturbations. Upper triangular entries represent independence-based products of single-dimension probabilities, while lower triangular entries show actual joint outcomes.

[Figure: Heatmap of conditional success probabilities under pairwise perturbations — upper triangle: independence-based products; lower triangle: actual joint outcomes]

Statistical Definition

We define the Compositionality Gap as the conditional covariance between perturbation indicators given a successful outcome:

$$\Delta_{ij} = P(D_i=1, D_j=1 \mid Y=1) - P(D_i=1 \mid Y=1) \cdot P(D_j=1 \mid Y=1)$$

Where:

  • \(D_i, D_j\): Indicator variables for applying perturbations
  • \(Y\): Success indicator variable
  • \(\Delta_{ij} < 0\) indicates negative interaction between perturbations
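A minimal sketch of estimating \(\Delta_{ij}\) from rollout logs, assuming each episode is logged as a triple \((d_i, d_j, y)\); the records below are fabricated for illustration only.

```python
import numpy as np

# Each record is (d_i, d_j, y): whether perturbations i and j were applied
# and whether the episode succeeded. Values are illustrative placeholders.
records = np.array([
    [1, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 1],
    [1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1],
])
d_i, d_j, y = records[:, 0], records[:, 1], records[:, 2]
success = y == 1

p_joint = np.mean(d_i[success] & d_j[success])  # P(D_i=1, D_j=1 | Y=1)
p_i = np.mean(d_i[success])                     # P(D_i=1 | Y=1)
p_j = np.mean(d_j[success])                     # P(D_j=1 | Y=1)

gap = p_joint - p_i * p_j
print(f"Delta_ij = {gap:+.3f}  (negative => destructive interaction)")
```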

Negative Interaction Effects

Our experiments revealed consistent negative compositionality gaps, showing that:

  • Co-occurring perturbations act as coupled noise sources
  • Performance degradation is multiplicative rather than additive
  • Models lack mechanisms to capture higher-order dependencies

Conclusion: Current VLA models lack compositional generalization capabilities. Their learned representations are entangled and cannot handle the complex, multi-dimensional perturbations that characterize real-world environments.

Benchmark Leaderboard

Building on our in-depth robustness analysis, we introduce LIBERO-Plus, a comprehensive benchmark designed to establish a rigorous leaderboard for evaluating generalization capabilities across the key vulnerability dimensions identified in our study. The benchmark construction follows a systematic two-stage process: (1) expanding the original LIBERO benchmark through seven distinct perturbation factors, followed by task filtering and category balancing based on our empirical findings; and (2) evaluating the resulting tasks using four representative models and stratifying them into five difficulty levels (Level-1 to Level-5) according to observed accuracy distributions. This structured approach enables meaningful cross-model comparisons and establishes a standardized leaderboard for tracking progress in VLA robustness.
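The stratification step can be illustrated with a short sketch that bins each task by its mean accuracy across the four reference models. The cutoff values below are assumptions for illustration; the paper derives its levels from the observed accuracy distributions.

```python
def difficulty_level(mean_accuracy):
    """Map a task's mean accuracy (%) across the reference models to a
    level: Level-1 (easiest) through Level-5 (hardest)."""
    for level, cutoff in enumerate([80, 60, 40, 20], start=1):
        if mean_accuracy >= cutoff:
            return level
    return 5

# Hypothetical per-task accuracies averaged over the four models.
tasks = {"task_a": 91.0, "task_b": 55.5, "task_c": 12.0}
print({name: difficulty_level(acc) for name, acc in tasks.items()})
# -> {'task_a': 1, 'task_b': 3, 'task_c': 5}
```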

[Figure: LIBERO-Plus benchmark construction and difficulty stratification]

The figure below illustrates model performance across difficulty levels under four representative perturbation factors, providing insights into generalization capabilities under controlled distribution shifts.

[Figure: Model performance across difficulty levels under four representative perturbation factors]

We conducted a comprehensive review of existing studies evaluating generalization performance in VLA models, with particular focus on recent test suites. The table below provides a systematic comparison of these evaluation methodologies, highlighting their coverage across different perturbation dimensions and methodological approaches.

[Table: Comparison of existing VLA generalization evaluation suites across perturbation dimensions and methodologies]

Finding 5: Training Data Diversity Significantly Improves Robustness

Systematic exposure to varied conditions enhances generalization

Alongside the 10,030-task LIBERO-Plus benchmark spanning seven perturbation dimensions, we created a diverse training dataset with over 20,000 successful trajectories collected under systematically varied conditions that differ substantially from the evaluation scenarios.

[Figure: Performance across perturbation dimensions after training on the diversified dataset]

Notable Camera Robustness Improvement

Our method achieved a 92.8% success rate under camera perturbations, surpassing the next best model by 37.2 percentage points.

Broad Performance Gains

Significant improvements were also observed under noise (89.3%) and layout (77.6%) perturbations, demonstrating that training with varied data enhances robustness to a wide range of environmental variations.

Conclusion: Training strategies that emphasize diversity and exposure to varied data distributions consistently yield more robust models across multiple perturbation types.

Failure Case Study


Sample Rollout Videos

The following videos showcase various failure cases, illustrating how perturbations along these seven dimensions affect model performance.

  • Camera Viewpoints Change
  • Object Layouts Change
  • Robot Initial States Change
  • Light Conditions Change
  • Language Instructions Change
  • Sensor Noise
  • Background Textures Change


Overall Conclusion

Our findings challenge the assumption that high original LIBERO benchmark scores equate to true competency. Current VLA models remain brittle: they are particularly vulnerable to camera and robot state changes, largely ignore language instructions, and exhibit positional bias rather than genuine semantic understanding.

We call upon the community to prioritize true diversity in evaluation practices and develop architectures capable of robust generalization beyond limited benchmark environments.

BibTeX

@article{fei25libero-plus,
    title={LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models},
    author={Senyu Fei and Siyin Wang and Junhao Shi and Zihao Dai and Jikun Cai and Pengfang Qian and Li Ji and Xinzhe He and Shiduo Zhang and Zhaoye Fei and Jinlan Fu and Jingjing Gong and Xipeng Qiu},
    journal={arXiv preprint arXiv:2510.13626},
    year={2025},
}