VL-SAFE: Vision-Language Guided Safety-Aware Reinforcement Learning with World Models for Autonomous Driving

Purdue University – West Lafayette
University of Wisconsin – Madison

Indicates Equal Contribution

*Corresponding Author

Abstract

Reinforcement learning (RL)-based autonomous driving policy learning faces critical limitations such as low sample efficiency and poor generalization; its reliance on online interactions and trial-and-error learning is especially unacceptable in safety-critical scenarios. Existing methods, such as safe RL, often fail to capture the true semantic meaning of "safety" in complex driving contexts, leading to either overly conservative behavior or constraint violations. To address these challenges, we propose VL-SAFE, a world model-based safe RL framework with a vision-language model (VLM)-as-safety-guidance paradigm, designed for offline safe policy learning. Specifically, we construct offline datasets collected by expert agents and label them with safety scores derived from VLMs. A world model is trained to generate imagined rollouts together with safety estimations, allowing the agent to perform safe planning without interacting with the real environment. Based on these imagined trajectories and safety evaluations, actor-critic learning is conducted under VLM-based safety guidance to optimize the driving policy more safely and efficiently. Extensive evaluations demonstrate that VL-SAFE achieves superior sample efficiency, generalization, safety, and overall performance compared to existing baselines. To the best of our knowledge, this is the first work that introduces a VLM-guided, world model-based approach for safe autonomous driving. The demo video and code can be accessed at: https://ys-qu.github.io/vlsafe-website/

Comparisons between our proposed method and related work.



(a) Offline RL removes the need for online exploration entirely by learning from pre-collected driving data, thereby avoiding the safety concerns of real-time trial-and-error; however, it still fails occasionally in online testing because no safety constraints are enforced.

(b) World models reduce real-environment interactions by enabling policy learning through imagined rollouts, which largely reduces risky actions and improves sample efficiency; however, world model-based agents can still take risky actions in the initial training stage.

(c) Safe RL methods explicitly incorporate cost constraints or risk-sensitive objectives into the learning process to restrict unsafe behaviors and ensure long-term safety; however, safe RL may learn an overly conservative policy and fail to generalize due to the lack of semantic understanding of the whole driving context.

(d) It is intuitive to combine these methods so that they complement one another's strengths and weaknesses. More importantly, at the core of these challenges lies a fundamental question: how can we identify risky states, semantically understand "safety", and guide policy learning accordingly?

A real-life example illustrating the effectiveness of the VLM-as-safety-guidance paradigm.



We motivate our approach with an intuitive real-life example: imagine an autonomous vehicle approaching a stopped school bus on a multi-lane road, where several children are walking across the street toward the sidewalk. The school bus has its STOP sign extended, indicating that all surrounding vehicles must come to a complete stop.

A classical RL-based or offline-trained agent may continue driving or only slightly slow down, as no collision has occurred and the reward function still favors progress. A world model-based agent may also ignore the school bus and its STOP sign during imagined rollouts, failing to distinguish the risk embedded in this visual context. A traditional safe RL policy may behave unpredictably: it may stop in some cases, but often fails to recognize the raised STOP sign and the presence of children, especially if such visual cues were not explicitly encoded in the cost function or training data. This can lead to unsafe decisions or over-generalized conservativeness, such as treating harmless objects like parked vehicles or cones as equivalent threats.

In contrast, a VLM-guided policy can semantically understand the scene: the raised STOP sign on the school bus and the crossing children clearly indicate a high-risk situation. Even in the absence of a collision, the VLM assigns a low safety score based on visual semantics, guiding the agent to stop proactively and yield to the pedestrians. This enables the agent to maintain appropriate caution in dangerous scenarios without falling into the trap of indiscriminate conservativeness.

This example highlights the advantage of semantic safety reasoning: it enables agents to make early, context-aware, and proportionate decisions, ultimately allowing them to act safely across varied scenarios.

Overall framework of VL-SAFE.



The proposed method VL-SAFE has two phases: the first phase generates ground-truth safety estimations with CLIP for each state in the offline dataset collected by an expert agent, and the second phase learns a safety-aware world model that generates imagined rollouts along with predicted safety estimations for actor-critic learning.
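To make the first phase concrete, below is a minimal sketch of how per-state safety scores can be derived from CLIP by contrasting a "safe" text prompt with an "unsafe" one. The checkpoint name and prompt wording are illustrative assumptions, not necessarily the exact prompts used in VL-SAFE.

```python
# Sketch of Phase 1: labeling offline states with CLIP-derived safety scores.
# The prompt pair below is a hypothetical choice for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

PROMPTS = [
    "a safe driving scene with no nearby hazards",       # safe prompt
    "a dangerous driving scene with an imminent hazard"  # unsafe prompt
]

@torch.no_grad()
def safety_score(frame: Image.Image) -> float:
    """Return p(safe) in [0, 1] by contrasting the two text prompts."""
    inputs = processor(text=PROMPTS, images=frame,
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # shape (1, 2)
    probs = logits.softmax(dim=-1)             # softmax over the two prompts
    return probs[0, 0].item()                  # mass on the safe prompt

# Each state in the expert-collected dataset is then tagged, e.g.:
# sample["safety"] = safety_score(sample["camera_frame"])
```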

In our approach, we assign a weight to each state-action pair in the imagined rollouts. The weight combines both reward and cost advantages and is modulated by a VLM-derived safety probability, which dynamically balances the influence of reward-seeking and cost-avoidance. The intuition is that the agent should prioritize reward maximization in safe conditions and focus on cost minimization when encountering potentially risky situations, where the timing and degree of this tradeoff are guided by the semantic understanding provided by the VLM.
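A minimal sketch of this safety-modulated weighting is shown below, assuming an advantage-weighted (AWR-style) actor update; the convex blend of the two advantages and the exponential mapping are illustrative assumptions rather than the exact form used in VL-SAFE.

```python
# Sketch: per-(state, action) weights for actor updates on imagined rollouts.
import torch

def policy_weights(adv_reward: torch.Tensor,
                   adv_cost: torch.Tensor,
                   p_safe: torch.Tensor,
                   temperature: float = 1.0) -> torch.Tensor:
    """adv_reward: reward advantage A_r(s, a) from the reward critic
    adv_cost:   cost advantage   A_c(s, a) from the cost critic
    p_safe:     VLM-derived safety probability in [0, 1] per state
    """
    # In safe states (p_safe -> 1) the reward advantage dominates; in risky
    # states (p_safe -> 0) the weight instead penalizes high-cost actions.
    blended = p_safe * adv_reward - (1.0 - p_safe) * adv_cost
    # Exponentiate and clamp, as in advantage-weighted actor-critic methods.
    return torch.clamp(torch.exp(blended / temperature), max=100.0)

# Hypothetical usage in the actor loss:
# actor_loss = -(policy_weights(A_r, A_c, p_safe) * log_pi).mean()
```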

Driving scenarios and maps.



All tasks in the CarDreamer simulation platform are conducted on the Town 03 and Town 04 maps in CARLA, which provide realistic urban and suburban layouts with diverse traffic elements. As shown in the figure above, the scenarios cover various road geometries and traffic situations essential for testing the generalization and robustness of autonomous policies.

Comparison results.



Compared to traditional methods, our framework yields superior performance in terms of driving safety, sample efficiency, and generalization.

Demonstrations


Four Lane

Left Turn

Right Turn

Navigation

This section showcases our agent’s ability to handle multiple driving tasks.

Video Presentation

Poster