Leveraging Vision-Language Models for Efficient Understanding of Vulnerable Roadway Users via a Multimodal Traffic Sensing Approach

Term Start:

June 1, 2025

Term End:

May 31, 2026

Budget:

$180,000

Keywords:

Behavioral Detection, Traffic Monitoring, Vulnerable Road Users

Thrust Area(s):

Data Modeling and Analytic Tools, Understanding User Needs

University Lead:

City College of New York

Researcher(s):

Yiqiao Li; Jie Wei; Camille Kamga

The proliferation of 3D and video data from urban intersections offers a unique opportunity to analyze and protect vulnerable road users (VRUs). However, the effectiveness of modern detection models such as PointPillars or CenterPoint is limited by the availability of high-quality labeled data. In Year 2, we demonstrated the feasibility of multimodal sensing using LiDAR and cameras. In Year 3, we propose a strategic shift toward an adaptive, self-learning traffic monitoring framework that leverages multimodal large language models (MLLMs) as both annotators and detectors, minimizing the need for labor-intensive data annotation while enabling self-improving perception. This approach aims to improve the efficiency of VRU data collection and to generate high-quality data that supports a deeper understanding of micro-level VRU travel behavior.

This hybrid strategy includes two synergistic phases:

Phase 1 – MLLM-Assisted Annotation: We will explore the use of advanced Vision-Language Models (VLMs) such as Gemini and CLIP2Point to reduce manual annotation costs. Building on recent breakthroughs in Chain-of-Thought (CoT) and few-shot prompting, we will design in-context learning pipelines capable of generating high-fidelity annotations from minimal labeled examples. These pipelines will be applied to both LiDAR projections and camera footage, producing detailed annotations that capture traffic-related objects (e.g., pedestrians, cyclists, buses, trucks), with a specific focus on VRUs as well as their behaviors (e.g., jaywalking, crossing against the signal) and interaction events.

Key tasks are: (i) Establish a rendering and preprocessing pipeline for point cloud and image integration; (ii) Design prompt structures and visual-question-answering tasks to guide the MLLM’s annotation; (iii) Develop a human-in-the-loop annotation tool to iteratively validate and improve the results with minimal human effort.
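
As an illustration of the intended annotation pipeline, the Python sketch below assembles a few-shot, chain-of-thought prompt that asks a VLM to return structured VRU annotations for a rendered scene. The label schema, the example annotation, and the query_vlm placeholder are illustrative assumptions, not the project's final interface.

```python
import json

# Illustrative label schema; the project's actual taxonomy may differ.
OBJECT_CLASSES = ["pedestrian", "cyclist", "car", "bus", "truck"]
BEHAVIORS = ["jaywalking", "crossing_against_signal", "waiting_at_curb"]

# A few manually labeled scenes serve as in-context (few-shot) examples.
FEW_SHOT_EXAMPLES = [
    {
        "scene": "Camera frame: two people step off the curb mid-block while traffic flows.",
        "annotation": {
            "objects": [{"class": "pedestrian", "count": 2}],
            "behaviors": ["jaywalking"],
            "interactions": ["pedestrian-vehicle conflict, low severity"],
        },
    },
]


def build_annotation_prompt(scene_description: str) -> str:
    """Compose a chain-of-thought, few-shot prompt asking the VLM to emit
    structured VRU annotations as JSON."""
    lines = [
        "You annotate urban intersection scenes for vulnerable road users (VRUs).",
        f"Allowed object classes: {', '.join(OBJECT_CLASSES)}.",
        f"Allowed behaviors: {', '.join(BEHAVIORS)}.",
        "Reason step by step about what is visible, then output JSON with keys",
        "'objects', 'behaviors', and 'interactions'.",
        "",
    ]
    for ex in FEW_SHOT_EXAMPLES:
        lines.append(f"Example scene: {ex['scene']}")
        lines.append(f"Example annotation: {json.dumps(ex['annotation'])}")
        lines.append("")
    lines.append(f"Scene to annotate: {scene_description}")
    return "\n".join(lines)


def query_vlm(prompt: str, image_path: str) -> str:
    """Placeholder for a call to a hosted VLM endpoint (e.g., Gemini); the real
    pipeline would attach the rendered LiDAR projection or camera frame."""
    raise NotImplementedError


if __name__ == "__main__":
    print(build_annotation_prompt("Bird's-eye LiDAR projection, rush hour, crosswalk occupied."))
```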

Phase 2 – MLLM-Based Detection and Self-Improvement: Leveraging the annotated data generated in Phase 1, we will implement few-shot learning, retrieval-augmented generation (RAG), Low-Rank Adaptation (LoRA)-based fine-tuning, and iterative prompting to adapt MLLMs into agentic object detectors and scene interpreters. This phase transitions MLLMs from passive annotators to active detection agents, bypassing traditional labor-intensive supervised training processes.
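
The snippet below sketches how a LoRA adapter could be attached to a frozen base model with the Hugging Face peft library. The base checkpoint (GPT-2 as a lightweight stand-in), rank, and target modules are assumptions for illustration; the project would instead adapt a multimodal checkpoint and train it on the Phase 1 annotations.

```python
# Minimal LoRA fine-tuning sketch using Hugging Face `peft` and `transformers`.
# GPT-2 is only a stand-in so the example stays small and runnable.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                         # low-rank update dimension
    lora_alpha=16,               # scaling factor for the LoRA updates
    lora_dropout=0.05,
    target_modules=["c_attn"],   # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)

# Only the small adapter matrices are trainable; the base weights stay frozen,
# which keeps fine-tuning far cheaper than full supervised training.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```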

Key tasks are: (i) Evaluate the performance of zero-shot and few-shot detection; (ii) Use retrieval-based augmentation to enhance context in challenging scenes; (iii) Perform a comparative analysis between our proposed method and established deep learning models such as PointPillars- and VoxelNet-based architectures; (iv) Explore closed feedback loops in which MLLMs refine their performance using their own corrected outputs.
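
For task (ii), a retrieval step might look like the sketch below: previously annotated scenes are ranked by a simple token-overlap similarity and the closest matches are prepended to the detection prompt as extra context. The similarity measure and the exemplar store are illustrative assumptions; a production pipeline would more likely use learned image/text embeddings and a vector index.

```python
# Toy retrieval-augmentation sketch: pick the annotated scenes most similar to
# the current scene description and prepend them to the detection prompt.
# Token-overlap (Jaccard) similarity stands in for learned embeddings here.

EXEMPLAR_STORE = [
    ("Cyclist weaving between stopped buses at a signalized crossing.",
     "1 cyclist, 2 buses; behavior: riding between lanes; conflict: moderate"),
    ("Pedestrian crossing against the signal at night in light rain.",
     "1 pedestrian; behavior: crossing against signal; conflict: low"),
]


def jaccard(a: str, b: str) -> float:
    """Similarity between two descriptions based on shared lowercase tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def retrieve_context(query: str, k: int = 1) -> list[str]:
    """Return the k most similar annotated exemplars for prompt augmentation."""
    ranked = sorted(EXEMPLAR_STORE, key=lambda ex: jaccard(query, ex[0]), reverse=True)
    return [f"Similar scene: {scene}\nAnnotation: {label}" for scene, label in ranked[:k]]


if __name__ == "__main__":
    query = "Pedestrian crossing against the signal while a bus approaches."
    context = "\n\n".join(retrieve_context(query, k=1))
    prompt = f"{context}\n\nNow describe all VRUs and behaviors in the new scene: {query}"
    print(prompt)
```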
