Always-on sensing is essential for next-generation edge and wearable AI systems, yet continuous high-fidelity RGB video capture remains prohibitively expensive for resource-constrained mobile and edge platforms. We present a new paradigm for efficient streaming video understanding: grayscale-always, color-on-demand. Through preliminary studies, we find that color is not always necessary: sparse RGB frames suffice for comparable performance when temporal structure is preserved by a continuous grayscale stream. Building on this insight, we propose ColorTrigger, an online, training-free trigger that selectively activates color capture based on windowed grayscale affinity analysis. Designed for real-time edge deployment, ColorTrigger uses lightweight quadratic programming to detect chromatic redundancy causally, coupled with credit-budgeted control and dynamic token routing to jointly reduce sensing and inference costs. On streaming video understanding benchmarks, ColorTrigger achieves 91.6% of full-color baseline performance while using only 8.1% of frames in RGB, demonstrating substantial color redundancy in natural videos and enabling practical always-on video sensing on resource-constrained devices.
(a) Performance vs. RGB ratio. Only a small fraction of RGB frames is sufficient to achieve comparable performance.
(b) Temporal attention map. CLS features from 30 consecutive frames reveal significant redundancy in adjacent frames.
We uniformly insert RGB frames into an otherwise grayscale video stream and evaluate on StreamingBench using Qwen2.5-VL-7B. The results show that only a small fraction of RGB frames is sufficient for comparable performance. This indicates substantial redundancy in the color dimension: many semantic tasks, such as action recognition, layout reasoning, and counting, are largely color-independent, with chromatic detail benefiting only a subset of key moments.
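The preliminary-study input can be constructed as below: most frames are converted to grayscale, and every k-th frame is kept in full color. This is a minimal sketch; the exact sampling interval and the choice of BT.601 luma weights are illustrative assumptions, not details from the paper.

```python
import numpy as np

def build_mixed_stream(frames, rgb_ratio=0.1):
    """Replace most RGB frames with grayscale, keeping a uniform subset
    in full color (illustrative reconstruction of the study setup)."""
    k = max(1, int(round(1.0 / rgb_ratio)))  # keep every k-th frame in RGB
    out = []
    for i, f in enumerate(frames):
        if i % k == 0:
            out.append(f)  # keep full RGB frame
        else:
            # ITU-R BT.601 luma; replicate to 3 channels so the frozen
            # MLLM's expected input shape is unchanged
            g = f @ np.array([0.299, 0.587, 0.114])
            out.append(np.repeat(g[..., None], 3, axis=-1))
    return out
```

Replicating the luma to three channels lets the mixed stream be fed to an unmodified vision encoder without any retraining.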
ColorTrigger is a grayscale-always, color-on-demand framework that treats color as an on-demand signal rather than a continuously sampled modality. It comprises two key components: (1) a training-free trigger that causally detects chromatic redundancy via windowed grayscale affinity analysis, solved with lightweight quadratic programming under credit-budgeted control; and (2) a dynamic token router that allocates fewer visual tokens to grayscale frames, jointly reducing sensing and inference costs.
The entire pipeline is training-free, strictly causal, and integrates seamlessly with frozen MLLMs, making it practical for energy-constrained always-on video sensing.
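As a rough illustration of the trigger's control flow, the sketch below fires an RGB capture when the current grayscale frame has low affinity with its recent window and enough capture credit has accrued. The paper formulates this as a lightweight quadratic program over windowed affinities; the cosine-similarity proxy, the threshold, and the credit scheme here are illustrative assumptions, not the actual method.

```python
import numpy as np

def color_trigger(gray_feats, window=8, threshold=0.9, budget=0.1):
    """Causal trigger sketch: decide per frame, using only past frames,
    whether to capture RGB. `budget` is credit accrued per frame; one
    full credit is spent per RGB capture (all values illustrative)."""
    credits = 0.0
    triggers = []
    for t, f in enumerate(gray_feats):
        credits = min(credits + budget, 1.0)  # accrue capture credit
        past = gray_feats[max(0, t - window):t]
        if not past:
            affinity = 0.0  # nothing to compare against: treat as novel
        else:
            sims = [f @ p / (np.linalg.norm(f) * np.linalg.norm(p) + 1e-8)
                    for p in past]
            affinity = max(sims)  # similarity to most-similar recent frame
        fire = affinity < threshold and credits >= 1.0
        if fire:
            credits -= 1.0
        triggers.append(fire)
    return triggers
```

Because each decision depends only on the preceding window, the loop is strictly causal and can run frame-by-frame on a streaming sensor.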
We evaluate ColorTrigger on two popular streaming VideoQA benchmarks: StreamingBench and OVO-Bench. Following the streaming setting, ColorTrigger processes all historical frames accumulated up to the current timestamp when each question is posed, enabling real-time response without access to future content.
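The streaming protocol can be summarized by the small evaluation loop below: each question sees only frames whose timestamps precede it. The function and parameter names (e.g., `answer_fn` standing in for the MLLM call) are illustrative, not from the benchmarks' official harnesses.

```python
def streaming_eval(timed_frames, questions, answer_fn):
    """Causal evaluation sketch: for each (timestamp, question) pair,
    answer using only frames accumulated up to that timestamp."""
    results = []
    for q_time, q in questions:
        history = [f for t, f in timed_frames if t <= q_time]  # no future access
        results.append(answer_fn(history, q))
    return results
```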
StreamingBench. ColorTrigger with 34.3% RGB frames scores 75.24, outperforming the recent online model Dispider-7B, remaining comparable to proprietary models such as Gemini 1.5 Pro (75.69), and surpassing GPT-4o (73.28) and Claude 3.5 Sonnet (72.44). Even with only 8.1% RGB frames (a 91.9% reduction), our approach scores 70.72, an 8.64-point improvement over the grayscale-only baseline (62.08).
OVO-Bench. Our model with 33.1% RGB frames achieves an overall score of 52.5, outperforming almost all existing open-source online MLLMs. Notably, our Real-Time Visual Perception performance (65.2) shows an 11.4-point improvement over the grayscale-only baseline (53.8), highlighting the importance of selectively introducing chromatic information at critical moments. Even with only 7.1% RGB frames (92.9% reduction), ColorTrigger maintains a competitive overall score of 50.4.
On Video-MME, ColorTrigger with 37.6% RGB frames achieves an overall score of 66.1, surpassing the full-RGB baseline InternVL-3.5-8B (65.6) while using 62.4% fewer chromatic frames. This demonstrates that our adaptive triggering mechanism not only reduces computational cost but can also improve performance by focusing RGB capacity on semantically critical moments. Even with only 9.1% RGB frames, our model scores 62.8, a 5.5-point improvement over the grayscale-only baseline.
Component Study. At 8.1% RGB usage, our full model (Grayscale + TrigRGB) achieves 70.72, recovering 91.6% of baseline performance. Removing continuous grayscale causes a 1.96-point drop, while replacing adaptive triggering with uniform sampling results in a 1.72-point degradation, demonstrating that both components are essential. At 34.3% RGB usage, our model scores 75.24 (97.5% recovery), consistently outperforming uniform sampling by 2.60 points.
Performance across varying RGB frame ratios. As the target rate r increases from 0.05 to 1.0, overall accuracy increases monotonically, confirming that color is beneficial but not always necessary. Task-specific trends reveal heterogeneous color dependencies: Attribute Perception saturates rapidly at a ~26% RGB ratio; Spatial Understanding remains largely flat (65.0–69.1%), indicating that minimal chromatic information is needed; and Clips Summarization and Text-Rich Understanding exhibit gradual gains.
Token Efficiency Analysis. By introducing the Dynamic Token Router (DT), which assigns 64 tokens to grayscale frames, our model reduces total visual tokens to only 31.1% of the full budget while scoring 70.72 (91.6% performance recovery). At 34.3% RGB usage, DT achieves 75.24 with 50.7% token consumption, closely matching the 256-token variant (76.32), which uses twice the computational resources.
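The token fractions above follow directly from a per-frame budget of 256 tokens for RGB frames and 64 for grayscale frames, measured against an all-RGB, 256-token baseline. The helper name below is illustrative; the per-frame budgets are from the analysis.

```python
def visual_token_fraction(rgb_ratio, rgb_tokens=256, gray_tokens=64):
    """Visual-token consumption relative to an all-RGB baseline that
    spends rgb_tokens on every frame."""
    per_frame = rgb_ratio * rgb_tokens + (1 - rgb_ratio) * gray_tokens
    return per_frame / rgb_tokens

visual_token_fraction(0.081)  # ≈ 0.311, matching the reported 31.1%
visual_token_fraction(0.343)  # ≈ 0.507, matching the reported 50.7%
```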
With only grayscale input, the MLLM fails to answer "What color is the SUV?" (incorrectly predicting "Blue"). ColorTrigger effectively triggers and captures the relevant RGB frames, enabling the model to correctly identify the answer as "Black".
coming soon