Always-on sensing is essential for next-generation edge and wearable AI systems, yet continuous high-fidelity RGB video capture remains prohibitively expensive for resource-constrained mobile and edge platforms. We present a new paradigm for efficient streaming video understanding: grayscale-always, color-on-demand. Through preliminary studies, we find that color is not always necessary: sparse RGB frames suffice for comparable performance when temporal structure is preserved by a continuous grayscale stream. Building on this insight, we propose ColorTrigger, an online, training-free trigger that selectively activates color capture based on windowed grayscale affinity analysis. Designed for real-time edge deployment, ColorTrigger uses lightweight quadratic programming to detect chromatic redundancy causally, coupled with credit-budgeted control and dynamic token routing to jointly reduce sensing and inference costs. On streaming video understanding benchmarks, ColorTrigger achieves 91.6% of full-color baseline performance while using only 8.1% of frames in RGB, demonstrating substantial color redundancy in natural videos and enabling practical always-on video sensing on resource-constrained devices.
(a) Performance vs. RGB ratio. Only a small fraction of RGB frames is sufficient to achieve comparable performance.
(b) Temporal attention map. CLS features from 30 consecutive frames reveal significant redundancy in adjacent frames.
We uniformly insert RGB frames into an otherwise grayscale video stream and evaluate on StreamingBench using Qwen2.5-VL-7B. The results show that only a small fraction of RGB frames is sufficient for comparable performance. This indicates substantial redundancy in the color dimension: many semantic tasks, such as action recognition, layout reasoning, and counting, are largely color-independent, with chromatic detail benefiting only a subset of key moments.
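The preliminary-study input can be constructed as below: most frames are converted to grayscale, and every k-th frame is kept in full color. This is a minimal sketch; the exact sampling interval and the choice of BT.601 luma weights are illustrative assumptions, not details from the paper.

```python
import numpy as np

def build_mixed_stream(frames, rgb_ratio=0.1):
    """Replace most RGB frames with grayscale, keeping a uniform subset
    in full color (illustrative reconstruction of the study setup)."""
    k = max(1, int(round(1.0 / rgb_ratio)))  # keep every k-th frame in RGB
    out = []
    for i, f in enumerate(frames):
        if i % k == 0:
            out.append(f)  # keep full RGB frame
        else:
            # ITU-R BT.601 luma; replicate to 3 channels so the frozen
            # MLLM's expected input shape is unchanged
            g = f @ np.array([0.299, 0.587, 0.114])
            out.append(np.repeat(g[..., None], 3, axis=-1))
    return out
```

Replicating the luma to three channels lets the mixed stream be fed to an unmodified vision encoder without any retraining.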
ColorTrigger is a grayscale-always, color-on-demand framework that treats color as an on-demand signal rather than a continuously sampled modality. It comprises two key components: (1) a training-free trigger that causally detects chromatic redundancy via windowed grayscale affinity analysis, solved with lightweight quadratic programming under credit-budgeted control; and (2) a dynamic token router that allocates fewer visual tokens to grayscale frames, jointly reducing sensing and inference costs.
The entire pipeline is training-free, strictly causal, and integrates seamlessly with frozen MLLMs, making it practical for energy-constrained always-on video sensing.
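As a rough illustration of the trigger's control flow, the sketch below fires an RGB capture when the current grayscale frame has low affinity with its recent window and enough capture credit has accrued. The paper formulates this as a lightweight quadratic program over windowed affinities; the cosine-similarity proxy, the threshold, and the credit scheme here are illustrative assumptions, not the actual method.

```python
import numpy as np

def color_trigger(gray_feats, window=8, threshold=0.9, budget=0.1):
    """Causal trigger sketch: decide per frame, using only past frames,
    whether to capture RGB. `budget` is credit accrued per frame; one
    full credit is spent per RGB capture (all values illustrative)."""
    credits = 0.0
    triggers = []
    for t, f in enumerate(gray_feats):
        credits = min(credits + budget, 1.0)  # accrue capture credit
        past = gray_feats[max(0, t - window):t]
        if not past:
            affinity = 0.0  # nothing to compare against: treat as novel
        else:
            sims = [f @ p / (np.linalg.norm(f) * np.linalg.norm(p) + 1e-8)
                    for p in past]
            affinity = max(sims)  # similarity to most-similar recent frame
        fire = affinity < threshold and credits >= 1.0
        if fire:
            credits -= 1.0
        triggers.append(fire)
    return triggers
```

Because each decision depends only on the preceding window, the loop is strictly causal and can run frame-by-frame on a streaming sensor.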
We evaluate ColorTrigger on two popular streaming VideoQA benchmarks: StreamingBench and OVO-Bench. Following the streaming setting, ColorTrigger processes all historical frames accumulated up to the current timestamp when each question is posed, enabling real-time response without access to future content.
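The streaming protocol can be summarized by the small evaluation loop below: each question sees only frames whose timestamps precede it. The function and parameter names (e.g., `answer_fn` standing in for the MLLM call) are illustrative, not from the benchmarks' official harnesses.

```python
def streaming_eval(timed_frames, questions, answer_fn):
    """Causal evaluation sketch: for each (timestamp, question) pair,
    answer using only frames accumulated up to that timestamp."""
    results = []
    for q_time, q in questions:
        history = [f for t, f in timed_frames if t <= q_time]  # no future access
        results.append(answer_fn(history, q))
    return results
```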
StreamingBench. ColorTrigger with 34.3% RGB frames scores 75.24, outperforming the recent online model Dispider-7B, remaining comparable to proprietary models such as Gemini 1.5 Pro (75.69), and surpassing GPT-4o (73.28) and Claude 3.5 Sonnet (72.44). Even with only 8.1% RGB frames (a 91.9% reduction), our approach scores 70.72, an 8.64-point improvement over the grayscale-only baseline (62.08).
OVO-Bench. Our model with 33.1% RGB frames achieves an overall score of 52.5, outperforming almost all existing open-source online MLLMs. Notably, our Real-Time Visual Perception performance (65.2) shows an 11.4-point improvement over the grayscale-only baseline (53.8), highlighting the importance of selectively introducing chromatic information at critical moments. Even with only 7.1% RGB frames (92.9% reduction), ColorTrigger maintains a competitive overall score of 50.4.
On Video-MME, ColorTrigger with 37.6% RGB frames achieves an overall score of 66.1, surpassing the full-RGB baseline InternVL-3.5-8B (65.6) while using 62.4% fewer chromatic frames. This demonstrates that our adaptive triggering mechanism not only reduces computational cost but can also improve performance by focusing RGB capacity on semantically critical moments. Even with only 9.1% RGB frames, our model scores 62.8, a 5.5-point improvement over the grayscale-only baseline.
Component Study. At 8.1% RGB usage, our full model (Grayscale + TrigRGB) achieves 70.72, recovering 91.6% of baseline performance. Removing continuous grayscale causes a 1.96-point drop, while replacing adaptive triggering with uniform sampling results in a 1.72-point degradation, demonstrating that both components are essential. At 34.3% RGB usage, our model scores 75.24 (97.5% recovery), consistently outperforming uniform sampling by 2.60 points.
Performance across varying RGB frame ratios. As the target rate r increases from 0.05 to 1.0, overall accuracy increases monotonically, confirming that color is beneficial but not always necessary. Task-specific trends reveal heterogeneous color dependencies: Attribute Perception saturates rapidly at a ~26% RGB ratio; Spatial Understanding remains largely flat (65.0–69.1%), indicating that minimal chromatic information is needed; and Clips Summarization and Text-Rich Understanding exhibit gradual gains.
Token Efficiency Analysis. By introducing the Dynamic Token Router (DT), which assigns 64 tokens to grayscale frames, our model reduces total visual tokens to only 31.1% of the full budget while scoring 70.72 (91.6% performance recovery). At 34.3% RGB usage, DT achieves 75.24 with 50.7% token consumption, closely matching the 256-token variant (76.32), which uses twice the computational resources.
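The token fractions above follow directly from a per-frame budget of 256 tokens for RGB frames and 64 for grayscale frames, measured against an all-RGB, 256-token baseline. The helper name below is illustrative; the per-frame budgets are from the analysis.

```python
def visual_token_fraction(rgb_ratio, rgb_tokens=256, gray_tokens=64):
    """Visual-token consumption relative to an all-RGB baseline that
    spends rgb_tokens on every frame."""
    per_frame = rgb_ratio * rgb_tokens + (1 - rgb_ratio) * gray_tokens
    return per_frame / rgb_tokens

visual_token_fraction(0.081)  # ≈ 0.311, matching the reported 31.1%
visual_token_fraction(0.343)  # ≈ 0.507, matching the reported 50.7%
```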
With only grayscale input, the MLLM fails to answer "What color is the SUV?" (incorrectly predicting "Blue"). ColorTrigger effectively triggers and captures the relevant RGB frames, enabling the model to correctly identify the answer as "Black".
coming soon