...The result is a perception layer that feeds images and timestamped transcripts into Claude, allowing it to analyze events, answer questions, and summarize content with contextual awareness. The system dynamically adapts how much data it extracts based on the user’s query, adjusting frame rate, resolution, and time windows to optimize both performance and token efficiency. It supports multiple backends for audio processing, including local and cloud-based options, enabling flexible deployment depending on privacy or performance requirements.