...The research prototype includes components for visual grounding (understanding when a user references something in view), gesture recognition and synthesis, and turn-taking mechanisms that mirror human conversational timing. Because latency and synchronization are critical, the codebase invests in asynchronous scheduling, overlap of perception and reasoning, and fast fallback responses.