ground-event → Find specific event timestamps (approximate video search)
extract-video-parts → Extract frame clips from relevant segments
analyze → Analyze frames with a VLM
transcribe-speech → Automatic speech recognition (ASR)
web-search → Retrieve external knowledge
In essence, the agent's actual operation is largely iterative video retrieval combined with reasoning.
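The loop described above can be sketched roughly as follows. This is a minimal illustration, not SAGE's implementation: the tool names come from the list above, but the tool bodies are placeholder stubs, and `AgentState`, `scripted_policy`, and `run_agent` are hypothetical names introduced here. In a real system the policy would be the trained reasoning model rather than a fixed script.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Placeholder tools. Names mirror the tool list; bodies are stubs (assumption),
# standing in for video search, clip extraction, a VLM, ASR, and web search.
TOOLS: Dict[str, Callable[[str], str]] = {
    "ground-event": lambda q: f"timestamps for '{q}'",
    "extract-video-parts": lambda q: f"frame clips for '{q}'",
    "analyze": lambda q: f"VLM analysis for '{q}'",
    "transcribe-speech": lambda q: f"transcript for '{q}'",
    "web-search": lambda q: f"web results for '{q}'",
}


@dataclass
class AgentState:
    """Question plus the (tool, result) observations gathered so far."""
    question: str
    observations: List[Tuple[str, str]] = field(default_factory=list)


def scripted_policy(state: AgentState) -> Tuple[str, str]:
    """Hypothetical stand-in for the reasoning model.

    Returns (action, argument). It follows a fixed plan, then answers.
    """
    plan = ["ground-event", "extract-video-parts", "analyze"]
    step = len(state.observations)
    if step < len(plan):
        return plan[step], state.question
    return "answer", f"answer based on {step} observations"


def run_agent(
    question: str,
    policy: Callable[[AgentState], Tuple[str, str]] = scripted_policy,
    max_steps: int = 8,
) -> Tuple[str, AgentState]:
    """Alternate between choosing a tool and recording its result."""
    state = AgentState(question)
    for _ in range(max_steps):
        action, arg = policy(state)
        if action == "answer":
            return arg, state
        # Retrieval step: call the chosen tool and keep its output as context.
        state.observations.append((action, TOOLS[action](arg)))
    return "no answer within step budget", state


if __name__ == "__main__":
    answer, final_state = run_agent("When does the goal happen?")
    print(answer)
    for tool_name, result in final_state.observations:
        print(tool_name, "->", result)
```

The step budget (`max_steps`) is one simple way to bound the loop for videos of any length; how SAGE itself decides when to stop is not stated in these notes.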
SAGE: Smart Any-Horizon Agents
"What are the technical challenges toward effectively training video reasoning models under the AGENT paradigm with Reinforcement Learning?"
https://praeclarumjj3.github.io/sage/
SAGE - a allenai Collection
Smart Any-Horizon Agent for Long Video Reasoning
https://huggingface.co/collections/allenai/sage

Seonglae Cho