TEMPO
Temporal Event Modeling for Perception & Organization
Multimodal models trained on next-token prediction or contrastive objectives develop strong single-frame and short-context representations, but they do not learn temporal event structure: how semantic states transition across time, what causal dependencies connect events, or how intent propagates through a sequence. These properties do not emerge reliably from scale alone; they require supervision signals and representation targets explicitly designed around temporal structure.
TEMPO studies learned temporal event representations for multimodal sequences. The central hypothesis is that reliable long-horizon multimodal reasoning requires explicit modeling of event boundaries, causal dependencies, and semantic state transitions as first-class representational objects, not as implicit correlates of frame-level features. We develop and evaluate these representations using STR artifacts as supervision targets, which provide a semantically dense, temporally structured training signal at significantly lower cost than pixel-space or latent-space generation objectives.
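To make "first-class representational objects" concrete, the sketch below shows one way such a supervision target could be structured. This is a minimal illustration under assumed names (EventSpan, StateTransition, CausalLink, and TemporalEventGraph are not the TEMPO schema); it is meant only to show that the target is a small discrete structure rather than a pixel- or latent-space reconstruction.

```python
# Minimal sketch of a temporal event supervision target.
# All class and field names are illustrative assumptions, not the TEMPO schema.
from dataclasses import dataclass, field


@dataclass
class EventSpan:
    """A temporally bounded event, treated as a first-class object."""
    event_id: str
    start_s: float        # event boundary: onset, in seconds
    end_s: float          # event boundary: offset, in seconds
    semantic_state: str   # label for the semantic state during the span


@dataclass
class StateTransition:
    """A semantic state change between two events."""
    from_event: str       # event_id of the preceding event
    to_event: str         # event_id of the following event
    transition_type: str  # e.g. "topic_shift", "action_complete" (assumed labels)


@dataclass
class CausalLink:
    """A directed dependency: the cause precedes and enables the effect."""
    cause: str            # event_id of the enabling event
    effect: str           # event_id of the dependent event
    confidence: float     # confidence in [0, 1]


@dataclass
class TemporalEventGraph:
    """A full supervision target: events plus the structure connecting them."""
    events: list[EventSpan] = field(default_factory=list)
    transitions: list[StateTransition] = field(default_factory=list)
    causal_links: list[CausalLink] = field(default_factory=list)
```

Framed this way, training reduces to boundary detection plus classification over a small graph rather than reconstruction of frames, which is where the cost advantage over generation objectives would come from.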
Training on real human decisions in production, rather than on synthetic benchmarks or web-scraped video, is a deliberate methodological choice. Human editorial decisions encode implicit temporal reasoning: what matters, when it matters, and why it relates to what preceded it. This signal is not available in standard video datasets and is not reproducible outside a production deployment surface.
The program runs in three stages, each producing research artifacts integrated into production before the next begins.
Stage 1: Temporal event representation architecture. Long-context evaluation harness designed for causal coherence and semantic consistency under distribution shift; a sketch of one such check follows this stage list. Production integration for real-world measurement against live human signal.
Stage 2: Cross-domain generalization of temporal representations. Robustness under input perturbation and context-length scaling; a perturbation-proxy sketch also follows the list. Evaluation against external multimodal baselines through the UMI.
Stage 3: Long-horizon sequential reasoning across extended multimodal contexts. Cross-task transfer of temporal representations beyond the training domain. Selective external disclosure aligned with IP and partnership strategy.
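A harness of the Stage 1 kind can be pictured as a set of structural checks over predicted event graphs. The sketch below, reusing the TemporalEventGraph types from the earlier sketch, scores one assumed check: whether every predicted cause temporally precedes its effect. The metric and its name are illustrative, not the TEMPO harness.

```python
# Illustrative causal-coherence check; reuses the TemporalEventGraph
# sketch above. The metric definition is an assumption, not the TEMPO harness.
def causal_coherence(graph: TemporalEventGraph) -> float:
    """Fraction of causal links whose cause starts before its effect."""
    starts = {e.event_id: e.start_s for e in graph.events}
    links = [link for link in graph.causal_links
             if link.cause in starts and link.effect in starts]
    if not links:
        return 1.0  # no causal claims to violate
    ordered = sum(starts[link.cause] < starts[link.effect] for link in links)
    return ordered / len(links)
```

A full harness would aggregate many such checks across long contexts and compare the scores against live human signal.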
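For the Stage 2 robustness checks, one simple proxy is to compare the semantic states a model predicts before and after perturbing its input. In the sketch below, predict_graph and perturb are hypothetical stand-ins for a model interface and a perturbation (for example, frame dropout), and Jaccard overlap is an assumed stability metric.

```python
from typing import Callable

# Illustrative perturbation-robustness proxy. `predict_graph` and `perturb`
# are hypothetical interfaces; Jaccard overlap is an assumed metric.
def semantic_consistency(
    predict_graph: Callable[[list], TemporalEventGraph],
    frames: list,
    perturb: Callable[[list], list],
) -> float:
    """Overlap of predicted semantic states before and after perturbation."""
    base = {e.semantic_state for e in predict_graph(frames).events}
    shifted = {e.semantic_state for e in predict_graph(perturb(frames)).events}
    if not base and not shifted:
        return 1.0  # both empty: trivially consistent
    return len(base & shifted) / len(base | shifted)
```

Context-length scaling could be probed the same way, with perturb replaced by truncation of the input to progressively shorter windows.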