
Continuity Models: Beyond language and spatial intelligence

By The Rkive Team · 02/26/26
research · ai · multimodal · continuity
Multimodal AI can now render and describe with astonishing fidelity, yet it still fails to preserve continuity of reasoning across time. This research thesis introduces Continuity Models as Rkive's direction for reliable long-horizon multimodal reasoning in production.

This research article argues that the core frontier in AI is not only bigger context windows, more frames, or richer world models, but explicit continuity: temporal, logical, and operational. It maps current limitations across language, video, spatial intelligence, and representation-first approaches; explains why more capacity is not the same as continuity; and defines Rkive's Continuity Models paradigm, where temporal event structure becomes first-class and measurable in production. The thesis emphasizes structured temporal representations, deterministic execution, interface-level comparability, and learning loops grounded in real human decisions.



A research thesis for reliable multimodal reasoning in production

Multimodal AI is now good enough to confuse the conversation.

A state-of-the-art video model can render a spectacular fight scene: cinematic camera motion, glossy metal, volumetric smoke, showers of sparks, precise lighting. For a few seconds it looks like a blockbuster.

Then the sequence betrays itself.

A strike lands and the recoil travels in the wrong direction. Sparks shear upwind. An explosion lights the frame correctly but fails to perturb the objects it hits. A character's face changes subtly across cuts. The clip is impressive as a moving image, but nonsensical as an unfolding event.

In language, the equivalent failure is quieter: a system follows instructions, then gradually violates them; it uses the right vocabulary while losing the constraint that mattered; it answers fluently while drifting away from the earlier plan. Long context increases capacity, but meaning does not reliably persist. [1]

Across modalities, the pattern is consistent:

our models can render and describe with increasing fidelity, but they do not reliably preserve continuity of reasoning across time.

Rkive's research thesis is that continuity -- temporal, logical, and operational -- must become a first-class object of modeling and systems design. We call this paradigm Continuity Models.


1) Current limitations: what breaks today (and why it matters)

These are not edge cases. They are the practical limits that appear whenever systems are asked to understand, decide, and act over long horizons.

1.1 Language: high reasoning density, fragile persistence

Language is not a shortcut. It is civilization's highest-density encoding of abstraction: math, science, policy, engineering, strategy.

But modern LLMs operate in a regime where continuity is not native:

  • Long-context utilization is brittle. Even models that accept long inputs often underuse information in the middle of the context; performance depends sharply on where relevant facts appear. [1]
  • Compression reduces tokens, not necessarily semantics. Context compression methods exist because the need is real, but preserving the invariants that matter -- goals, constraints, definitions -- under aggressive compression remains difficult and degrades performance in practice. [2]
  • Continuity across inference runs is brittle. Production systems do not run once. They iterate: draft -> critique -> revise; agent A -> agent B -> retry. Keeping intent and constraints stable across these transitions -- without drift and without compounding error -- remains a core challenge of productizing long-horizon reasoning.

The failure mode is not always wrong answers. It is semantic drift: the model stays fluent while quietly changing what it thinks the task is.
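One way to make this kind of drift visible is to keep constraints explicit and re-check them after every revision, rather than trusting them to persist inside the model. A minimal sketch, with invented constraint names and drafts purely for illustration:

```python
# Hypothetical sketch: constraints kept as explicit predicates and re-checked
# after each revision. The constraint names and drafts are invented examples.

def check_constraints(text: str, constraints: dict) -> list:
    """Return the names of constraints the text no longer satisfies."""
    return [name for name, holds in constraints.items() if not holds(text)]

constraints = {
    "mentions_budget": lambda t: "budget" in t.lower(),
    "under_30_words": lambda t: len(t.split()) <= 30,
}

draft_v1 = "Plan: ship in Q3 within the agreed budget."
draft_v2 = "Plan: ship in Q3 with expanded scope."  # fluent, but drifted

print(check_constraints(draft_v1, constraints))  # []
print(check_constraints(draft_v2, constraints))  # ['mentions_budget']
```

The point is not the toy predicates; it is that a constraint the system can re-verify is one it cannot silently lose.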

1.2 Video generation: visual realism rising, event structure lagging

The public frontier in video is now undeniably impressive -- short clips with synchronized audio, high-resolution outputs, and complex multimodal conditioning are shipping. [4]

But the dominant limitations are not about aesthetics. They are about event structure:

  • Identity drift: objects and characters subtly morph across time -- faces, geometry, wardrobe, scale -- despite being the same.
  • Temporal hallucination: details appear and disappear because the model optimizes local plausibility rather than conserved state.
  • Causal incoherence: motion is smooth, yet dynamics are wrong -- forces do not propagate, contacts do not constrain, effects move in the wrong direction.
  • Narrative collapse: goals and constraints are not conserved; the sequence stops behaving like an unfolding plan and becomes a sequence of visually plausible moments.

This is why coherence is still easiest to demonstrate over seconds, not minutes. Short-horizon video can hide the absence of explicit temporal structure.
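Identity drift, at least, can be quantified once per-frame identity embeddings exist. The sketch below assumes some identity encoder has already produced embeddings for the "same" object across frames; the encoder, the toy vectors, and the threshold are all assumptions:

```python
# Hedged sketch: flag identity drift by comparing per-frame identity
# embeddings. embed vectors and the 0.9 threshold are illustrative only.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def drift_frames(embeddings, threshold=0.9):
    """Indices where similarity to the previous frame drops below threshold."""
    return [i for i in range(1, len(embeddings))
            if cosine(embeddings[i - 1], embeddings[i]) < threshold]

# Toy embeddings: frame 2 morphs away from the established identity.
frames = [[1.0, 0.0], [0.99, 0.05], [0.2, 0.9]]
print(drift_frames(frames))  # [2]
```

A check like this treats identity as conserved state to be verified, rather than a property the generator is assumed to respect.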

1.3 Spatial intelligence: richer worlds, mostly static semantics

Spatial systems are a real frontier. World generation and persistent 3D environments matter for interaction, simulation, and robotics. [5][6]

But persistent space is not the same thing as persistent time.

A system can generate an editable, navigable 3D world and still lack:

  • explicit event boundaries
  • state transitions that conserve invariants
  • causal dependencies that survive interaction
  • intent propagation across extended sequences

Many world systems today are closer to world state than world dynamics: spectacular environments with weak temporal reasoning.

1.4 Physics-leaning / representation-first approaches: scientifically strong, hard to steer into products

Representation learning approaches that avoid pixel reconstruction and text supervision are intellectually important. JEPA-style video representation learning is one example: learning predictive structure in representation space rather than generating pixels. [7][8]

The productization tension is also real:

  • Steering: general representations do not automatically produce controllable, typed decisions (what matters, when, and why).
  • Interfaces: production systems need validated outputs and stable contracts; good embeddings are not enough.
  • Evaluation alignment: representation quality must be measured against continuity under real distribution shift, not only against offline benchmarks.

These methods can strengthen the substrate. They do not, by default, deliver end-to-end continuity of reasoning.


2) Why more context is not the same as continuity

A large fraction of the field is trying to solve the problem of time by increasing capacity:

  • longer context windows
  • more frames
  • longer clips
  • heavier retrieval
  • aggressive compression

These are necessary tools. They are not the missing primitive.

You can process a million tokens and still lose the constraint that mattered. You can generate visually coherent motion and still violate causality. You can retrieve relevant context and still fail to preserve semantic value across multi-step reasoning. [1][2]

Continuity is not a capacity claim. It is a representational and systems claim.


3) The thesis: Continuity Models

Rkive is building Continuity Models: models and systems designed to maintain continuity -- temporal, logical, and operational -- across language, video, and multimodal workflows. The term has a double meaning: temporal continuity as the mechanism, logical continuity as the outcome.

The thesis is direct:

Reliable long-horizon multimodal reasoning requires explicit modeling of temporal event structure. Temporal continuity is a practical path to logical continuity.

By event structure, we mean first-class representational objects such as:

  • event boundaries: what counts as a meaningful unit of change
  • semantic state transitions: what changed vs. what remained invariant
  • causal dependencies: why the transition occurred
  • intent propagation: how goals and constraints carry forward across steps

If these remain implicit, they are fragile and difficult to supervise. If they are explicit, they can be trained, evaluated, compared, and improved.
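To make "explicit" concrete, the four objects above can be sketched as typed structures. The field names here are illustrative assumptions, not Rkive's actual schema:

```python
# A minimal sketch of event structure as explicit, typed objects.
# Field names are illustrative assumptions, not a real schema.
from dataclasses import dataclass, field

@dataclass
class Event:
    boundary: tuple          # (start, end): one meaningful unit of change
    changed: dict            # state that transitioned during the event
    invariant: dict          # state that must be conserved across the event
    caused_by: list = field(default_factory=list)  # ids of prior events
    intent: str = ""         # the goal this event carries forward

def conserves(prev: Event, nxt: Event) -> bool:
    """True if nxt preserves prev's invariants and carries its intent."""
    invariants_held = all(nxt.invariant.get(k) == v
                          for k, v in prev.invariant.items())
    return invariants_held and nxt.intent == prev.intent

e1 = Event((0.0, 1.2), {"pose": "swing"}, {"identity": "robot_a"}, [], "win duel")
e2 = Event((1.2, 2.0), {"pose": "recoil"}, {"identity": "robot_a"}, [0], "win duel")
print(conserves(e1, e2))  # True
```

Once the objects exist, conservation becomes a checkable property of a sequence instead of a hoped-for side effect of generation.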


4) Why we prioritize time as a training vector (without dismissing language or space)

Language is unmatched in informational density. Spatial modeling is essential for grounded structure. We treat both as foundational.

We prioritize time because time imposes pressures that neither static images nor text-only corpora impose reliably.

4.1 Time forces invariants to survive change

Identity, intent, constraints, causality -- these are not optional. They must persist while surface details vary. Time makes it obvious when they do not.

4.2 Time introduces useful stochasticity

Temporal sequences naturally include variance: camera changes, compression artifacts, edits, interruptions, speaker shifts, background drift. This variance pressures representations to generalize beyond static correlations while remaining anchored to meaning.

4.3 Time makes fusion real

Multimodal fusion is not concatenation. It is alignment into events: a word lands with a gesture; audio binds to an impact; a cut lands on a beat. Temporal structure is the substrate that binds modalities into the same underlying event.

4.4 Cognitive grounding for event structure

Humans parse continuous experience into discrete events, and event boundaries relate to memory updating and learning. [3] Developmental theories also emphasize cognition emerging through perception and action unfolding over time (for example, object permanence and the move from sensorimotor experience to mental representation). [9]

We are not claiming models must mimic humans. We are claiming that event structure is a plausible, defensible representational target for building stable long-horizon reasoning.


5) From thesis to production: making continuity measurable

Continuity is not something you hope emerges. In production, it has to be inspectable.

That implies a stack where:

  • multimodal inputs are normalized into stable representations, independent of provider quirks
  • outputs make temporal structure explicit (so continuity can be evaluated, not inferred)
  • execution is deterministic enough that improvements can be attributed to representation quality
  • evaluation is native across model variants (internal and external) using the same contracts

This is why our research emphasizes structured temporal representations and interface-level comparability: continuity is only improvable when it is measurable.
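A toy version of interface-level comparability: if every model variant must emit the same contract, the same continuity checks apply to all of them. The schema keys and outputs below are invented for illustration:

```python
# Sketch: a fixed output contract lets different model variants be compared
# on continuity with identical checks. Keys and run data are invented.

REQUIRED_KEYS = {"events", "invariants_held", "intent"}

def validate(output: dict) -> bool:
    """Reject outputs that do not expose temporal structure explicitly."""
    return REQUIRED_KEYS <= output.keys()

def continuity_score(outputs: list) -> float:
    """Fraction of steps whose intent matches the first step's intent."""
    intent = outputs[0]["intent"]
    return sum(o["intent"] == intent for o in outputs) / len(outputs)

run = [
    {"events": 3, "invariants_held": True, "intent": "summarize"},
    {"events": 2, "invariants_held": True, "intent": "summarize"},
    {"events": 4, "invariants_held": True, "intent": "expand"},  # drift
]
assert all(validate(o) for o in run)
print(round(continuity_score(run), 2))  # 0.67
```

Because the contract is stable, swapping the model behind it changes the score, not the evaluation.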


6) Methodology: real human decisions as temporal supervision

Benchmarks are reproducible. They are often not the highest-signal source for production reasoning.

Rkive trains and evaluates against real human decisions in production -- the kind of signal that encodes temporal reasoning implicitly:

  • what mattered
  • when it mattered
  • why it related to what came before
  • how intent should persist across revisions

This supervision signal is difficult to obtain from scraped datasets alone. It emerges when systems are deployed and used repeatedly under real constraints.
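As a rough illustration, one production decision can be logged as a temporal supervision record capturing what, when, why, and whether intent persisted. The field names and log entries are assumptions, not a real logging format:

```python
# Hedged sketch: human decisions logged as temporal supervision records.
# Field names and entries are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class DecisionRecord:
    what: str          # the item the human acted on
    when: float        # timestamp within the session
    why: str           # link back to earlier context
    intent_kept: bool  # did the revision preserve the stated goal?

log = [
    DecisionRecord("kept clause 3", 12.4, "matches brief", True),
    DecisionRecord("rewrote intro", 40.1, "tone constraint", True),
    DecisionRecord("deleted aside", 55.0, "off-goal", False),
]
# Derived signal: how often intent survived a human-reviewed step.
print(round(sum(r.intent_kept for r in log) / len(log), 2))  # 0.67
```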

That loop -- research -> production -> human signal -> stronger representations -- is a strategic advantage when the goal is reliability, not only demos.


7) What this is building toward

Continuity Models are not an aesthetic project. They are a reliability project.

The target is systems that can:

  • preserve intent and constraints across long horizons
  • maintain causal coherence rather than local plausibility
  • remain stable across edits, retries, and multi-agent inference loops
  • fuse modalities into event-level understanding
  • support genuinely useful reasoning in production settings

This is the direction.


References

[1] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.
URL: https://arxiv.org/abs/2307.03172

[2] Huang, C., Zhu, G., Wang, X., Luo, Y., Ge, G., Chen, H., Yi, D., Wang, J. (2024). Recurrent Context Compression: Efficiently Expanding the Context Window of LLM. arXiv:2406.06110.
URL: https://arxiv.org/abs/2406.06110

[3] Zacks, J. M., Speer, N. K., Swallow, K. M., Braver, T. S., Reynolds, J. R. (2007). Segmentation in the perception and memory of events. Trends in Cognitive Sciences.
URL: https://www.sciencedirect.com/science/article/pii/S1364661307003312

[4] ByteDance Seed. (2026). Seedance 2.0 -- unified multimodal audio-video joint generation architecture.
URL: https://seed.bytedance.com/en/seedance2_0

[5] World Labs. (2025-2026). World Labs -- spatial intelligence company building models that perceive, generate, and interact with the 3D world.
URL: https://www.worldlabs.ai/

[6] Vincent, J. (2025). World Labs is betting on world generation as the next AI frontier. The Verge.
URL: https://www.theverge.com/ai-artificial-intelligence/820016/world-labs-is-betting-on-world-generation-as-the-next-ai-frontier

[7] Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA). arXiv:2404.08471.
URL: https://arxiv.org/abs/2404.08471

[8] Meta AI / FAIR. (2024-2026). facebookresearch/jepa -- V-JEPA codebase and resources.
URL: https://github.com/facebookresearch/jepa

[9] OpenStax / Baylor University OpenBooks. (n.d.). Cognition in Infancy and Childhood (Piaget; sensorimotor development, object permanence).
URL: https://openbooks.library.baylor.edu/lifespanhumandevelopment/chapter/chapter-9-1-cognition-in-infancy-and-childhood/
