AI Research Scientist (PhD)
Lead foundational research in event-centric temporal reasoning, multimodal representation learning, and long-horizon sequential coherence.
- Fully Remote
- Competitive Compensation
- PhD Required
TL;DR
The bad
- We hire deliberately and hold a high bar
- No perks theatre — no merch, no retreats
- Demanding workload
- English or Spanish required
- We treat you like an adult, not a child
The good
- Temporal reasoning | multimodal AI | representation learning | production research
- Fully remote
- Above-market compensation
- Clear path to Research Lead
- Significant performance-based cash bonuses
- Rolling interviews
Who we are
Rkive is an AI lab focused on multimodal reasoning across time and complexity.
We develop novel architectures and products. We aim to create intelligent environments that work alongside you: proactive, reliable, and responsive to intent.
- Meaningful work: What we are building is genuinely unprecedented. The problems are hard and the opportunity is enormous.
- Autonomy: Fully remote. Manage your own time. Take time off when you need it.
- Zero politics: No bureaucracy, no posturing, no performative culture. Just the work.
- Mutual respect: We back our people, but we expect the same in return.
- Honest environment: Not a family, not a pressure cooker. A high-trust, high-performance team.
The role
You will lead foundational research in temporal and multimodal reasoning.
The core challenge: models trained on next-token prediction or contrastive objectives develop strong single-frame and short-context representations, but they do not learn temporal event structure. They do not model how semantic states transition across time, what causal dependencies connect events, or how intent propagates through a sequence. These properties do not emerge reliably from scale alone — they require explicit supervision signals and representation targets designed around temporal structure.
This is the problem you will work on. You will not be starting from scratch.
- Existing infrastructure. You inherit four working systems (a hypothetical sketch of the second follows this list):
  - A live multimodal fusion layer that normalises heterogeneous media inputs (video, audio, images) into modality-specific token sequences, supporting both early and late fusion strategies.
  - A structured output schema that constrains model outputs to validated, parametrised representations of event boundaries, causal relations, and semantic state transitions. This decouples video understanding from video generation during training and reduces supervision cost by orders of magnitude relative to generation-based targets.
  - A unified model interface that enforces standardised input-output contracts across all model interactions, enabling native cross-model evaluation without additional instrumentation.
  - A rendering engine that executes structured representations deterministically, providing a consistent evaluation substrate across model variants and human baselines.
- What is needed: A researcher who can design and train temporal event representations on top of this infrastructure — using real human editorial decisions as supervision signal, evaluating under distribution shift and long-context conditions, and iterating against live production metrics.
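To make that concrete: the schema itself is internal, but a minimal sketch of what a validated, parametrised event representation can look like is below. This is our illustration, not Rkive's actual schema; every name in it is hypothetical.

```python
# Hypothetical sketch of a structured event schema. Rkive's real
# schema, field names, and validation rules will differ.
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TransitionKind(Enum):
    SCENE_CHANGE = "scene_change"
    STATE_UPDATE = "state_update"
    INTENT_SHIFT = "intent_shift"


@dataclass
class EventBoundary:
    start_s: float     # boundary onset, in seconds
    end_s: float       # boundary offset, in seconds
    confidence: float  # model confidence in [0, 1]

    def __post_init__(self):
        # Validation keeps model outputs inside the parametrised space.
        if self.end_s < self.start_s:
            raise ValueError("event must end after it starts")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


@dataclass
class CausalRelation:
    cause_idx: int   # index into the event list
    effect_idx: int  # index of the downstream event
    strength: float  # scalar causal weight


@dataclass
class TimelineAnnotation:
    events: List[EventBoundary]
    relations: List[CausalRelation] = field(default_factory=list)
    transitions: List[TransitionKind] = field(default_factory=list)
```

The point of constraining outputs to compact objects like these, rather than to rendered video, is that supervision reduces to comparing structured annotations; that is where the order-of-magnitude reduction in supervision cost comes from.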
What you will do
Move from representation theory to deployed capability.
This is not a pure research position. Your work ships — and the infrastructure to ship it already exists.
- Temporal representation architecture: Design and implement learned representations that model event boundaries, causal dependencies, and semantic state transitions as first-class objects rather than implicit correlates of frame-level features (see the first sketch after this list).
- Training on production signal: Develop training regimes that use real human editorial decisions as supervision targets. These decisions encode implicit temporal reasoning — what matters, when it matters, and why it relates to what preceded it — and are not available in standard video datasets.
- Evaluation and robustness: Build evaluation harnesses that measure causal coherence and semantic consistency under distribution shift, input perturbation, and context length scaling. Benchmark against external multimodal baselines through the existing model interface.
- Multimodal fusion: Develop methods for integrating visual, auditory, and textual signals into unified temporal representations, building on the existing fusion infrastructure.
- Efficiency research: Investigate dynamically sparse encoding guided by learned semantic attention, so that computation scales with semantic complexity rather than raw sequence length (see the second sketch after this list).
- Cross-domain generalisation: Extend temporal representations beyond the initial training domain to evaluate transfer across tasks and modalities.
- Publication and disclosure: Contribute to external research communication when the timing is appropriate, aligned with our IP and partnership strategy.
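To ground the first two bullets: one minimal version of "boundaries as first-class objects" is a head that predicts, per temporal token, whether a human editor would cut there, trained on labels derived from real edit decisions. The sketch below is ours and purely illustrative; it is not Rkive's architecture, and all names are hypothetical.

```python
# Minimal sketch: a per-token event-boundary head with explicit
# supervision from human edit points. Not Rkive's actual architecture.
import torch
import torch.nn as nn


class BoundaryHead(nn.Module):
    """Predicts, for each temporal token, whether an event boundary
    falls there, so boundaries are an explicit prediction target rather
    than an implicit correlate of frame-level features."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.boundary_logit = nn.Linear(d_model, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, time, d_model), e.g. from a fusion layer.
        h = self.temporal_encoder(tokens)
        return self.boundary_logit(h).squeeze(-1)  # (batch, time) logits


def boundary_loss(logits: torch.Tensor, cut_points: torch.Tensor) -> torch.Tensor:
    """cut_points: binary (batch, time) labels marking where human
    editors actually placed cuts. pos_weight compensates for boundaries
    being rare relative to non-boundary frames."""
    pos_weight = (cut_points == 0).sum() / cut_points.sum().clamp(min=1)
    return nn.functional.binary_cross_entropy_with_logits(
        logits, cut_points.float(), pos_weight=pos_weight)
```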
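And for the efficiency bullet, one common pattern for semantically guided sparsity is a cheap learned scorer that routes only the most salient tokens through the expensive encoder. Again hypothetical, and deliberately simplified:

```python
# Second sketch: dynamically sparse encoding. A cheap saliency scorer
# selects the top-k tokens; only those pass through the heavy encoder,
# so compute tracks semantic density rather than raw sequence length.
import torch
import torch.nn as nn


class SparseTemporalEncoder(nn.Module):
    def __init__(self, d_model: int = 512, keep_ratio: float = 0.25):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)  # cheap saliency head
        self.heavy = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, 8, batch_first=True),
            num_layers=6)                    # expensive encoder
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        _, t, d = tokens.shape
        k = max(1, int(t * self.keep_ratio))
        scores = self.scorer(tokens).squeeze(-1)   # (batch, t)
        top = scores.topk(k, dim=1).indices.sort(dim=1).values
        idx = top.unsqueeze(-1).expand(-1, -1, d)  # (batch, k, d)
        encoded = self.heavy(tokens.gather(1, idx))
        # Unselected positions keep their cheap representation, so
        # downstream shapes are unchanged. NOTE: hard top-k is not
        # differentiable w.r.t. the scores; training the scorer
        # end-to-end needs e.g. a straight-through estimator.
        return tokens.scatter(1, idx, encoded)
```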
How you will do it
Rigour first, then speed.
We value methodical research design over prolific but shallow experimentation.
- Hypothesis-driven: Formulate clear research questions, design controlled experiments, interpret results honestly.
- Production-aware: Design with deployment constraints in mind from the start — latency, memory, cost. The structured output schema and rendering pipeline are already optimised for this; your architectures should be too.
- Closed-loop evaluation: Leverage the fact that your models, external baselines, and human decisions all pass through the same execution environment. Use this for rigorous comparative evaluation that is not possible in most research settings (a sketch of the loop follows this list).
- Collaborative: Work directly with the founder and engineering team. The founder sets research direction, defines priorities, and reviews work personally. This is a small, focused group where research and engineering are not separate functions, and where the CEO is technically hands-on.
- Literature and field awareness: Stay deeply current with relevant work in temporal modelling, multimodal learning, efficient architectures, structured prediction, and representation theory.
- Tools: Python, PyTorch (JAX a plus), Hugging Face ecosystem, distributed training infrastructure, standard MLOps tooling.
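As a rough illustration of that closed loop (all names hypothetical; the real contracts live in the unified model interface and rendering engine described above):

```python
# Hypothetical sketch of closed-loop comparative evaluation: every
# annotator's structured output is executed by the same deterministic
# renderer and scored on the same metric.
from typing import Callable, Dict


def compare(clip, annotators: Dict[str, Callable], render: Callable,
            score: Callable) -> Dict[str, float]:
    """annotators maps names (in-house model, external baseline, human
    editor) to functions returning a structured annotation."""
    results = {}
    for name, annotate in annotators.items():
        annotation = annotate(clip)    # structured timeline annotation
        rendered = render(annotation)  # deterministic execution
        results[name] = score(clip, rendered)
    return results
```

Because every path goes through the same renderer, differences in the metric reflect differences in the annotations themselves, not in execution.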
Who you are
This role requires genuine research depth — not just engineering fluency with ML libraries.
We are looking for someone who thinks in representations, not just benchmarks.
- PhD required: In computer science, machine learning, computational neuroscience, or a closely related field. Completed or defending by mid-2026.
- Research focus: Strong background in one or more of: temporal modelling, sequence representation, multimodal learning, video understanding, structured prediction, event-centric reasoning, or causal inference over sequences.
- Publication record: Demonstrated ability to produce and publish rigorous research. Quality and relevance over volume.
- Systems intuition: You understand the gap between a research prototype and a production system. You care about closing it — and you will be joining an environment where the infrastructure to close it already exists.
- First-principles thinker: You question assumptions rather than accept consensus defaults. You can articulate why current approaches to temporal coherence fail, not just what might work next.
- Independent and driven: You do not need to be managed. You need to be pointed at the right problem and given the resources to solve it.
- Ambitious: You want your work to matter — not just in citations, but in deployed systems used by real people.
When
Join within the next 90 days. Stay for the long term.
This is a multi-year research programme, not a short engagement. Each stage produces concrete deliverables integrated into production before the next begins.
- Rolling interviews: We interview and hire as applications arrive. First come, first served.
- Start date: Between April 1st and June 30th, 2026.
- Bonus: Performance-based cash bonuses.
What to send
Show us your research and your thinking.
A strong CV gets you a look. A clear research vision gets you an interview.
- CV: Focused on research contributions and impact.
- Selected publications: Two or three papers you are most proud of, with a brief note on your specific contribution.
- Research statement: A concise summary of your research interests and how they relate to temporal reasoning, multimodal representations, or sequential coherence. If you have views on why current approaches to long-horizon multimodal understanding fall short, we want to read them.
- Recommendations: From advisors, co-authors, or collaborators.
- Code (optional): Links to relevant repositories, implementations, or prototypes.
Apply
If this is the kind of problem you have been waiting to work on, we want to hear from you.
Send your CV and research materials to careers@rkiveai.com