In March 2025, a team of researchers released "Guess What I Am Thinking: A Benchmark for Inner Thought Reasoning of Role-Playing Language Agents." The paper introduced an approach to AI-based reasoning that models not just what an agent does but why it chooses to act, through simulated internal deliberation.
Rather than treating thought as narrative, this paper structured inner monologue as a data layer, a benchmark for measuring how well language models simulate the internal logic of role-based agents. The framework it introduced — MIRROR (Memory Recall, Theory of Mind, Reflection) — became our inspiration for transforming foresight work in real-world geopolitical contexts.
As part of our preparation for a workshop in May, we adapted and operationalized this method for use in Arctic strategic simulation. Below, we outline how it works, how we’re applying it to Arctic tensions, and what insights we expect to emerge.
The MIRROR Method: Structuring Human-Like Reasoning in AI Agents
At the centre of our foresight system is a structured process:
Memory Recall: Each agent draws on relevant past experience — real-world missions, diplomatic patterns, or operational history — to interpret the situation.
Theory of Mind: Each agent anticipates how others (allies, adversaries, civilians) will react to its actions, based on those actors' perceptions, values, and red lines.
Reflection: Each agent synthesizes this into a coherent inner monologue: not a statement of policy, but a candid reasoning chain that captures tension, doubt, and restraint.
This process is run inside a multi-agent system (MAS) composed of advanced large language models (LLMs) — such as GPT-4o and Claude 3.7 — each assigned to a unique role with tailored memory and situational prompts. The agents do not produce “answers.” They produce reasoning.
This method doesn’t just simulate decision outcomes. It reveals the invisible logic behind choices — including non-decisions — enabling us to identify where escalation, misinterpretation, or moral conflict might quietly emerge.
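To make the three-step sequence concrete, here is a minimal Python sketch of how one agent's MIRROR chain could be produced. The step wording, the `MirrorChain` class, and the `run_mirror` function are our illustrative assumptions, not the benchmark paper's code; `ask` stands for any function that sends a prompt to an LLM and returns its reply.

```python
from dataclasses import dataclass

# Hypothetical step instructions paraphrasing the three MIRROR stages.
MIRROR_STEPS = (
    ("memory_recall", "Recall past experience relevant to this situation."),
    ("theory_of_mind", "Anticipate how allies, adversaries, and civilians will react."),
    ("reflection", "Synthesize the above into a candid inner monologue, not a policy statement."),
)


@dataclass
class MirrorChain:
    """One agent's reasoning chain, one field per MIRROR step."""
    role: str
    memory_recall: str = ""
    theory_of_mind: str = ""
    reflection: str = ""


def run_mirror(role, scenario, ask):
    """Run the three steps in order, feeding earlier replies into later prompts.

    `ask` is any callable mapping a prompt string to a reply string,
    for example a thin wrapper around an LLM API (not shown here).
    """
    chain = MirrorChain(role=role)
    context = f"Role: {role}\nScenario: {scenario}"
    for field_name, instruction in MIRROR_STEPS:
        prompt = f"{context}\n\nStep ({field_name}): {instruction}"
        reply = ask(prompt)
        setattr(chain, field_name, reply)
        # Later steps see what the agent already recalled or anticipated.
        context += f"\n\n[{field_name}] {reply}"
    return chain
```

Passing a stub such as `lambda prompt: "..."` as `ask` lets the sequence be dry-run without any model calls. Carrying each reply forward is one design choice among several; it means the reflection step can draw on both the recalled memory and the anticipated reactions.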
Scenario: Disputed Ridge Sensor Deployment (Northern Svalbard)
We created a MIRROR-powered MAS around a scenario that mirrors current Arctic ambiguity:
A Russian-flagged scientific vessel deploys underwater sensors along a disputed Arctic ridge. The deployment is unannounced and may serve dual-use purposes (oceanographic research and submarine surveillance). NATO detects the activity. There is no immediate hostile action, but the window for strategic interpretation is narrow. How should Arctic actors respond?
Agents and Roles
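We simulate five distinct agents, each prompted through the MIRROR process: a Military Arctic Commander, a Russian Naval Analyst, an Indigenous Arctic Watchdog, a Danish Arctic Observer, and a NATO Strategic Analyst.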
Sample Output: Military Arctic Commander
Memory Recall:
“In 2021, we stood down after a Russian vessel crossed into our EEZ. That restraint led to backlash — and perceived softness.”
Theory of Mind:
“The Danish command will support a quiet ISR response. Indigenous leaders will object to any overt militarization. The Russians will test us further if we remain passive.”
Reflection:
“This isn’t about the sensor. It’s about precedent. If we ignore it, we normalize encroachment. But we can’t treat it like a military threat either. Aerial ISR, allied coordination, and Indigenous framing — this is how we assert presence without provoking.”
How We Run the System (LLM-Powered MAS)
Each agent runs through a structured prompt set in a dedicated LLM (GPT-4o or Claude 3.7) and receives:
Role profile
Scenario briefing
Memory snippets (manual or cosine-ranked)
Stakeholder context
A three-step MIRROR prompt sequence
The resulting inner thought chains (150–300 tokens) are then collected, compared, and analyzed side-by-side — revealing divergence in logic, anticipation, and risk tolerance.
The system is quick to reconfigure and is designed for reproducible, high-fidelity simulation of complex actor reasoning.
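The collection and side-by-side comparison of thought chains can be sketched as follows. This is a minimal illustration, assuming each agent's output is stored as a dictionary keyed by MIRROR step; the function names are hypothetical, and whitespace-separated words are used only as a crude proxy for model tokens when checking the 150–300 target length.

```python
STEPS = ("memory_recall", "theory_of_mind", "reflection")


def side_by_side(chains):
    """Transpose {role: {step: text}} into {step: {role: text}}.

    The result lets an analyst read every agent's answer to the same
    MIRROR step next to each other.
    """
    return {
        step: {role: chain.get(step, "") for role, chain in chains.items()}
        for step in STEPS
    }


def within_length(chains, low=150, high=300):
    """Report, per role, whether the full chain falls inside the target length.

    Word count is an approximation: real token counts depend on the
    model's tokenizer and will usually be somewhat higher.
    """
    result = {}
    for role, chain in chains.items():
        n_words = sum(len(chain.get(step, "").split()) for step in STEPS)
        result[role] = low <= n_words <= high
    return result
```

The transposed view only lines the chains up for reading; judging how far the agents' logic, anticipation, and risk tolerance diverge remains an analyst's task in this sketch.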
Expected Workshop Outcomes
At the workshop, this system can support:
Structured small-group foresight: Participants can role-play, refine, or challenge agent thought patterns.
Real-time reflection: Comparing simulated outputs with participant interpretations.
Second-order insight generation: What assumptions are being made about other actors? What could go wrong if those assumptions fail?
By externalizing reasoning — not just action — we make space for strategic empathy, narrative friction mapping, and foresight that reveals the politics of perception, not just operations.
Why This Matters
We think the most consequential decisions in Arctic security may not hinge on sensors or ships — but on how those deployments are interpreted.
MIRROR offers a framework to model those interpretations — as reasoning chains, not outcomes. As questions, not forecasts. As realistic strategic fictions, grounded in doctrine, memory, and perception.
At a time when Arctic complexity is intensifying — militarily, environmentally, and politically — this approach helps planners think in character, listen across boundaries, and shape action from understanding.
What Happened When We Ran the System
We tested the full MIRROR workflow using a multi-agent LLM system to simulate five Arctic actors interpreting the same scenario:
A Russian scientific vessel quietly deploying sensors along a disputed Arctic ridge near Svalbard. No announcements. No overt aggression. Just a move in the gray zone.
Each agent generated a three-part reasoning chain using the MIRROR framework: memory recall, theory of mind (stakeholder anticipation), and reflection.
Here’s what emerged:
Military Arctic Commander
“We’ve been here before. We stood down in 2021 and paid for it in media and perception. This isn’t about sensors — it’s about precedent. If we act too strongly, we risk escalation. If we don’t act, we normalize intrusion. Aerial ISR plus Danish coordination and Indigenous framing lets us hold the line without provoking.”
Pattern: Quiet assertion. Operational restraint shaped by political memory and alliance calibration.
Russian Naval Analyst
“NATO’s ISR will increase, but no one wants a confrontation over sensors. That’s why we used scientific cover. Canada is cautious, and Indigenous voices will oppose any overt move. Our goal is normalization. Let them talk. We’ve already shifted the boundary by acting first.”
Pattern: Intentional ambiguity. Assumes NATO will posture but hesitate. Uses soft tactics to reframe norms.
Indigenous Arctic Watchdog
“No one consulted us. Again. These sensors affect our waters, ecosystems, and people. Military posturing only makes things worse. We’ll push for environmental transparency and community oversight. If this turns into another sovereignty stunt, we’ll speak out.”
Pattern: Resentful exclusion. Focuses not on the state actors, but on who gets to define Arctic legitimacy.
Danish Arctic Observer
“Russia used soft power, not aggression. It’s deliberate. Canada will respond, but carefully. Our job is to reinforce stability. Quiet joint patrols with no press — that’s our mode. Greenland waters are close. Let’s show quiet cohesion.”
Pattern: Low-friction alliance maintenance. Interprets ambiguity as a signal to reinforce presence without noise.
NATO Strategic Analyst
“This isn’t isolated. It fits a broader pattern of sensor creep and civilian fronting. The danger isn’t the tech — it’s fragmented response. Recommend coordinated ISR posture with multi-national pre-briefs. Transparency is our advantage. Russia escalates with ambiguity. We de-escalate with clarity.”
Pattern: Narrative control. Emphasizes coherence over confrontation. Focuses on perception and deterrence architecture.
What We Learned
No two actors viewed the event the same way, despite working from the same facts.
Perception diverged by memory and mission: the Canadian Armed Forces (CAF) commander saw precedent, Russia saw normalization, NATO saw pattern convergence, and Indigenous actors saw exclusion.
Soft escalation and signaling friction became more visible than in traditional course-of-action (COA) planning.
Theory of Mind reasoning was the most divergent — each actor misjudged at least one other’s likely response.
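One way to make the Theory of Mind finding checkable is to compare what each agent expected of the others with what those others actually concluded. The sketch below assumes an analyst has already hand-coded each statement into a short stance label; the function name and the label scheme are hypothetical, and no labels from our run are reproduced here.

```python
def misjudgments(predicted, actual):
    """List every case where an agent's expectation of another agent was wrong.

    predicted: {observer: {target: stance_label}}, coded from each
               observer's Theory of Mind step.
    actual:    {target: stance_label}, coded from each target's own
               Reflection step.
    Returns tuples of (observer, target, expected_label, actual_label).
    """
    mismatches = []
    for observer, guesses in predicted.items():
        for target, expected in guesses.items():
            # Skip targets that were not simulated or not yet coded.
            if target in actual and expected != actual[target]:
                mismatches.append((observer, target, expected, actual[target]))
    return mismatches
```

For example, with placeholder labels, `misjudgments({"X": {"Y": "escalate"}}, {"Y": "restrain"})` returns `[("X", "Y", "escalate", "restrain")]`. The quality of the result depends entirely on how consistently the stance labels are coded.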
This is where MIRROR delivers value: it doesn’t tell you what will happen. It shows you how it might unfold — from the inside out.