
Soul Protocol

A Portable Standard for AI Companion Identity, Memory, Cognition, and Emotion.
Version 0.2.3 · March 2026

In 1995, Daniel Goleman argued that emotional intelligence matters more than IQ for human success. Thirty years later, AI memory systems still optimize purely for IQ. They treat persistence as a retrieval accuracy problem: find the most similar text, stuff it into context, move on. The emotional and cognitive dimensions that make memory human are ignored entirely.

Soul Protocol is an open standard for persistent AI companion identity. It combines a psychology-informed memory architecture with a portable file format. The protocol specification is 695 lines of Python. A reference runtime of 9,693 lines implements one opinionated version of the spec. Other runtimes can implement it differently.

A companion's full state — personality, memories, emotional bonds, learned skills, knowledge graph — serializes into a .soul file. The file belongs to the user. It works with any LLM. It survives platform changes.

We validated the protocol through six tiers of evaluation: 1,000 heuristic agent simulations, 100 LLM-backed agents, multi-judge quality tests across five models from four providers, a four-condition component ablation, a head-to-head benchmark against Mem0, and a 7-dimension Soul Health Score (SHS) framework. Soul-enabled agents scored 9.3/10 on emotional continuity (vs. 1.9 stateless) and 8.4/10 on long-range recall through 30+ turns of noise, and outperformed Mem0 by 2.5 points overall. The SHS framework grades a live soul instance at 90.2/100 with heuristic evaluation alone; with an LLM judge, sentiment accuracy rises from 70% to 97%.

01 The problem

Stateless AI

Most AI assistants start every conversation from zero. Users re-explain their preferences, their context, their history. Not because the technology can't store it, but because persistent identity hasn't been treated as a design priority.

When memory does exist, it's bolted on. A vector database holds conversation chunks. A RAG pipeline retrieves them. A summarization buffer compresses recent turns. These solve retrieval. They don't solve identity.

Retrieval is IQ. Memory is EQ.

Goleman distinguished between cognitive intelligence (IQ — the ability to process information) and emotional intelligence (EQ — the ability to process feeling, context, and relationships). Current AI memory systems are pure IQ. They ask: "What text is most similar to this query?"

But consider what actually determines whether a human memory sticks:

Cosine similarity captures none of this. Vector databases solve one problem well. Memory requires solving many problems at once. The missing ingredient isn't better retrieval. It's emotional and cognitive context.

The portability gap

Where persistent memory does exist, it's locked to one platform. A companion's history lives in OpenAI's infrastructure, or Anthropic's memory layer, or a custom database tied to one application. Switch providers, start over. Change apps, start over.

Cognee builds knowledge graphs but locks them to its runtime. Mem0 offers vector retrieval without identity. ERC-8004 defines reputation without cognition. No existing system combines portable identity, structured memory, and cognitive processing in an open standard.

02 Design: protocol, not product

HTTP doesn't tell you how to build your website. It defines how data moves between client and server. The spec is small. The implementations are infinite.

Soul Protocol follows the same principle:

spec/ (portable): 695 lines. Identity, MemoryStore, SoulContainer, .soul format, JSON Schemas.
runtime/ (opinionated): 9,693 lines. OCEAN personality, 5-tier memory, psychology pipeline, bonds, skills.
cli + mcp (batteries): 15 commands, 13 MCP tools, 3 resources. Rich TUI. Claude Code and Cursor ready.

spec/ defines the primitives any runtime must implement: Identity, MemoryStore interface, MemoryEntry format, SoulContainer, .soul file pack/unpack, EmbeddingProvider interface, EternalStorageProvider interface, and similarity functions. It depends on Pydantic. Nothing else.

runtime/ is one way to run the protocol. It implements OCEAN personality, five-tier memory with psychology-informed processing, knowledge graphs, a cognitive engine, emotional bonds, and skill progression. Other runtimes can implement the same spec/ interfaces with entirely different approaches.

What the protocol enforces

What it does not

The spec/ layer is 695 lines of data models and interface definitions, designed for porting to Go or Rust. JSON Schemas are auto-generated from the protocol models, so any language with a JSON Schema validator can read and write .soul files today without the Python SDK.

03 Architecture

Five-tier memory

Core (always loaded): Permanent facts: persona, values, bonded entity profile. Edits replace rather than append. Never decays.
Episodic (significance-gated): Timestamped interactions that pass the significance gate. Routine exchanges filtered out. Emotionally salient experiences preserved.
Semantic (fact extraction): Extracted facts: names, preferences, work context. Deduplicated. Conflicts tracked via superseded_by chain.
Procedural (how-to patterns): Learned patterns. How this person likes explanations. What approaches work. How-to knowledge.
Knowledge Graph: Entity-relationship graph with temporal edges. Each relationship carries valid_from and valid_to timestamps for point-in-time queries.

Archival memory compresses old conversations. Summaries and key moments remain searchable by keyword and date range. A compression pipeline handles deduplication and importance-based pruning.
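The temporal edges in the Knowledge Graph tier can be illustrated with a small point-in-time query. This is a minimal sketch, not the runtime's graph code: only valid_from and valid_to come from the text, while the TemporalEdge shape, the edges_at helper, and the convention that an open-ended edge has valid_to = None are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class TemporalEdge:
    """Hypothetical edge shape; only valid_from/valid_to come from the text."""
    source: str
    relation: str
    target: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # assumed: None means "still true"


def edges_at(edges: list[TemporalEdge], when: datetime) -> list[TemporalEdge]:
    """Return the edges whose validity interval [valid_from, valid_to) contains `when`."""
    return [
        e for e in edges
        if e.valid_from <= when and (e.valid_to is None or when < e.valid_to)
    ]


edges = [
    TemporalEdge("user", "works_at", "Acme", datetime(2023, 1, 1), datetime(2025, 6, 1)),
    TemporalEdge("user", "works_at", "Globex", datetime(2025, 6, 1)),
]
print([e.target for e in edges_at(edges, datetime(2024, 3, 15))])  # prints ['Acme']
print([e.target for e in edges_at(edges, datetime(2026, 1, 1))])   # prints ['Globex']
```

Treating the interval as half-open means a fact that ends at the same instant its successor starts never yields two answers for one query.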

The observe() pipeline

Every interaction passes through a processing pipeline before storage:

  1. Input. User input and agent output enter the pipeline.
  2. Sentiment Detection (Damasio: somatic markers). Tag emotional context: valence, arousal, and label.
  3. Significance Gate (LIDA: attention bottleneck). Is this worth remembering? Threshold: 0.35. Below = skip episodic storage.
  4. Memory updates:
     Episodic: if significant
     Semantic: fact extraction + conflict check
     Graph: entity extraction, temporal edges
     Self-Model (Klein): what does this say about who I am?

This pipeline is the EQ layer. It decides not just what to store, but whether to store, how to feel about it, and what it means for the soul's sense of self. Retrieval-only systems skip all of this.

Bond, encryption, evolution

Bond. Emotional attachment to a bonded entity. Strength ranges 0 to 100, increases through positive interactions (logarithmic growth — deep trust is hard-earned), weakens through neglect (linear decay — sharp). Interaction count and last-contact timestamps track the relationship.
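The shape of the bond curve (logarithmic growth, linear decay, range 0 to 100) can be sketched as follows. The constants and the bond_strength function are hypothetical; the text does not give the runtime's actual coefficients, so treat the numbers as placeholders that only show the two behaviors.

```python
import math

START_STRENGTH = 50.0  # assumed starting point, not taken from the spec
GROWTH_SCALE = 10.0    # assumed: points gained per unit of ln(1 + positives)
DECAY_PER_DAY = 0.5    # assumed: points lost per idle day


def bond_strength(positive_interactions: int, idle_days: float) -> float:
    """Logarithmic growth from positive interactions, linear decay from neglect."""
    grown = START_STRENGTH + GROWTH_SCALE * math.log1p(positive_interactions)
    decayed = grown - DECAY_PER_DAY * idle_days
    return max(0.0, min(100.0, decayed))  # clamp to the 0..100 range


# The first ten positive interactions add more than the next ten.
first_ten = bond_strength(10, 0) - bond_strength(0, 0)
next_ten = bond_strength(20, 0) - bond_strength(10, 0)
print(first_ten > next_ten)  # prints True
```

Because growth flattens while decay stays constant per idle day, a long absence costs a mature bond more than a few new interactions can win back, which matches the "hard-earned" framing above.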

Encryption at rest. .soul files can be password-encrypted with AES-256-GCM. Key derivation uses scrypt with OWASP-recommended parameters (n=2^17, r=8, p=1). The manifest stays readable for detection; all other files are encrypted. A wrong password raises a clear error, not silent corruption.
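The key-derivation step can be sketched with the standard library's hashlib.scrypt, using a cost parameter of 2^17 with r=8 and p=1 as the default. The derive_key name, the 16-byte salt, and where the salt is stored are assumptions; the AES-256-GCM step itself is omitted because it needs a third-party cryptography library.

```python
import hashlib
import os

KEY_LEN = 32  # 256-bit key, the size AES-256-GCM expects


def derive_key(password: str, salt: bytes, n: int = 2**17, r: int = 8, p: int = 1) -> bytes:
    """Derive a 32-byte key from a password with scrypt."""
    return hashlib.scrypt(
        password.encode("utf-8"),
        salt=salt,
        n=n,
        r=r,
        p=p,
        # scrypt needs roughly 128 * n * r bytes; raise the limit above
        # hashlib's 32 MiB default so n = 2**17 is accepted.
        maxmem=2 * 128 * n * r,
        dklen=KEY_LEN,
    )


salt = os.urandom(16)  # assumed: a random per-file salt kept with the ciphertext
# A smaller n is used here only to keep the demo fast; the default is 2**17.
key = derive_key("correct horse battery staple", salt, n=2**14)
print(len(key))  # prints 32
```

The same password and salt always yield the same key, so a wrong password produces a different key and authenticated decryption fails cleanly rather than returning garbage.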

GDPR-compliant deletion. Three methods: forget(query) for targeted deletion, forget_entity(entity) for cascading entity removal, and forget_before(timestamp) for time-based erasure. Deletions cascade across all memory tiers with an audit trail.

Evolution. Personality traits shift over time through supervised or autonomous mutation, within configurable bounds. A soul bonded to an introverted user might drift lower in extraversion over months. Changes require approval by default.

04 The psychology stack: EQ for AI

Goleman identified five components of emotional intelligence: self-awareness, self-regulation, motivation, empathy, and social skill. Soul Protocol doesn't implement all five. But it implements the memory and cognition foundations that make them possible, grounded in four established theories.

Somatic markers (Damasio, 1994)

Damasio's somatic marker hypothesis: emotions aren't separate from cognition. They're signals that guide memory formation and decision-making. A bad experience leaves a "gut feeling" that shapes future choices before conscious reasoning kicks in.

In Soul Protocol, every interaction gets a somatic marker:

valence | float | -1.0 (negative) to 1.0 (positive)
arousal | float | 0.0 (calm) to 1.0 (intense)
label | string | "joy", "frustration", "curiosity"

A heated debugging session at midnight: high arousal, moderate negative valence. A breakthrough: high arousal, high positive valence. A routine greeting: near-zero on both axes. These markers travel with the memory. Emotional context boosts retrieval priority.
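A somatic marker with these three fields could be modeled as below. This is a sketch using a plain dataclass to stay dependency-free (the spec layer itself depends on Pydantic); the range checks and the example values are illustrative assumptions, not values from the runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SomaticMarker:
    valence: float  # -1.0 (negative) to 1.0 (positive)
    arousal: float  # 0.0 (calm) to 1.0 (intense)
    label: str      # e.g. "joy", "frustration", "curiosity"

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges.
        if not -1.0 <= self.valence <= 1.0:
            raise ValueError(f"valence out of range: {self.valence}")
        if not 0.0 <= self.arousal <= 1.0:
            raise ValueError(f"arousal out of range: {self.arousal}")


# Illustrative values for the three situations described above.
midnight_debugging = SomaticMarker(valence=-0.4, arousal=0.8, label="frustration")
breakthrough = SomaticMarker(valence=0.9, arousal=0.9, label="joy")
routine_greeting = SomaticMarker(valence=0.0, arousal=0.05, label="neutral")
```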

ACT-R activation decay (Anderson, 1993)

Human memory follows a power law. Recent and frequently accessed memories are more available. Old, untouched memories fade. Not deleted, just harder to reach.

ACT-R Activation
base_level = ln( Σ t_j^(-0.5) ) # power-law decay over access history
spreading = token_overlap(query, content) # relevance to current query
emotional = arousal + |valence| × 0.3 # somatic boost

activation = 1.0 × base_level + 1.5 × spreading + 0.5 × emotional

A memory recalled twice this morning outranks an "important" memory from last week that was never revisited. This is the opposite of how most AI memory systems work, where importance is a static score assigned once at storage time.
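The activation formula can be written out directly. In this sketch the ages are assumed to be hours since each past access, and token_overlap is assumed to mean the share of query tokens found in the content; the text gives neither the time unit nor the exact overlap definition.

```python
import math


def base_level(ages: list[float], decay: float = 0.5) -> float:
    """ln(sum of t_j^-decay) over the time since each past access (ages > 0)."""
    return math.log(sum(t ** -decay for t in ages))


def token_overlap(query: str, content: str) -> float:
    """Assumed definition: share of query tokens that also occur in content."""
    q, c = set(query.lower().split()), set(content.lower().split())
    return len(q & c) / len(q) if q else 0.0


def activation(ages: list[float], query: str, content: str,
               valence: float, arousal: float) -> float:
    emotional = arousal + abs(valence) * 0.3  # somatic boost, as written above
    return (1.0 * base_level(ages)
            + 1.5 * token_overlap(query, content)
            + 0.5 * emotional)


# Recalled twice this morning (1 h and 3 h ago), emotionally neutral.
recent = activation([1.0, 3.0], "api design", "notes on api design", 0.0, 0.0)
# Stored a week ago (168 h), never revisited, but emotionally intense.
old = activation([168.0], "api design", "notes on api design", 1.0, 1.0)
print(recent > old)  # prints True
```

Even with the maximum emotional boost, the week-old memory loses here because its base level is far lower than that of the twice-recalled one.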

LIDA significance gate (Franklin, 2003)

Not everything deserves to become a memory. The LIDA model proposes an attention bottleneck: most input is discarded, only significant events enter long-term storage.

LIDA Significance
significance = 0.3 × novelty
             + 0.2 × emotional_intensity
             + 0.2 × goal_relevance
             + 0.3 × content_richness

Threshold: 0.35

Below threshold, fact extraction still runs, but the interaction doesn't enter episodic memory. "Hello" doesn't clutter the store. "I just got promoted" does. This gate is the primary defense against memory bloat in long-running companions.
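The weighted sum and threshold translate into a few lines. This sketch assumes each input is already scored on a 0 to 1 scale; how the runtime computes novelty, goal relevance, and the other inputs is not shown, and the example scores are made up.

```python
SIGNIFICANCE_THRESHOLD = 0.35


def significance(novelty: float, emotional_intensity: float,
                 goal_relevance: float, content_richness: float) -> float:
    """Weighted sum from the formula above; weights add up to 1.0."""
    return (0.3 * novelty
            + 0.2 * emotional_intensity
            + 0.2 * goal_relevance
            + 0.3 * content_richness)


def passes_gate(score: float) -> bool:
    """Only scores below the threshold skip episodic storage."""
    return score >= SIGNIFICANCE_THRESHOLD


greeting = significance(0.1, 0.0, 0.0, 0.1)   # illustrative scores for "Hello"
promotion = significance(0.8, 0.9, 0.5, 0.6)  # illustrative scores for "I just got promoted"
print(passes_gate(greeting), passes_gate(promotion))  # prints False True
```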

Klein's self-concept (Klein, 2004)

Klein's theory: self-knowledge isn't programmed in. It's discovered from accumulated experience. Someone who helps with code hundreds of times develops the self-concept "I'm a technical person." This belief then shapes how they interpret future interactions.

The soul starts with no taxonomy. Domains emerge from interaction patterns:

Klein Confidence Curve
confidence = min(0.95, 0.1 + 0.85 × (1 - 1/(1 + evidence × 0.1)))

After 1 interaction: ~18% confidence
After 10 interactions: ~53% confidence
After 50 interactions: ~81% confidence

The curve never reaches 1.0. Uncertainty is permanent.
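The confidence curve is a one-line function, which makes it easy to evaluate at any evidence count. The function name domain_confidence is hypothetical; the expression is the one given above.

```python
def domain_confidence(evidence: int) -> float:
    """Saturating confidence curve: starts at 0.1 and approaches 0.95."""
    return min(0.95, 0.1 + 0.85 * (1 - 1 / (1 + evidence * 0.1)))


for count in (1, 10, 50, 1000):
    print(count, round(domain_confidence(count), 3))
```

Since 0.1 + 0.85 tops out at 0.95, the curve approaches that ceiling without reaching it for any finite evidence count.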

When an LLM is available, the self-model step uses it for genuine reflection, reasoning about identity rather than counting keywords.

05 The CognitiveEngine

The runtime defines a CognitiveEngine base class:

class CognitiveEngine:
    async def think(self, prompt: str) -> str: ...

Any LLM works. Claude, GPT, Gemini, Ollama, a local model. The soul uses it for sentiment detection, significance assessment, fact extraction, self-reflection, and memory consolidation.

When no LLM is available, a HeuristicEngine provides deterministic fallback. Word-list sentiment, formula-based significance, regex fact extraction. The heuristics won't hallucinate, won't call external APIs, and won't crash.

One integration point. Consumers provide a brain. The soul handles the rest.
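A consumer-side engine might look like the following. WordListEngine and its word lists are hypothetical stand-ins in the spirit of a word-list fallback; they are not the runtime's HeuristicEngine, and the prompt format a real soul sends to think() is not specified here.

```python
import asyncio


class CognitiveEngine:
    async def think(self, prompt: str) -> str:
        raise NotImplementedError


class WordListEngine(CognitiveEngine):
    """Hypothetical deterministic engine: no network calls, no model."""

    POSITIVE = {"great", "love", "promoted", "happy"}
    NEGATIVE = {"broken", "hate", "cancelled", "frustrated"}

    async def think(self, prompt: str) -> str:
        words = set(prompt.lower().split())
        score = len(words & self.POSITIVE) - len(words & self.NEGATIVE)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"


print(asyncio.run(WordListEngine().think("I just got promoted")))  # prints positive
```

An LLM-backed engine would subclass the same base and forward the prompt to its provider's API inside think().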

06 The .soul file format

A .soul file is a ZIP archive:

aria.soul (ZIP, DEFLATED)
├── manifest.json     # version, soul ID, export timestamp, checksums
├── soul.json         # SoulConfig: identity, DNA, evolution state
├── state.json        # current mood, energy, focus, social battery
├── memory/
│   ├── core.json       # persona + bonded-entity profile
│   ├── episodic.json   # interaction history (significance-gated)
│   ├── semantic.json   # extracted facts (with conflict resolution)
│   ├── procedural.json # learned patterns
│   ├── graph.json      # temporal entity relationships
│   └── self_model.json # emergent domain confidence scores
└── dna.md            # human-readable personality blueprint

Rename it to .zip. Open it with any archive tool. Read the JSON. Load it with Claude today, Ollama tomorrow. One file, full state, no external database, no cloud dependency.

JSON Schemas generated from the protocol models enable validation in any language. You don't need the Python SDK to work with .soul files.
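Because a .soul file is an ordinary ZIP archive, the standard library is enough to inspect one. The sketch below builds a tiny stand-in archive in memory and reads its manifest back; the manifest fields shown are placeholders, not the real schema, and read_manifest is a hypothetical helper.

```python
import io
import json
import zipfile


def read_manifest(soul_bytes: bytes) -> dict:
    """Open a .soul archive as a plain ZIP and parse manifest.json."""
    with zipfile.ZipFile(io.BytesIO(soul_bytes)) as archive:
        return json.loads(archive.read("manifest.json"))


# Build a stand-in archive in memory; these manifest fields are placeholders.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("manifest.json", json.dumps({"version": "0.0.0", "soul_id": "demo"}))
    archive.writestr("memory/core.json", json.dumps({}))

manifest = read_manifest(buffer.getvalue())
print(manifest["soul_id"])  # prints demo
```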

07 Personality

Most AI systems treat personality as a system prompt string. Soul Protocol treats it as structured data:

Personality (OCEAN Big Five):
  openness:          0.85
  conscientiousness: 0.72
  extraversion:      0.45
  agreeableness:     0.78
  neuroticism:       0.30

CommunicationStyle:
  warmth:       "moderate"
  verbosity:    "moderate"
  humor_style:  "dry"
  emoji_usage:  "none"

Biorhythms:
  chronotype:        "neutral"
  social_battery:    72.0
  energy_regen_rate: 5.0

Numeric traits generate a reproducible system prompt. Traits can shift over time through supervised mutation. A soul bonded to an introverted user for months might drift lower in extraversion. The personality adapts to the relationship.

OCEAN is a runtime choice, not a protocol requirement. The spec/ layer defines Identity as schema-free key-value pairs. Other runtimes can use Myers-Briggs, Enneagram, or something custom.

The protocol defines a pluggable embedding interface:

from typing import Protocol

class EmbeddingProvider(Protocol):
    def embed(self, text: str) -> list[float]: ...
    @property
    def dimensions(self) -> int: ...

The runtime ships two reference implementations: a deterministic MD5-based HashEmbedder (zero dependencies, good for testing) and a TF-IDF embedder (corpus-fitted, good for domain-specific retrieval without external APIs).

A VectorSearchStrategy connects embeddings to the memory system with index caching, threshold filtering, and cosine similarity ranking. Swap in OpenAI embeddings, Cohere, or a local model by implementing the interface.

Similarity functions (cosine, euclidean, dot product) live in spec/ with vector length guards. Mismatched dimensions raise errors instead of silently truncating.
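A provider that satisfies the embedding interface, plus a cosine function with a length guard, can be sketched as follows. Md5Embedder is a hypothetical illustration of a deterministic hash embedder (one float per digest byte), not the runtime's HashEmbedder, and cosine_similarity here is a plain re-implementation rather than the spec's own function.

```python
import hashlib
import math


class Md5Embedder:
    """Hypothetical hash embedder: one float per MD5 digest byte."""

    @property
    def dimensions(self) -> int:
        return 16  # an MD5 digest is 16 bytes

    def embed(self, text: str) -> list[float]:
        digest = hashlib.md5(text.encode("utf-8"), usedforsecurity=False).digest()
        return [byte / 255.0 for byte in digest]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity that refuses mismatched dimensions."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} != {len(b)}")
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return sum(x * y for x, y in zip(a, b)) / norm


embedder = Md5Embedder()
vector = embedder.embed("prefers GraphQL over REST")
print(len(vector) == embedder.dimensions)  # prints True
```

Hash-based vectors are stable across runs but carry no semantic meaning, which is why this style of embedder suits tests rather than real retrieval.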

09 Eternal storage

Souls can be archived to decentralized storage:

Local  ->  IPFS  ->  Arweave  ->  Chain
(free)   (pinned)  (permanent)  (proof)

The protocol defines an EternalStorageProvider interface with archive(), recover(), and verify() methods. The runtime includes an EternalStorageManager that orchestrates multi-tier archival with fallback recovery.

Current providers are mocks (content-addressed CID simulation, transaction ID generation) for testing. Production IPFS and Arweave integrations are planned.
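A mock provider in this spirit might look like the sketch below. Only the method names archive(), recover(), and verify() come from the text; their signatures, the InMemoryEternalStorage class, and the use of a SHA-256 hex digest as the content identifier are assumptions for illustration.

```python
import hashlib


class InMemoryEternalStorage:
    """Hypothetical mock provider keyed by a SHA-256 content hash."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def archive(self, data: bytes) -> str:
        """Store the bytes and return their content-derived identifier."""
        content_id = hashlib.sha256(data).hexdigest()
        self._blobs[content_id] = data
        return content_id

    def recover(self, content_id: str) -> bytes:
        """Return the stored bytes; raises KeyError for unknown identifiers."""
        return self._blobs[content_id]

    def verify(self, content_id: str) -> bool:
        """Check that stored bytes still hash to their identifier."""
        data = self._blobs.get(content_id)
        return data is not None and hashlib.sha256(data).hexdigest() == content_id


storage = InMemoryEternalStorage()
cid = storage.archive(b"soul bytes")
print(storage.verify(cid), storage.recover(cid) == b"soul bytes")  # prints True True
```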

CLI: soul archive, soul recover, soul eternal-status.

10 Comparison

System | Solves | Doesn't solve
Mem0 | Persistent vector memory | Identity, personality, portability, cognitive processing
Cognee | Knowledge graphs, domain isolation | Portability, identity, emotional memory
MemGPT / Letta | Context window management | Personality, portable files, emotional memory
LangChain Memory | RAG retrieval | Significance filtering, self-model, portability
OpenAI Memory | User facts per-account | Platform lock-in, personality, portable export
ANP | Agent discovery, communication | Memory, identity persistence, cognition
ERC-8004 | On-chain agent reputation | Memory, personality, cognition
MCP | Tool integration for LLMs | Complementary. Soul Protocol ships an MCP server.

Head-to-head: Soul Protocol vs. Mem0

We ran both systems on identical conversations, with the same LLM judge scoring both.

Test | Soul Protocol | Mem0 (v1.0.5) | Stateless
Hard Recall | 7.8 | 5.1 | 4.2
Emotional Continuity | 9.2 | 7.0 | 1.8
Overall | 8.5 | 6.0 | 3.0

Mem0 is a good memory system. It captures facts, retrieves them, and substantially outperforms a stateless baseline. But it doesn't track emotional arcs or personality. When asked "how do you think this whole experience has been for me?", Mem0 recognized the user's situation but missed the full emotional trajectory. Soul Protocol captured the journey from excitement through devastation to cautious recovery, because somatic markers traveled with the memories.

The gap isn't about retrieval quality. It's about what gets stored alongside the facts.

Soul Protocol is not a retrieval replacement. It's a layer that sits alongside retrieval. The psychology pipeline decides what to store and how to score it. Vector search, graphs, and RAG plug in through provider interfaces.

11 Current implementation

Python 3.12. Open source. MIT license.

Working

Not working yet

12 Empirical validation

We validated Soul Protocol through six tiers of evaluation, from systems-level correctness to live soul instance grading. Total cost: under $5.

Tier 1: Systems validation

1,000 agents, zero cost

1,000 agents with randomized OCEAN personalities processed 5 multi-turn scenarios each across four use cases (customer support, coding assistant, personal companion, knowledge worker). No LLM. Pure heuristic engine.

Metric | No Memory | With Memory
Recall hit rate | 0.0% | 82.0%
Recall precision | 0.0% | 19.6%
Bond growth | 50.0 | 57.2
Skills discovered | 0 | 0.2

The binary result confirms the pipeline works. Memory storage, retrieval, bond updates, and skill discovery all function correctly at scale. 20,000 scenario runs, zero failures.

Tier 2: LLM validation

100 agents, $2.20

We repeated the run with Claude Haiku as the cognitive engine, using real API calls for sentiment detection, fact extraction, significance scoring, and entity extraction. The LLM engine extracted 2.5x more memories per agent (12.4 vs. 5.0). Recall hit rate stayed identical because the test scenarios were designed for heuristic-level difficulty; the additional memories would become relevant in longer, more complex conversations. The run used 2,500 API calls and cost $2.20 in total.

Tier 3: Quality validation

5 judges, 4 providers

Four targeted tests, each judged by five models from four providers: Claude Haiku (Anthropic), Gemini 3 Flash and Gemini 2.5 Flash Lite (Google), DeepSeek V3 (DeepSeek), and Llama 3.3 70B (Meta). Responses randomly assigned to positions A/B to prevent position bias.

Test | Soul | Baseline | Gap | Winner
Response Quality | 8.8 | 6.5 | +2.3 | 5/5 Soul
Personality Consistency | 9.0 | 5.0 | +4.0 | 5/5 Soul
Hard Recall | 8.5 | 4.8 | +3.7 | 5/5 Soul
Emotional Continuity | 9.7 | 1.9 | +7.8 | 5/5 Soul
Overall | 9.0 | 4.5 | +4.5 | 20/20 Soul

Every single judgment favored soul-enabled agents. All twenty. Across model families that compete with each other commercially. Inter-judge standard deviation stayed below 0.8.

The emotional continuity test produced the largest gap. Three judges gave the soul response a perfect 10/10. The soul tracked an 8-turn emotional arc (excited → devastated → angry → recovering → cautiously optimistic) and reflected the full journey back to the user. The baseline scored 1.9, essentially admitting it had no context.

The hard recall test planted a fact ("prefers GraphQL over REST") at turn 3, buried it under 30 unrelated interactions, then probed at turn 34 with an indirect question about API architecture. The soul recalled the fact at rank 1 in four out of five runs and wove it naturally into the response. The baseline gave generic advice.

Tier 4: Component ablation

25 scenario variations

Which components actually matter? We ran a four-condition ablation with randomized scenarios (SEED=42):

  1. Full Soul — personality + significance-weighted memory with somatic markers and bond context
  2. RAG Only — same recalled facts, but generic prompt and stripped emotional framing
  3. Personality Only — OCEAN-modulated prompt, no memory context
  4. Bare Baseline — generic prompt, no memory, no personality

Test | Full Soul | RAG Only | Personality Only
Response Quality (n=10) | 8.3 ± 0.3 | 7.8 ± 0.3 | 7.8 ± 0.4
Hard Recall (n=5) | 8.4 ± 0.4 | 8.2 ± 0.2 | 5.9 ± 0.7
Emotional Continuity (n=10) | 9.3 ± 0.2 | 9.3 ± 0.2 | 7.2 ± 0.7
Overall | 8.7 ± 0.2 | 8.4 ± 0.2 | 7.0 ± 0.4

The ablation reveals something interesting: memory and personality contribute differently depending on the task.

For hard recall, memory is the driver. RAG Only (8.2) captures most of the gain. Personality Only (5.9) barely helps, because personality doesn't help you remember facts.

For emotional continuity, retrieved emotional context matters most. RAG Only matches Full Soul at 9.3. Personality Only reaches 7.2. The emotional arc was stored in memory, and retrieval surfaced it.

For response quality, the gap narrows. Either memory or personality provides substantial benefit (both 7.8), and Full Soul (8.3) adds a modest lift by combining them.

The key finding: Full Soul consistently matches or exceeds individual components. The integrated approach never hurts.

Tier 5: Mem0 comparison

See Section 10 for the head-to-head results. Soul Protocol outperformed Mem0 by 2.5 points overall, with the largest gap in emotional continuity (+2.2) where Mem0 captured facts but not emotional arcs.

Tier 6: Soul Health Score

7 dimensions, zero cost

Does a live soul instance actually work? The Soul Health Score (SHS) is a 0–100 composite across seven psychology-informed dimensions. Spin up a soul, run it through structured scenarios, grade the results.

Dimension | Score | What it measures
D1: Memory Recall | not run | Long-horizon fact retrieval (requires extended scenarios)
D2: Emotional Intelligence | 72.8 | Sentiment accuracy, significance gating, mood dynamics, emotional arc coherence
D3: Personality Expression | 96.0 | OCEAN prompt fidelity, communication style, value alignment, trait stability
D4: Bond / Relationship | 100.0 | Logarithmic growth curve, positive reinforcement, interaction tracking
D5: Self-Model | 88.0 | Domain classification accuracy, emergence timing, confidence calibration
D6: Identity Continuity | 100.0 | Export/import round-trip fidelity, memory count preservation, state recovery
D7: Portability | 100.0 | Engine-independent design verification
Composite SHS | 90.2 | Weighted average (D1 excluded)

The entire eval suite runs without an LLM. No API calls, no cost, fully reproducible. The heuristic baseline is the honest floor, not the ceiling.

LLM judge validation. When we ran the same scenarios with Claude Haiku as an evaluator, sentiment accuracy jumped from 70% to 97%. The 17 items the heuristic missed were all context-dependent emotions that no word-list can detect ("my daughter took her first steps" → joy, "they cancelled the project I poured six months into" → frustration). The architecture handles them correctly when a real LLM is plugged in. The gap between 70% and 97% isn't a flaw — it's the difference between the HeuristicEngine fallback and a CognitiveEngine with actual reasoning.

Psychology stack validation

We validated the psychology foundations through 475+ heuristic-only interactions:

Portability

A soul carrying 40 conversations serialized into a 4,293-byte .soul file. After re-awakening: every count matched (episodic, semantic, graph). Recall behavior was identical. Nothing lost in transit.

13 Roadmap

v0.6.0: Learning events
A soul that only records what happened is a log file. A soul that records what it learned is a cognitive system. Learning Events formalize the feedback loop: trigger, lesson, confidence, source, and domain. These events travel in the .soul file. When imported into a new runtime, the agent starts with accumulated knowledge instead of from zero.
v0.7.0: Domain isolation and open layers
Memory layers should be user-defined namespaces, not a hardcoded enum. Domains should isolate context: a billing agent shouldn't see legal memory. The protocol defines the namespace mechanism. The runtime provides defaults.
v0.8.0: Trust chain
Cryptographic verification of a soul's history. Every memory write, personality mutation, and learning event gets signed and appended to a Merkle-verifiable chain. A soul can prove what it learned, when, and from what source.
Beyond: The long view
Conway autobiographical hierarchy (episodes to general events to lifetime narrative). Multi-soul communication and shared memory. Federation protocol. Production IPFS/Arweave integration. Protocol implementations in Go and Rust.

14 Conclusion

The problem with current AI memory isn't storage capacity or retrieval speed. These systems weren't designed to model a mind. They were designed to model a search engine.

Human memory is selective. It tags experiences with feeling. It strengthens what gets recalled and lets the rest fade. It gradually forms beliefs about who you are. A vector database bolted onto an LLM does none of this.

The empirical evidence supports this. When we tested Soul Protocol against a stateless baseline across five judge models from four competing providers, every single judgment favored the soul-enabled agent. All twenty. The largest gain came from emotional continuity: tracking the user's emotional arc and reflecting it back produced a 7.8-point improvement. When we benchmarked against Mem0, Soul Protocol led by 2.5 points overall. Not because our retrieval is better, but because we store emotional context alongside facts.

The component ablation told us which pieces matter. Memory retrieval drives recall. Emotional context drives continuity. Personality drives consistency. No single component is sufficient. The integrated approach consistently matches or exceeds any individual component.

The Soul Health Score framework gave us a different lens: end-to-end grading of a live soul instance across seven dimensions. The reference implementation scores 90.2/100 without any LLM. Bond tracking, identity continuity, and portability scored perfect marks. Personality expression hit 96%. The weakest link is heuristic sentiment detection at 70% accuracy — but when we plugged in Claude Haiku as a judge, accuracy jumped to 97%. The architecture is correct. The heuristic fallback is honest about its limitations instead of hiding them.

Goleman's argument was that the qualities that make humans effective aren't cognitive, they're emotional: self-awareness, the ability to read context, the capacity to learn from experience rather than just store it. The same argument applies to AI companions. The ones that will feel real, that users will actually bond with, won't be the ones with the best retrieval precision. They'll be the ones that remember what matters, forget what doesn't, and slowly become more themselves.

The protocol is 695 lines. Everything else is one implementation. We want others to build their own.

References

  1. Goleman, D. (1995). Emotional Intelligence: Why It Can Matter More Than IQ. Bantam Books.
  2. Damasio, A. (1994). Descartes' Error: Emotion, Reason, and the Human Brain. Putnam.
  3. Anderson, J.R. (1993). Rules of the Mind. Lawrence Erlbaum Associates.
  4. Franklin, S. et al. (2003). IDA: A Cognitive Agent Architecture. IEEE International Conference on Systems, Man and Cybernetics.
  5. Klein, S.B. (2004). The Cognitive Neuroscience of Knowing One's Self. In M.S. Gazzaniga (Ed.), The Cognitive Neurosciences III.
