May 23, 2026

Avatar

Public domain novel characters as AI agents. RAG + cognitive engine over the original text -- talk to Sherlock Holmes.

Overview

Avatar is a system in which characters from famous novels become AI agents and converse with users. It maintains each character’s personality, speech patterns, knowledge, and emotions based on the original novel text.

This is not simple role-play. The system automatically extracts the character’s personality (Big Five), speech patterns, values, and relationships from the novel text, retrieves relevant scenes via RAG, and uses a cognitive engine to generate character-faithful responses.

The first character is Sherlock Holmes (The Adventures of Sherlock Holmes, Arthur Conan Doyle, 1892). The MVP extracts 2,187 dialogues from the book’s 12 short stories and indexes the text as 332 vector chunks.

Architecture

graph LR
    A[Gutenberg<br/>Novel Text] -->|Download/Parse| B[Ingestion Pipeline]
    B -->|Profile Extraction| C[Character Profile<br/>Big Five · Speech · Relations]
    B -->|Chunk Indexing| D[ChromaDB<br/>332 chunks]
    E[User Input] --> F[Character Cognitive Engine]
    D -->|RAG Search| F
    C -->|Personality/Speech| F
    G[Conversation Memory<br/>Last 20 turns] -->|History| F
    F -->|System Prompt| H[Claude CLI]
    H -->|Response| I[CLI / Web]

Full System Detail

graph TD
    subgraph Ingestion["Novel Ingestion Pipeline"]
        I1[Download from Gutenberg] --> I2[Strip Header/Footer]
        I2 --> I3[Split into 12 Stories]
        I3 --> I4[Extract Dialogues<br/>2,187 · Speaker-tagged]
        I3 --> I5[Chunk Text<br/>400 words · 80 overlap]
        I3 --> I6[Extract Character Profile<br/>via Claude CLI]
    end

    subgraph Engine["Character Cognitive Engine"]
        E1[Receive User Input] --> E2[RAG Search<br/>Top 3 Scenes]
        E2 --> E3[Load Profile<br/>Big Five · Speech · Relations]
        E3 --> E4[Build System Prompt]
        E4 --> E5[Call Claude CLI]
        E5 --> E6[Update Memory]
    end

    subgraph Memory["Memory System"]
        M1[Working Memory<br/>Current Conversation]
        M2[Episodic Memory<br/>Past Sessions]
        M3[Semantic Memory<br/>Character Knowledge · Novel Facts]
    end

    I4 --> DB[(data/processed/)]
    I5 --> VDB[(ChromaDB)]
    I6 --> Prof[(data/profiles/)]

    VDB --> E2
    Prof --> E3
    M1 --> E4

Tech Stack

| Category | Tech | Why |
| --- | --- | --- |
| Language | Python 3.11+ | Rich NLP ecosystem |
| LLM Backend | Claude Code CLI (claude -p) | No API key needed, local CLI |
| Vector DB | ChromaDB (cosine similarity) | Simple setup, local embeddings |
| Embeddings | all-MiniLM-L6-v2 (ONNX) | Lightweight, local |
| Text Source | Project Gutenberg | Public domain, free full text |
| Knowledge Graph | NetworkX + JSON | Lightweight, fast prototyping |
| CLI UI | Rich | Colorful terminal output |
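
The LLM Backend row can be sketched as a thin wrapper around the CLI's print mode. This is a minimal sketch rather than the project's own code: `call_claude`, the injectable `runner` parameter, and the timeout are assumptions made for illustration. Only the `claude -p` invocation comes from the table.

```python
import subprocess
from typing import Callable


def call_claude(prompt: str,
                runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
                timeout_s: float = 120.0) -> str:
    """Send one prompt to the local CLI in print mode and return its stdout."""
    # `claude -p <prompt>` is the invocation named in the table; the timeout
    # and the error handling below are assumptions of this sketch.
    result = runner(
        ["claude", "-p", prompt],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"claude exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout.strip()
```

Passing `runner` explicitly lets the rest of the pipeline be exercised with a fake process runner, so tests do not need the CLI installed.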

Pipeline

1. Ingestion — Novel Text Collection & Parsing

Downloads the original text from Gutenberg and transforms it into structured data.

  • Download: Fetch full text from Project Gutenberg via requests
  • Preprocessing: Auto-strip Gutenberg header/footer, Unicode normalization
  • Story Splitting: Auto-split 12 stories by Roman numeral + uppercase title patterns
  • Dialogue Extraction: Unicode curly quote matching (“...”), speaker tagging (said/cried/asked + name patterns)
  • Vector Indexing: 400-word chunks (80-word overlap) indexed in ChromaDB with cosine similarity
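
The preprocessing, dialogue extraction, and chunking steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's parser: the function names, the `*** START OF ... ***` marker patterns (typical of Gutenberg plain-text files), the 60-character attribution window, and the simplified speaker regex are all hypothetical. The 400-word chunk size, 80-word overlap, and the said/cried/asked verbs come from the list.

```python
import re
import unicodedata

# Gutenberg plain-text files usually wrap the book body in marker lines.
_START = re.compile(
    r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.S
)
_END = re.compile(r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK")

# One curly-quoted utterance, and a "said Holmes"-style attribution.
_QUOTE = re.compile(r"“([^”]+)”")
_SPEAKER = re.compile(r"\b(?:said|cried|asked)\s+([A-Z][a-z]+(?: [A-Z][a-z]+)?)")


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Normalize Unicode and keep only the text between the marker lines."""
    text = unicodedata.normalize("NFC", raw)
    start, end = _START.search(text), _END.search(text)
    begin = start.end() if start else 0
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()


def extract_dialogues(text: str) -> list[dict]:
    """Return each quoted utterance with a speaker when one is attributed."""
    dialogues = []
    for match in _QUOTE.finditer(text):
        # Only look shortly after the closing quote for the attribution.
        tail = text[match.end():match.end() + 60]
        speaker = _SPEAKER.search(tail)
        dialogues.append({
            "text": match.group(1).strip(),
            "speaker": speaker.group(1) if speaker else None,
        })
    return dialogues


def chunk_words(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Split text into word windows that share `overlap` words with the next."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap  # 320 new words per chunk with the defaults
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Attribution forms such as “said he” or a speaker named before the quote are not handled here, so a production parser would need more patterns than this sketch shows.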

2. Character Profile Extraction

Uses Claude CLI to automatically extract character profiles from the novel text.

| Field | Extracted Content (Holmes example) |
| --- | --- |
| Big Five | Openness 0.95 · Conscientiousness 0.82 · Extraversion 0.30 · Agreeableness 0.25 · Neuroticism 0.40 |
| Speech Patterns | “Pray take a seat” · “Quite so!” · “What do you make of that?” + 10 patterns |
| Values | Intellectual rigor, pursuit of truth, beauty of the bizarre, personal autonomy + 8 values |
| Relationships | Watson (affection behind condescension), Irene Adler (singular respect), Moriarty + 5 relations |
| Reasoning Style | Empirical observation → deductive inference, Socratic reveal |
| Emotional Tendencies | Emotional suppression, intellectual excitement, cocaine-ambition oscillation |
| Habits | Violin, cocaine, disguise, irregular hours, real-time reference lookup |

Each field is stored with source text evidence for traceability.

3. Character Cognitive Engine

The cognitive engine acts as the character’s “brain.” Each turn follows this process:

  1. RAG Search — Retrieve 3 relevant novel scenes from ChromaDB
  2. Conversation History — Load last 20 turns of context
  3. System Prompt Construction — Combine profile (Big Five, speech, values, relationships, habits, reasoning style) + RAG context + history into a single prompt
  4. Claude CLI Call — Generate character response from the composed prompt
  5. Memory Update — Add new dialogue to history
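
The five steps above can be sketched as a single method. This is a minimal sketch, assuming the retriever, prompt builder, and LLM call are injected as plain callables. The class name, parameter names, and the choice to count one user/reply pair as one turn are assumptions, not the project's engine.py.

```python
from collections import deque
from typing import Callable

Turn = tuple[str, str]  # (user input, character reply)


class CharacterEngine:
    def __init__(self,
                 profile: dict,
                 retrieve: Callable[[str, int], list[str]],
                 build_prompt: Callable[[dict, list[str], list[Turn]], str],
                 generate: Callable[[str], str],
                 max_turns: int = 20,
                 top_k: int = 3) -> None:
        self.profile = profile
        self.retrieve = retrieve
        self.build_prompt = build_prompt
        self.generate = generate
        self.top_k = top_k
        # A bounded deque drops the oldest turn once max_turns is reached.
        self.history: deque[Turn] = deque(maxlen=max_turns)

    def respond(self, user_input: str) -> str:
        scenes = self.retrieve(user_input, self.top_k)        # 1. RAG search
        history = list(self.history)                          # 2. recent turns
        system_prompt = self.build_prompt(                    # 3. compose prompt
            self.profile, scenes, history)
        reply = self.generate(                                # 4. LLM call
            f"{system_prompt}\n\nUser: {user_input}")
        self.history.append((user_input, reply))              # 5. memory update
        return reply
```

Keeping the three collaborators as callables means the loop can be tested with stubs before ChromaDB or the CLI is wired in.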

The system prompt includes 10 behavioral rules: character maintenance, reasoning demonstration, knowledge boundary enforcement, user language adaptation, etc.
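
Prompt construction might look like the sketch below. The section labels, the `build_system_prompt` name, and the wording of the rules are hypothetical. Only four of the ten rule topics are named in the text, so only those four appear here, with invented phrasing.

```python
PROFILE_SECTIONS = ("big_five", "speech_patterns", "values",
                    "relationships", "habits", "reasoning_style")

# Hypothetical wording for the four rule topics named in the text.
BEHAVIOR_RULES = (
    "Stay in character at all times.",
    "Show your reasoning when you draw a conclusion.",
    "Do not claim knowledge the character could not have.",
    "Reply in the language the user writes in.",
)


def _render(value) -> str:
    """Format a profile field as plain text lines."""
    if isinstance(value, dict):
        return "\n".join(f"- {k}: {v}" for k, v in value.items())
    if isinstance(value, (list, tuple)):
        return "\n".join(f"- {item}" for item in value)
    return str(value)


def build_system_prompt(profile: dict,
                        scenes: list[str],
                        history: list[tuple[str, str]]) -> str:
    """Combine profile, retrieved scenes, history, and rules into one prompt."""
    parts = [f"You are {profile['name']}. Answer as this character."]
    for key in PROFILE_SECTIONS:
        if profile.get(key):
            title = key.replace("_", " ").title()
            parts.append(f"[{title}]\n{_render(profile[key])}")
    if scenes:
        parts.append("[Relevant Scenes]\n" + "\n---\n".join(scenes))
    if history:
        lines = [f"User: {u}\n{profile['name']}: {r}" for u, r in history]
        parts.append("[Conversation So Far]\n" + "\n".join(lines))
    rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(BEHAVIOR_RULES, 1))
    parts.append("[Rules]\n" + rules)
    return "\n\n".join(parts)
```

Placing the rules last is a design choice of this sketch; where they sit in the real prompt is not stated in the text.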

4. Validation

The agent passed a 5-turn conversation test:

| Test | Input | Result |
| --- | --- | --- |
| RAG Accuracy | “Orange seed letter” | Accurate recall of the “Five Orange Pips” case |
| Conversation Memory | “K.K.K.” (continuation) | Maintained context from previous turn |
| Language Switching | Question about his reasoning method, asked in English | Character-consistent response in English |
| Emotional Expression | “Evaluate Watson” | Natural blend of praise, critique, and affection |
| Deep Question | “Do you ever feel lonely?” | Denial with loneliness between the lines |

Novel Contributions

Three features differentiate Avatar from existing character AI research (Character-LLM, RoleLLM, CoSER, etc.). Each can serve as an independent research contribution.

1. Chapter-wise Character Arc Tracking

Existing research treats characters as static profiles. Avatar models how characters evolve along the novel’s timeline.

agent.set_narrative_position(chapter=3)
# The agent converses with knowledge, emotions, and relationships up to chapter 3 only.
# Position before Story 1: "Irene Adler? I have never heard that name."
# Position after Story 1:  "The woman... the only one who ever outwitted me."

  • Generate character state snapshots per story
  • Add temporal axis to the knowledge graph (story number)
  • Limit RAG search scope by narrative_position
  • Spoiler prevention logic
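
Limiting retrieval by narrative position can be sketched as a filter applied before ranking. This is a hypothetical sketch: the `Chunk` type, the `search` function, and the word-overlap score (a stand-in for vector similarity) are assumptions. In ChromaDB the same restriction would presumably be expressed as a metadata filter on the query, which is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    story: int  # 1-based index of the story the chunk comes from


def search(chunks: list[Chunk], query: str,
           narrative_position: int, top_k: int = 3) -> list[Chunk]:
    """Rank chunks by shared words, ignoring anything past the position."""
    query_words = set(query.lower().split())
    # Spoiler prevention: later stories are never candidates.
    visible = [c for c in chunks if c.story <= narrative_position]
    scored = [(len(query_words & set(c.text.lower().split())), c)
              for c in visible]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

With a position of 0 the agent retrieves nothing about Irene Adler, which is what allows the “I have never heard that name” reply. Filtering before ranking, rather than after, keeps later-story text out of the prompt entirely.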

2. Second-order Theory of Mind

Holmes can reason “Watson would have thought this way.” The character infers other characters’ and the user’s beliefs. An unexplored area in existing research.

  • Model asymmetric knowledge/relationships between characters
  • “What Holmes thinks Watson’s perspective is” inference module
  • “What Holmes infers about the user’s knowledge level” adaptation module
  • ToM accuracy benchmark design
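
A small data structure can make the asymmetric-knowledge idea concrete. The sketch below is illustrative only: `BeliefModel`, `learn`, and `gap` are hypothetical names, and the fact strings in any usage are placeholders. It separates what each agent knows from what an observer believes another agent knows.

```python
from dataclasses import dataclass, field


@dataclass
class BeliefModel:
    # knows[agent]: facts that agent holds directly.
    knows: dict[str, set[str]] = field(default_factory=dict)
    # attributed[(observer, target)]: facts the observer thinks the target holds.
    attributed: dict[tuple[str, str], set[str]] = field(default_factory=dict)

    def learn(self, agent: str, fact: str, witnessed_by: tuple[str, ...] = ()) -> None:
        """Record a fact for `agent`; each witness now attributes it to them."""
        self.knows.setdefault(agent, set()).add(fact)
        for observer in witnessed_by:
            self.attributed.setdefault((observer, agent), set()).add(fact)

    def gap(self, observer: str, target: str) -> set[str]:
        """Facts the observer holds but believes the target lacks."""
        held = self.knows.get(observer, set())
        believed = self.attributed.get((observer, target), set())
        return held - believed
```

The result of `gap("Holmes", "user")` is the set of facts Holmes would need to explain rather than assume, which is one way the user-adaptation module could decide how much to spell out.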

3. Automatic Character Logic Extraction

Automatically extract behavioral rules from the novel text and convert them into executable code. An automation of the Codified Character Logic (2025) approach.

# Auto-extracted behavioral rules for Holmes
holmes_rules = {
    "when_presented_with_mystery": "ask_for_details_systematically",
    "when_complimented": "deflect_with_dry_humor",
    "when_bored": "express_restlessness_or_seek_stimulation",
    "when_watson_is_wrong": "correct_gently_but_show_reasoning",
}
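
Once rules exist in this form, each turn only needs to detect which triggers apply and pass the matching behaviors to the prompt. The sketch below assumes a simple keyword detector. `TRIGGER_CUES`, `detect_triggers`, and `rule_hints` are hypothetical, and a real detector would likely rely on the LLM rather than keywords.

```python
holmes_rules = {
    "when_presented_with_mystery": "ask_for_details_systematically",
    "when_complimented": "deflect_with_dry_humor",
    "when_bored": "express_restlessness_or_seek_stimulation",
    "when_watson_is_wrong": "correct_gently_but_show_reasoning",
}

# Hypothetical keyword cues, invented for this sketch.
TRIGGER_CUES = {
    "when_presented_with_mystery": ("mystery", "missing", "stolen"),
    "when_complimented": ("brilliant", "genius", "amazing"),
}


def detect_triggers(user_input: str) -> list[str]:
    """Return the triggers whose cue words appear in the input."""
    text = user_input.lower()
    return [trigger for trigger, cues in TRIGGER_CUES.items()
            if any(cue in text for cue in cues)]


def rule_hints(user_input: str, rules: dict[str, str]) -> list[str]:
    """Turn matching rules into readable hints for the system prompt."""
    return [
        f"{trigger.replace('_', ' ')} -> {rules[trigger].replace('_', ' ')}"
        for trigger in detect_triggers(user_input)
        if trigger in rules
    ]
```

For the input "You are brilliant, Mr. Holmes", `rule_hints` yields a single hint for the compliment rule, which can then be appended to the prompt for that turn.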

Character Candidates

14 characters selected from the public domain:

| Novel | Character | Traits |
| --- | --- | --- |
| The Adventures of Sherlock Holmes | Sherlock Holmes | Ultra-logical reasoning, dry wit (1st implementation) |
| Pride and Prejudice | Elizabeth Bennet | Sharp wit, social observation (2nd implementation) |
| Dracula | Count Dracula | Formal, threatening seductive speech |
| Frankenstein | The Creature | Philosophical monologue on identity |
| Crime and Punishment | Raskolnikov | Intense psychological interiority, moral anguish |
| The Brothers Karamazov | Ivan Karamazov | Literature’s greatest philosophical voice |
| Monte Cristo | Edmond Dantès | Self-reinvention, multiple identities |
| Don Quixote | Don Quixote | Idealism vs. reality |
| Alice in Wonderland | Mad Hatter | Logic-bending, absurdist philosophy |
| Huckleberry Finn | Huck Finn | Distinctive vernacular, moral struggle |
| Dr Jekyll & Mr Hyde | Jekyll/Hyde | Dual personality switching |
| Journey to the West | Sun Wukong | Trickster, East Asian cultural icon |
| The Tale of Genji | Hikaru Genji | World’s first novel, emotional complexity |
| The Great Gatsby | Jay Gatsby | Mysterious, ambitious, tragic (public domain 2021) |

Roadmap

Phase 1: MVP (Complete)

CLI-based agent for conversing with Sherlock Holmes.

  • Novel text download and parsing (12 stories, 2,187 dialogues)
  • Auto-extracted character profile (Big Five, 10 speech patterns, 8 values, 5 relationships)
  • RAG-based conversation (ChromaDB, 332 chunks)
  • CLI conversation interface (Rich UI)
  • Basic conversation memory (in-session, last 20 turns)
  • 5-turn conversation test passed (Korean/English)

Phase 2: Deepening + Elizabeth Bennet

graph LR
    A[Add Elizabeth Bennet<br/>Generalize System] --> B[Long-term Memory<br/>MemGPT-style]
    B --> C[BDI Module<br/>Belief-Desire-Intention]
    C --> D[Consistency<br/>Evaluation Foundation]

  • Elizabeth Bennet: Parse Pride and Prejudice, extract her profile, and add a character selection UI. Remove Holmes-specific hardcoding to generalize the system
  • Long-term Memory: MemGPT-style 2-tier memory — cross-session persistence, per-user relationship modeling (“Last time you mentioned…”)
  • BDI Module: Infer character’s beliefs/desires/intentions each turn and include in prompt. Based on CharacterBox (2024)
  • Consistency Reinforcement: Big Five consistency validation, knowledge boundary enforcement (RoleRAG), speech pattern verification (CharacterBench)

Phase 3: Research-grade Features

  • Chapter-wise character arc tracking — temporal axis on knowledge graph
  • Second-order Theory of Mind — inter-character belief inference module
  • Automatic character logic extraction — novel text → if-then behavioral rules
  • Reader progress awareness — spoiler prevention
  • Multi-axis evaluation framework — CharacterBench + VER (NAACL 2025) + human evaluation

Phase 4: Publication & Release

  • Paper writing (ACL / EMNLP / NeurIPS Workshop)
  • Open-source GitHub release
  • Streamlit interactive demo
  • Blog post & conference presentation

Research Background

18 papers surveyed across 7 areas, informing the architecture design.

| Area | Key Papers | Project Application |
| --- | --- | --- |
| Character Dialogue | CoSER (ICML 2025), OpenCharacter (2025) | Character experience reconstruction pipeline from novel text |
| Literary RAG | RoleRAG (2025), ComoRAG (2025) | Character knowledge boundary graph-based retrieval |
| Long-term Memory | MemGPT (2023), A-Mem (2025), Memory OS (EMNLP 2025) | 2-tier memory (episodic + semantic) |
| Personality Modeling | InCharacter (2024), BIG5-CHAT (ACL 2025) | Big Five auto-extraction & consistency evaluation |
| Cognitive Architecture | CharacterBox (2024), CoALA (2023) | BDI model, modular cognitive architecture |
| Evaluation | CharacterBench (2024), VER (NAACL 2025) | Multi-axis: knowledge, personality, emotion, speech |
| Emerging | Codified Character Logic (2025), Neeko (EMNLP 2024) | Auto behavioral rule extraction, multi-character switching |

Project Structure

_avatar/
├── src/
│   ├── main.py                 # CLI entry point
│   ├── ingestion/
│   │   ├── downloader.py       # Gutenberg downloader
│   │   ├── parser.py           # Story/dialogue parser
│   │   └── profile_extractor.py # Character profile extraction
│   ├── character/
│   │   ├── engine.py           # Cognitive engine (core)
│   │   ├── profile.py          # Profile loader
│   │   └── prompt_builder.py   # System prompt builder
│   ├── memory/
│   │   ├── conversation.py     # Conversation history
│   │   └── retriever.py        # ChromaDB RAG
│   └── evaluation/             # Evaluation framework (Phase 2+)
├── data/
│   ├── raw/                    # Original novel text
│   ├── processed/              # Parsed structured data
│   ├── profiles/               # Character profile JSON
│   └── vectordb/               # ChromaDB vector store
└── docs/                       # Research · Architecture · Roadmap