Deep Reading for Real Meaning

Deep Reading for Real Meaning is a high-precision knowledge extraction system designed to process large, complex corpora of text, mapping internal structures and verifying definitions against the outside world to produce structured, verified knowledge assets.

The naive approach to processing large amounts of text with Large Language Models (LLMs) is to dump document batches directly into a prompt and ask for a summary.

Instead, I built a system to systematically deep read documents. I break the documents into overlapping chunks, extract key terms and relationships piece-by-piece, link them into a cohesive network map, and verify the internal meanings against external definitions. This ensures every extracted fact is anchored directly to source evidence.
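The overlapping-chunk step can be sketched in a few lines of plain Python. This is a minimal illustration, not the LlamaIndex splitter the pipeline uses: the function name `chunk_text`, the character-based windows, and the default sizes are all assumptions chosen for clarity. It returns each chunk with its start offset, so anything extracted later can be traced back to a position in the source.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[tuple[int, str]]:
    """Split text into overlapping character windows.

    Returns (start_offset, chunk) pairs so extracted facts can be anchored
    to a location in the original text. Sizes are illustrative defaults.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for offset, chunk in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(offset, chunk)
```

The overlap means a definition that straddles a chunk boundary still appears whole in at least one window, at the cost of processing some text twice.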

Source-Anchored Knowledge Mapping: Extracting granular chunks into an interconnected, verified network graph.

The hallucination that got lawyers sanctioned

LLMs generate convincing text, but they fabricate. The larger the text block, the more likely the model is to hallucinate. A confident wrong answer is worse than no answer at all.

In 2023, a pair of New York attorneys ran into this when they trusted ChatGPT to do their legal research. The model did not just get it wrong. It fabricated entire cases, complete with docket numbers, judicial quotes, and citations that never existed.

The result was disastrous. The judge found that the lawyers had acted in "subjective bad faith" and fined them thousands of dollars. The story became a global warning about the dangers of blind trust in these tools.

When I rely on an LLM to parse the noise, I risk getting answers from a model that prioritizes sounding right over being right.

AI Hallucination vs. Fact Anchoring: Replacing ungrounded model outputs with strict source citations.

Emic vs. Etic: Two Dimensions of Meaning

True precision reading requires a system that maps the internal vocabulary and structure of your documents before cross-referencing them with external standards. The goal is structural comprehension, not simple summarization.

This approach distinguishes between two distinct perspectives of truth: the emic (insider) meaning and the etic (outsider) meaning. The pipeline first maps the emic truth: how the corpus defines terms and concepts strictly on its own terms, from the inside. Only after capturing this internal vocabulary does it layer in the etic truth: the standard definitions and meanings established by the outside world. Contrasting these two perspectives exposes critical conceptual divergences, highlighting exactly where the documents deviate from public consensus.
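One way to picture the emic/etic contrast is as a single record per term that holds both definitions side by side. The sketch below is an assumption about shape, not the pipeline's actual schema: `TermRecord`, its field names, and `needs_review` are hypothetical, and the sample values are invented placeholders. The four divergence labels are the ones the pipeline uses (Aligned, Minor, Significant, Contradicted).

```python
from dataclasses import dataclass
from enum import Enum


class Divergence(Enum):
    ALIGNED = 0
    MINOR = 1
    SIGNIFICANT = 2
    CONTRADICTED = 3


@dataclass(frozen=True)
class TermRecord:
    term: str
    emic_definition: str  # how the corpus itself defines the term
    emic_citation: str    # locator of the passage in the source corpus
    etic_definition: str  # definition found outside the corpus
    etic_source: str      # where the external definition came from
    divergence: Divergence


def needs_review(records, threshold=Divergence.SIGNIFICANT):
    """Return records at or above the divergence threshold, worst first."""
    flagged = [r for r in records if r.divergence.value >= threshold.value]
    return sorted(flagged, key=lambda r: r.divergence.value, reverse=True)


if __name__ == "__main__":
    # Placeholder data, invented purely for illustration.
    records = [
        TermRecord("term-a", "corpus meaning A", "doc1#p3", "public meaning A", "source-1", Divergence.ALIGNED),
        TermRecord("term-b", "corpus meaning B", "doc2#p9", "public meaning B", "source-2", Divergence.CONTRADICTED),
    ]
    for record in needs_review(records):
        print(record.term, record.divergence.name)
```

Keeping the citation next to each definition is what lets a reader check a flagged divergence against the original passage instead of trusting the label.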

Emic vs. Etic: Mapping the internal vocabulary of your corpus and cross-referencing it with public definitions to isolate conceptual divergence.

The Processing Pipeline

Deep Reading is built on top of LlamaIndex, GraphRAG, and Ollama. Each handles the layer it is best suited for. The pipeline adds the coordination, multi-pass extraction, and emic/etic verification that none of them provide on their own.

Layer 1: Ingestion
Converts raw inputs (webpages, media, files, transcripts) into clean, standardized text. LlamaIndex handles chunking and overlap, turning unstructured sources into consistent passages ready for extraction.
Layer 2: Extraction & Merge
Each chunk is passed to a local Ollama model in sequence. Dedicated passes extract technical terms, definitions, and relationships strictly at face value. A merge pass then reconciles duplicates and flags contradictions across the full corpus.
Layer 3: Graph & Community Detection
The merged terms and relationships are fed into GraphRAG, which builds a directed graph and runs community detection to group terms that share argumentative context. Ollama then summarizes each community and flags circular assumptions, filling the gap GraphRAG leaves at the summarization layer.
Layer 4: External Verification
Internal definitions can drift from accepted meaning. For each term, the system searches the web, fetches the results, and passes both the emic (internal) meaning and the etic (external) sources to Ollama. The model then classifies the term's divergence on a scale from fully aligned to contradicted. This layer has no equivalent in LlamaIndex or GraphRAG.
Layer 5: Export & Synthesis
Assembles the verified terms, community summaries, and relationship graph into final outputs: a glossary of verified terms, logical cause-and-effect chains, and synthetic Q&A pairs, all grounded strictly in the source text.
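The merge pass in Layer 2 can be illustrated with a small, model-free sketch. It assumes each per-chunk extraction is a dict with the hypothetical keys `term`, `definition`, and `chunk_id`. Comparing definitions as normalized strings is a crude stand-in for the model-based reconciliation described above, so the `conflict` flag here only means "more than one wording was seen", not "these definitions contradict each other".

```python
from collections import defaultdict


def merge_extractions(extractions):
    """Merge per-chunk extractions into one entry per normalized term."""
    merged = defaultdict(lambda: {"definitions": {}, "chunks": []})
    for item in extractions:
        key = " ".join(item["term"].lower().split())  # normalize case and spacing
        definition = " ".join(item["definition"].split())
        entry = merged[key]
        # Keep each distinct wording once, with the chunk where it first appeared.
        entry["definitions"].setdefault(definition, item["chunk_id"])
        entry["chunks"].append(item["chunk_id"])
    return {
        term: {
            "definitions": entry["definitions"],
            "chunks": entry["chunks"],
            # Flag for review only; nothing is resolved or discarded here.
            "conflict": len(entry["definitions"]) > 1,
        }
        for term, entry in merged.items()
    }


if __name__ == "__main__":
    # Invented sample extractions for illustration.
    sample = [
        {"term": "Net Margin", "definition": "Profit divided by revenue.", "chunk_id": 0},
        {"term": "net  margin", "definition": "Profit divided by revenue.", "chunk_id": 1},
        {"term": "Net margin", "definition": "Revenue minus costs.", "chunk_id": 4},
    ]
    print(merge_extractions(sample))
```

Every definition keeps the chunk it came from, so a flagged term can be inspected against its source passages before anything is overwritten.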

The Processing Pipeline: From universal ingestion to final structural export and verified Q&A generation.
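Flagging circular assumptions in Layer 3 comes down to finding cycles in a directed graph. The sketch below shows the idea with a depth-first search over plain `(source, target)` edge tuples. It is not GraphRAG's API, and the function name `find_cycle` and the edge format are assumptions made for illustration.

```python
def find_cycle(edges):
    """Return one directed cycle as a list of nodes, or None if there is none."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                # Back edge: the cycle is the tail of the current path.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    # "A depends on B, B on C, C back on A" is a circular chain of assumptions.
    print(find_cycle([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]))
```

The recursive search is fine for a glossary-sized graph. A very large graph would need an iterative version to stay under Python's recursion limit.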

Zero Variable Cost for Multi-Pass Audits

This pipeline runs a lot of LLM calls. Every document is sliced into overlapping chunks, passed through multiple models independently, checked for contradictions, and verified against external sources.

The query count adds up fast, and on a metered API that translates directly into cost. Running three models over a thousand pages of text could mean a substantial API bill on every run.
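A rough call count makes the scale concrete. The sketch below counts only the per-chunk extraction stage, and every number in the usage example is an illustrative assumption: about 3,000 characters per page, 4,000-character chunks with 400 characters of overlap, three extraction passes (terms, definitions, relationships), and three models. The function name `estimate_llm_calls` is hypothetical.

```python
def estimate_llm_calls(total_chars, chunk_size, overlap, passes, models):
    """Estimate LLM calls for per-chunk extraction: chunks x passes x models."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    if total_chars <= 0:
        chunks = 0
    elif total_chars <= chunk_size:
        chunks = 1
    else:
        remaining = total_chars - chunk_size
        chunks = 1 + -(-remaining // step)  # integer ceiling division
    return chunks * passes * models


if __name__ == "__main__":
    # Assumed: 1,000 pages at roughly 3,000 characters per page.
    calls = estimate_llm_calls(
        total_chars=1_000 * 3_000,
        chunk_size=4_000,
        overlap=400,
        passes=3,
        models=3,
    )
    print(calls)  # 7506 under these assumptions
```

Merge, community summaries, and external verification add further calls on top of this figure, so it is a lower bound for a full run.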

Because everything runs locally on my own hardware, none of that computation carries a per-call cost. I can rerun the full pipeline, swap models, or audit a different corpus without watching a billing meter. The data never leaves my machine either.

Zero Variable Cost: Processing dense verification passes locally bypasses cloud API expenses completely.

Use Cases and Exports

I shipped the first version on QuantGreenBook.com: defined terms, question-and-answer pairs, and causal axioms from finance study text.

Layer 5 (exporting) renders the verified graph to whatever format you need. You choose the target; there is no default export. Here are some examples:

Case law: every claim tied to a cited passage
Research literature: definition drift flagged across studies
Regulatory filings: policy and definition changes tracked across versions
Course outlines, decision records, flashcards: same graph, different export shape
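As an illustration of one export shape, the sketch below renders verified terms as a CSV glossary and refuses to write any row that lacks a source citation. The function name `export_glossary_csv`, the dict keys, and the column layout are assumptions, and the sample row is invented. The real export format is whatever you choose.

```python
import csv
import io


def export_glossary_csv(terms):
    """Render verified terms as CSV text, rejecting rows without a citation."""
    missing = [t["term"] for t in terms if not (t.get("citation") or "").strip()]
    if missing:
        raise ValueError(f"terms without a source citation: {missing}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["term", "definition", "citation"])
    for t in terms:
        writer.writerow([t["term"], t["definition"], t["citation"]])
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented sample row for illustration.
    rows = [{"term": "churn", "definition": "Customers lost per period.", "citation": "doc1.txt#chunk-2"}]
    print(export_glossary_csv(rows), end="")
```

Failing loudly on a missing citation keeps the grounding rule enforced at the last step, so an ungrounded claim cannot slip into the output unnoticed.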

Get Started Prompt

# Deep Reading for Real Meaning: Architecture Prompt

You are working on **Deep Reading for Real Meaning**, a domain-agnostic
knowledge extraction pipeline that turns any large document corpus into a
verified, queryable knowledge graph.

## Start here: ask the user

Open by asking what they want. Do not assume a format or domain.

1. **Use case:** What corpus? Who reads the output? What will they do with it?
2. **Export:** What file or structure should Layer 5 produce? If they have not
decided, describe the verified graph and offer examples from the list below.
Let them pick, combine, or define something new.

If answers are vague, ask follow-up questions. Do not default to Q&A, axioms,
or any example format without explicit confirmation.

## The problem

Feeding documents directly into an LLM produces hallucinations: fabricated
citations, invented definitions, confident nonsense. The model fails because
it has no map of the domain. It treats every sentence with equal weight,
doesn't know which terms carry specialized meaning, and can't detect when
definitions shift across documents or contradict external standards.

## The core idea: emic before etic

The pipeline reads in two passes.

**Emic:** the internal truth. What do these documents actually say?
Which terms do they define, and how? What relationships do they assert?
This is extracted strictly on face value, anchored to cited passages.

**Etic:** the external truth. How does the rest of the world define
these same terms? Where does the corpus align with established standards,
and where does it diverge or contradict them?

The gap between emic and etic is the signal. It surfaces wherever a
document uses language in a specialized, shifted, or misleading way,
regardless of domain.

## Pipeline

```
[ any source: files · URLs · video · search ]
           │
           ▼
┌─────────────────────┐
│       INGEST        │  normalize any source to clean text
└──────────┬──────────┘  no LLM (deterministic)
           ▼
┌─────────────────────┐
│   EXTRACT & MERGE   │  pull terms, definitions, relationships
└──────────┬──────────┘  per chunk · deduplicate · flag contradictions
           ▼
┌─────────────────────┐
│        GRAPH        │  build directed knowledge graph
└──────────┬──────────┘  detect communities · causal chains · cycles
           ▼
┌─────────────────────┐
│       VERIFY        │  compare internal vs external definitions
└──────────┬──────────┘  classify divergence per term
           │
    ┌──────┴───────────────────────┐
    │ canonical output lives here  │
    │ verified/     ← definitions  │
    │ communities/  ← structure    │
    │ graph/edges   ← relationships│
    └──────┬───────────────────────┘
           ▼
┌─────────────────────┐
│       EXPORT        │  render the graph for your use case
└─────────────────────┘  (see: what do you need?)
```

Every layer writes inspectable files to disk. Any layer can be run
individually, its output edited, tested, and the pipeline resumed.
Adding new documents merges into the existing graph without full reprocessing.

## The canonical output

The real product is produced at the Verify layer: a **verified knowledge
graph** where every node is a term with:

- its internal meaning (cited from the corpus)
- its external standard definition (sourced from the web)
- a divergence classification (Aligned / Minor / Significant / Contradicted)
- its relationships to other terms
- its community (the argumentative cluster it belongs to)

This graph is domain-agnostic. What you do with it depends on your use case.

## Export: guardrails and examples

Layer 5 adapts the verified graph to a user-chosen format. There is no
built-in default.

### Guardrails (always apply)

- Every exported claim must cite a passage from the source corpus.
- Do not skip or shortcut the Verify layer to speed up exports.
- If the user names a custom format, agree on schema first: fields, file
format, and grounding rules. Then implement.
- Prefer reading from `verified/`, `communities/`, and `graph/edges/` over
re-prompting the raw corpus.

### Examples (inspiration only, not presets)

These are formats that have come up in practice. The user picks one,
several, none of these, or something different.

**Finance / QuantGreenBook:** Q&A pairs; axioms (causal statements);
concept dependency maps; Anki CSV flashcards; divergence reports (source vs.
consensus).

**Law:** holdings per case; IRAC issue trees; precedent citation chains;
statutory "if X then Y" rules; counterargument maps per doctrine.

**Academic / research:** annotated bibliographies; hypothesis trees;
literature gap reports; debate cards (claim + warrant + evidence).

**Instructional / curriculum:** Bloom-tagged learning objectives;
difficulty-tagged assessment banks; scope and sequence outlines.

**Engineering / technical:** ADR-style decision records; constraint
axioms; failure mode lists.

**Generic:** human-readable reports; agent tool objects (`corpus.term(name)`);
RAG chunk corpora; JSONL fine-tuning sets; graph DB dumps; convergence audits.

Exports can mix **rules** (statements, holdings), **structure** (trees, maps),
and **evaluation** (Q&A, flashcards) in whatever proportion the user specifies.

## Core constraints

- **Source-anchored.** Every claim traces to a cited passage. The graph
makes hallucinations visible, not invisible.
- **Local inference.** All LLM processing runs on-device. No data leaves
the machine. No per-token cost.
- **Multi-model consensus.** Multiple models process each LLM stage
independently and merge results. Safety-first for contradictions,
most-complete for synthesis.
- **Calibrated output.** Every scoring step uses anchor-based two-pass
calibration to prevent flat, uncalibrated judgments.
- **Idempotent.** Re-running any layer against unchanged input skips
already-processed work. Interrupted runs resume mid-corpus.
- **Inspectable.** No hidden state. Every intermediate artifact is a
readable file. Any layer's output can be substituted by hand.

All views are my own, and I do not represent any employer. All ownership of attached open source code is waived.