
RAG Architecture for Legal Documents

Retrieval-Augmented Generation is the AI architecture that makes document intelligence practical for law firms. Here's how it works, why it's better than fine-tuning, and what privacy-first RAG looks like.

The Problem RAG Solves

Language models are trained on vast datasets and develop impressive general knowledge, but they have no knowledge of your specific documents. Ask GPT-4 about a contract you signed last month, and it knows nothing about it. The model’s knowledge has a training cutoff and never included your files to begin with.

The naive solution, fine-tuning a model on your documents, has significant problems:

  • Cost: Fine-tuning large models requires substantial compute
  • Staleness: New documents require re-training
  • Hallucination risk: Fine-tuned models can “remember” training data in unreliable ways
  • Privacy: Your documents must be uploaded to a training infrastructure you may not control
  • No citations: Fine-tuned models generate text, not references to source documents

Retrieval-Augmented Generation (RAG) addresses each of these problems without modifying the underlying model: your documents live in a searchable index rather than in the model’s weights.

How RAG Works

RAG combines two systems: a retrieval system that finds relevant document passages, and a generation system (the language model) that synthesizes those passages into a coherent answer.

Step 1: Ingestion

Documents are processed into a searchable index:

  1. Each document is split into chunks (typically 512–1024 tokens with overlap)
  2. Each chunk is converted to a vector embedding (a numerical representation of its semantic meaning) using an embedding model
  3. The vectors are stored in a vector database alongside the original text chunks

The embedding model is the key: it converts text into numbers in a way that preserves meaning. Semantically similar passages produce similar vectors, regardless of exact wording. “Attorney-client privilege waiver” and “disclosure to third parties may void privilege” will have similar embeddings even though they share only a single word.
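The chunking step in item 1 can be sketched as a sliding window over tokens. This is a minimal illustration, not the pipeline described in this article: `chunk_tokens` and its parameters are hypothetical names, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
words = "The Seller shall indemnify the Buyer against all losses arising from breach".split()
for chunk in chunk_tokens(words, chunk_size=5, overlap=2):
    print(" ".join(chunk))
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of indexing some text twice.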

Step 2: Retrieval

When a user asks a question:

  1. The question is converted to a vector using the same embedding model
  2. The vector database finds the chunks whose vectors are most similar to the question vector (nearest-neighbor search)
  3. The top K chunks (typically 3–10) are retrieved along with their source document references

This is semantic search: finding meaning, not keywords.
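To make the nearest-neighbor step concrete, here is a brute-force sketch that ranks chunks by cosine similarity. The function names, the chunk identifiers, and the three-dimensional toy vectors are all illustrative assumptions; real embeddings have hundreds of dimensions, and a production vector database would typically replace the linear scan with an approximate index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=3):
    """index is a list of (chunk_id, vector); returns the k best (chunk_id, score) pairs."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy index: three chunks with made-up 3-dimensional embeddings.
index = [
    ("merger_agreement.pdf#p47", [0.9, 0.1, 0.0]),
    ("merger_agreement.pdf#p12", [0.1, 0.9, 0.1]),
    ("nda.pdf#p3", [0.0, 0.2, 0.9]),
]
query = [0.2, 0.8, 0.1]  # stands in for the embedded question
for chunk_id, score in top_k(query, index, k=2):
    print(f"{chunk_id}: {score:.3f}")
```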

Step 3: Generation

The retrieved chunks are assembled into a context window and provided to the language model along with the user’s question:

System: You are a document analysis assistant. Answer questions based only on the provided context.
Always cite the source document and page number for each claim.

Context:
[Chunk 1: Contract clause about liability (Source: merger_agreement.pdf, p. 47)]
[Chunk 2: Definition of "Material Adverse Effect" (Source: merger_agreement.pdf, p. 12)]
...

User: What triggers the material adverse effect clause?

The model generates an answer grounded in the retrieved chunks, and because we instruct it to cite sources, each claim in the answer can be traced back to a specific document and location.
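Assembling that prompt is plain string construction. The sketch below mirrors the template above; `RetrievedChunk` and `build_prompt` are hypothetical names introduced for illustration, and the exact bracket format is an assumption rather than a fixed convention.

```python
from dataclasses import dataclass

SYSTEM_PROMPT = (
    "You are a document analysis assistant. "
    "Answer questions based only on the provided context.\n"
    "Always cite the source document and page number for each claim."
)


@dataclass
class RetrievedChunk:
    text: str
    source: str
    page: int


def build_prompt(question, chunks):
    """Lay out system instructions, numbered context chunks, and the user question."""
    context_lines = [
        f"[Chunk {i}: {c.text} (Source: {c.source}, p. {c.page})]"
        for i, c in enumerate(chunks, start=1)
    ]
    return "\n".join([
        f"System: {SYSTEM_PROMPT}",
        "",
        "Context:",
        *context_lines,
        "",
        f"User: {question}",
    ])


chunks = [
    RetrievedChunk("Contract clause about liability", "merger_agreement.pdf", 47),
    RetrievedChunk('Definition of "Material Adverse Effect"', "merger_agreement.pdf", 12),
]
print(build_prompt("What triggers the material adverse effect clause?", chunks))
```

Keeping the source and page next to each chunk in the prompt is what gives the model something concrete to cite.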

Why This Works Better Than Fine-Tuning

Aspect               Fine-Tuning                  RAG
New documents        Requires re-training         Instant (re-ingest)
Citations            Not supported                Native
Hallucination risk   Higher (model “remembers”)   Lower (grounded in retrieved text)
Cost per update      High                         Low
Explainability       Black box                    Traceable to source
Privacy              Documents in training data   Documents in retrieval index only

Legal documents present specific challenges that general RAG implementations handle poorly:

Challenge 1: Document Structure

Legal documents have hierarchical structure (sections, subsections, clauses, exhibits) and cross-references (“see Section 4.2(a)(iii)”). Naive chunking by character count breaks logical units and loses context.

Good legal RAG uses structure-aware chunking: splitting at clause boundaries rather than arbitrary character counts, preserving section headers, and maintaining parent context for nested clauses.
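As a rough sketch of structure-aware chunking, the code below splits at numbered clause headings and attaches the parent heading path to every chunk. The heading pattern (a clause number followed by a capitalized title) is an assumption; real contracts vary widely and would need a more careful parser. `chunk_by_clause` is a hypothetical name.

```python
import re

# Assumed heading shape: "4 Indemnification", "4.1 Seller Indemnity", "4. Title".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+([A-Z].*)$")


def chunk_by_clause(text):
    """Split at numbered clause headings; each chunk keeps its heading path."""
    chunks = []
    headings = {}  # clause number -> heading title seen so far
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADING.match(line)
        if match:
            number, title = match.groups()
            headings[number] = title
            parts = number.split(".")
            # For clause 4.2 the prefixes are "4" and "4.2".
            prefixes = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
            path = [f"{p} {headings[p]}" for p in prefixes if p in headings]
            current = {"clause": number, "path": path, "body": []}
            chunks.append(current)
        elif line:
            if current is None:
                # Text before the first heading goes into a preamble chunk.
                current = {"clause": "", "path": [], "body": []}
                chunks.append(current)
            current["body"].append(line)
    return chunks


sample = """\
4 Indemnification
4.1 Seller Indemnity
The Seller shall indemnify the Buyer against all Losses.
4.2 Limitations
Liability under Section 4.1 is capped at the Purchase Price.
"""
for chunk in chunk_by_clause(sample):
    print(" > ".join(chunk["path"]), "|", " ".join(chunk["body"]))
```

Because each chunk carries its path, a retrieved clause 4.2 still tells the model it sits under the indemnification section, which a fixed character window would lose.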

Challenge 2: Scanned Documents

Many legal documents exist as scanned PDFs: contracts executed on paper, court filings, older case materials. These require OCR (Optical Character Recognition) before text extraction.

Standard OCR produces text but loses layout information: which column a figure appears in, where annotations are, what a signature block looks like. For legal purposes, this matters. A clause’s position in a contract can affect its interpretation.

The Tacitus approach uses dual-payload OCR: extracting both the text layer (for semantic processing) and a visual layer (preserving the document’s visual structure for citation display). When the system cites a source, it can show you the exact location on the original page.

Challenge 3: Cross-Document Reasoning

Legal matters rarely involve a single document. A due diligence exercise might require reasoning across hundreds of contracts simultaneously: finding all indemnification clauses, comparing representations and warranties, identifying missing standard provisions.

This requires not just document-level retrieval but cross-document synthesis: the ability to retrieve and compare relevant clauses from many documents in a single query. The vector database must be designed to support filtering (by document type, date, counterparty) and ranking (by relevance) simultaneously.
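The combination of filtering and ranking can be sketched in plain Python. Everything here is illustrative: the `IndexedChunk` fields, the `search` and `best_per_document` helpers, and the toy vectors are assumptions, and the vectors are assumed to be unit-normalized so that a dot product equals cosine similarity. Vector databases such as Qdrant support metadata filters natively; this sketch deliberately avoids any client API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class IndexedChunk:
    doc_id: str
    doc_type: str
    counterparty: str
    signed: date
    vector: list  # assumed unit-normalized
    text: str


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(index, query_vec, doc_type=None, counterparty=None, signed_after=None, k=5):
    """Apply metadata filters first, then rank the survivors by similarity."""
    candidates = [
        c for c in index
        if (doc_type is None or c.doc_type == doc_type)
        and (counterparty is None or c.counterparty == counterparty)
        and (signed_after is None or c.signed > signed_after)
    ]
    ranked = sorted(candidates, key=lambda c: dot(query_vec, c.vector), reverse=True)
    return ranked[:k]


def best_per_document(ranked):
    """Keep the highest-ranked chunk from each document for side-by-side comparison."""
    seen = {}
    for chunk in ranked:
        seen.setdefault(chunk.doc_id, chunk)
    return list(seen.values())


# Hypothetical corpus: two contracts and one internal memo.
index = [
    IndexedChunk("msa_acme.pdf", "contract", "Acme", date(2023, 5, 1), [1.0, 0.0], "Supplier shall indemnify..."),
    IndexedChunk("msa_acme.pdf", "contract", "Acme", date(2023, 5, 1), [0.6, 0.8], "Governing law..."),
    IndexedChunk("nda_globex.pdf", "contract", "Globex", date(2021, 2, 10), [0.8, 0.6], "Each party shall indemnify..."),
    IndexedChunk("memo_2022.pdf", "memo", "Acme", date(2022, 1, 5), [1.0, 0.0], "Internal memo on indemnity..."),
]
hits = search(index, [1.0, 0.0], doc_type="contract")
for chunk in best_per_document(hits):
    print(chunk.doc_id, "|", chunk.text)
```

Collapsing results to one chunk per document is one simple way to compare the same clause type across many contracts instead of returning several passages from a single long agreement.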

Challenge 4: Confidentiality

This is where most commercial RAG implementations fail for legal use cases. To use a cloud RAG service, your documents must be:

  • Uploaded to the provider’s infrastructure
  • Indexed on the provider’s servers
  • Processed through the provider’s embedding models
  • Stored in the provider’s vector database

At each step, your privileged client documents are in contact with infrastructure you don’t control. The same privilege risk that applies to cloud AI generally applies to cloud RAG specifically.

Privacy-First RAG Architecture

The Tacitus implementation addresses confidentiality at the architectural level:

┌──────────────────────────────────────────┐
│       Tacitus Cortex (on-premises)       │
│                                          │
│  ┌──────────┐    ┌───────────────────┐   │
│  │ Document │───▶│  Dual-Payload OCR │   │
│  │  Intake  │    │  (text + visual)  │   │
│  └──────────┘    └────────┬──────────┘   │
│                           │              │
│                  ┌────────▼──────────┐   │
│                  │  Chunking Engine  │   │
│                  │ (structure-aware) │   │
│                  └────────┬──────────┘   │
│                           │              │
│              ┌────────────▼──────────┐   │
│              │   Embedding Model     │   │
│              │   (runs locally)      │   │
│              └────────────┬──────────┘   │
│                           │              │
│              ┌────────────▼──────────┐   │
│              │    Qdrant Vector DB   │   │
│              │    (on-premises)      │   │
│              └────────────┬──────────┘   │
│                           │              │
│  ┌──────────┐    ┌────────▼──────────┐   │
│  │  Query   │───▶│    Local LLM      │   │
│  │ Interface│    │  (Mistral/Llama)  │   │
│  └──────────┘    └───────────────────┘   │
└──────────────────────────────────────────┘
         No data leaves this boundary

Every component runs on the local appliance:

  • Embedding model: A quantized sentence transformer running on the local GPU
  • Vector database: Qdrant, an open-source vector database with no cloud dependencies
  • Language model: A quantized open-weight LLM (typically Mistral or Llama variants) running on the dedicated GPU
  • Citation engine: Maps model outputs back to source document locations

No API calls. No cloud embeddings. No external model inference. The entire pipeline runs on hardware under your physical control.

What This Means in Practice

For a law firm, privacy-first RAG means:

  • Document upload = stays local: Files are processed on the appliance and never leave your network
  • Queries = no external API calls: Questions go to the local model, not OpenAI or Anthropic
  • Answers = grounded and cited: Every response references the specific document and page that supports it
  • Audit trail = local logs: A complete record of who asked what and what was retrieved, stored on your systems
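A local audit trail can be as simple as an append-only file with one JSON object per line. The sketch below is an assumption about what such a record might contain, based on the "who asked what and what was retrieved" description above; the field names, the helper names, and the example user are hypothetical.

```python
import json
from datetime import datetime, timezone


def audit_record(user, question, retrieved):
    """Build one JSON-serializable audit entry for a query."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "question": question,
        "retrieved": [{"source": source, "page": page} for source, page in retrieved],
    }


def append_audit(path, record):
    # One JSON object per line keeps the log append-only and easy to search.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


record = audit_record(
    "associate@example.com",  # hypothetical user identifier
    "What triggers the material adverse effect clause?",
    [("merger_agreement.pdf", 47), ("merger_agreement.pdf", 12)],
)
print(json.dumps(record, indent=2))
```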

The AI operates like a very well-read associate who has read everything in your case files and can find the relevant passages instantly, but who is employed by you, works in your office, and cannot share anything they’ve read.

Getting Started

RAG is not a single product but an architecture. Implementing it well for legal documents requires careful attention to chunking strategy, embedding model selection, retrieval tuning, and prompt engineering. The privacy requirements add another layer of constraints that rule out most commercial offerings.

If you’re evaluating AI for document intelligence and want to understand what a privacy-first deployment looks like for your specific use case, a technical briefing is a good starting point.


The Tacitus Cloud Bridge is the fastest path to production-ready legal RAG without hardware investment. Request a trial to evaluate it against your document corpus.
