
RAG Applications with MCP: Vector DB + Document Servers

Building Retrieval-Augmented Generation (RAG) pipelines with MCP — connecting vector databases, document servers, and knowledge bases to AI applications.

Updated February 25, 2026
By MCP Server Spot

Retrieval-Augmented Generation (RAG) is one of the most impactful applications of MCP. By connecting vector databases, document servers, and knowledge bases through the Model Context Protocol, you can build AI applications that answer questions grounded in your actual data -- not just the model's training knowledge. MCP makes RAG dramatically simpler: instead of building custom retrieval pipelines, you connect MCP servers and the AI handles the rest.

This guide covers how to build RAG applications with MCP, from basic setups to production-grade architectures.

How RAG Works with MCP

Traditional RAG requires custom code for each step: loading documents, chunking them, generating embeddings, storing them in a vector database, and retrieving them at query time. MCP simplifies this by providing standardized interfaces for each component.

The MCP RAG Architecture

User Query: "What is our company's refund policy?"
     │
     ▼
┌─────────────────────────────────────────────────┐
│                  AI Client (Claude)              │
│                                                  │
│  1. Reformulate query for retrieval              │
│  2. Search vector DB for relevant chunks         │
│  3. Optionally query SQL DB for structured data  │
│  4. Read source documents for full context       │
│  5. Generate grounded answer with citations      │
│                                                  │
└──────┬──────────────┬──────────────┬────────────┘
       │              │              │
  ┌────▼────┐   ┌─────▼─────┐  ┌────▼─────┐
  │ Vector  │   │ Document  │  │ Database │
  │ DB MCP  │   │ Server    │  │ MCP      │
  │ Server  │   │ (Files)   │  │ Server   │
  │         │   │           │  │          │
  │ Chroma  │   │ Filesystem│  │ Postgres │
  │ Pinecone│   │ Google Dr │  │ MySQL    │
  │ Qdrant  │   │ Notion    │  │ MongoDB  │
  └─────────┘   └───────────┘  └──────────┘

Basic RAG Flow

  1. User asks a question
  2. AI generates a search query from the user's question
  3. Vector DB MCP server performs similarity search, returning the most relevant document chunks
  4. AI reads the retrieved chunks and identifies the best answer
  5. AI generates a response grounded in the retrieved context, with citations

Setting Up a Basic RAG System

Step 1: Choose Your Vector Database

Select a vector database based on your needs:

| Database | Best For | MCP Server |
| --- | --- | --- |
| Chroma | Local development, prototyping | mcp-server-chroma |
| Pinecone | Production SaaS, managed hosting | mcp-server-pinecone |
| Qdrant | High-performance, self-hosted | mcp-server-qdrant |
| Weaviate | Multi-modal, schema-rich data | mcp-server-weaviate |

For detailed comparisons, see our Database & Vector DB MCP Servers guide.

Step 2: Configure Your MCP Servers

A basic RAG setup with Chroma and the filesystem:

{
  "mcpServers": {
    "chroma": {
      "command": "npx",
      "args": ["-y", "mcp-server-chroma"],
      "env": {
        "CHROMA_HOST": "localhost",
        "CHROMA_PORT": "8000"
      }
    },
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/path/to/knowledge-base"
      ]
    }
  }
}

Step 3: Ingest Documents

Document ingestion prepares your knowledge base for retrieval:

User: "Index all the markdown files in the docs directory
       into the knowledge-base Chroma collection"

Claude's workflow:
1. (Filesystem) list_directory("/docs") → find all .md files
2. (Filesystem) read_file() for each → get content
3. For each document:
   a. Split into chunks (500-token passages with overlap)
   b. (Chroma) add_documents(collection="knowledge-base",
      documents=[chunk_texts],
      metadatas=[{source, title, section}],
      ids=[unique_chunk_ids])
4. Report: "Indexed 150 chunks from 23 documents"
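
The chunking and id-generation side of this workflow can be scripted rather than done ad hoc. A minimal sketch, with word counts as a rough stand-in for tokenizer-based 500-token chunking; `build_records` and its metadata fields are illustrative, not part of any MCP server API:

```python
import hashlib

def chunk_words(text, max_words=350, overlap=50):
    """Split text into overlapping word-count chunks (a rough stand-in
    for tokenizer-based 500-token chunking)."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # back up to create overlap
    return chunks

def build_records(documents):
    """documents: {source_path: text}. Returns parallel (ids, texts,
    metadatas) lists shaped for a vector DB add_documents call."""
    ids, texts, metadatas = [], [], []
    for source, text in documents.items():
        for i, chunk in enumerate(chunk_words(text)):
            # Stable, unique id derived from source path and chunk index
            uid = hashlib.sha256(f"{source}:{i}".encode()).hexdigest()[:16]
            ids.append(uid)
            texts.append(chunk)
            metadatas.append({"source": source, "chunk_index": i})
    return ids, texts, metadatas
```

The resulting parallel lists map directly onto a Chroma-style add_documents tool call (documents, metadatas, ids).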

Step 4: Query the Knowledge Base

Once indexed, the AI can answer questions from your data:

User: "What are the system requirements for our enterprise plan?"

Claude's workflow:
1. (Chroma) query(
     collection="knowledge-base",
     query_text="system requirements enterprise plan",
     n_results=5
   ) → retrieve relevant chunks
2. Read the top 5 results with their metadata
3. Synthesize an answer citing the source documents:

Answer: "According to our Enterprise Plan documentation
(source: docs/pricing/enterprise.md), the system requirements are:
- Minimum 8 CPU cores
- 32 GB RAM
- 100 GB SSD storage
- Linux (Ubuntu 20.04+ or RHEL 8+)
..."

Advanced RAG Architectures

Hybrid RAG: Vector + SQL

Combine semantic search with structured data for more comprehensive answers:

{
  "mcpServers": {
    "pinecone": {
      "command": "npx",
      "args": ["-y", "mcp-server-pinecone"],
      "env": {
        "PINECONE_API_KEY": "your_key",
        "PINECONE_ENVIRONMENT": "us-east-1-aws"
      }
    },
    "postgres": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-postgres",
        "postgresql://readonly:pass@host/db"
      ]
    },
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/path/to/docs"
      ]
    }
  }
}

Hybrid Retrieval Workflow:

User: "How did customer Acme Corp's usage change after
       upgrading to the enterprise plan?"

Claude's hybrid retrieval:
1. (Pinecone) query("Acme Corp enterprise upgrade") →
   Find relevant documentation and notes
2. (Postgres) query("SELECT plan, usage_hours, month
   FROM customer_usage
   WHERE customer_name = 'Acme Corp'
   ORDER BY month") → Get actual usage data
3. (Filesystem) read_file("accounts/acme-corp/notes.md") →
   Get account notes

Synthesized answer combines:
- Qualitative context from documents
- Quantitative data from the database
- Internal notes from the filesystem

Multi-Source RAG

Connect multiple knowledge sources for comprehensive retrieval:

Knowledge Sources for a Support RAG System:

1. Vector DB (Pinecone)
   └── Product documentation (chunked and embedded)
   └── FAQ entries
   └── Past support ticket resolutions

2. SQL Database (Postgres)
   └── Customer account data
   └── Subscription and billing info
   └── Feature flags and configurations

3. Notion
   └── Internal knowledge base
   └── Engineering runbooks
   └── Product decision documents

4. Slack (search)
   └── Recent engineering discussions
   └── Incident postmortems

The AI intelligently routes queries to the right source:

  • "How do I configure SSO?" → Vector DB (product docs)
  • "What plan is customer X on?" → SQL Database
  • "Why did we deprecate feature Y?" → Notion (product decisions)
  • "Is there a known issue with the API?" → Slack (recent discussions)

Agentic RAG

Agentic RAG adds autonomous behavior to the retrieval process. Instead of a single retrieval step, the AI iteratively searches, evaluates, and refines until it has enough context:

User: "Explain the technical architecture of our payment system
       and identify potential single points of failure"

Agentic RAG workflow:
Step 1: Initial retrieval
  → query("payment system architecture") → 5 chunks
  → AI evaluates: "Found high-level overview but missing
     database and queue architecture details"

Step 2: Targeted follow-up
  → query("payment database schema") → 3 chunks
  → query("payment message queue configuration") → 3 chunks
  → AI evaluates: "Now have DB and queue info. Need failure
     mode documentation."

Step 3: Specific search
  → query("payment system failure modes SLA") → 4 chunks
  → (Postgres) query architecture_decisions table
  → AI evaluates: "Sufficient context to answer comprehensively"

Step 4: Generate answer
  → Synthesize all retrieved context into a detailed
     architectural analysis with SPOF identification
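
The loop above can be sketched as a small controller. `search` and `evaluate` are assumptions about your stack, injected as callables: `search(query)` wraps a vector DB MCP query, and `evaluate(question, context)` is an LLM call that either returns follow-up queries or an empty list when the context is sufficient:

```python
import asyncio

async def agentic_retrieve(question, search, evaluate, max_rounds=4):
    """Iterative retrieval controller: search, evaluate, refine,
    capped at max_rounds to bound latency and cost."""
    context, queries = [], [question]
    for _ in range(max_rounds):
        for q in queries:
            context.extend(await search(q))
        queries = await evaluate(question, context)
        if not queries:  # enough context gathered
            break
    return context
```

Capping the number of rounds matters in practice: without it, an over-eager evaluator can keep issuing follow-up queries indefinitely.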

Document Processing Pipeline

Chunking Strategies

How you split documents into chunks significantly impacts retrieval quality:

| Strategy | Chunk Size | Overlap | Best For |
| --- | --- | --- | --- |
| Fixed-size | 500 tokens | 50 tokens | General purpose |
| Sentence-based | 3-5 sentences | 1 sentence | Q&A, factual content |
| Paragraph-based | Natural paragraphs | None | Well-structured docs |
| Semantic | Variable | N/A | Mixed-format content |
| Heading-based | Section content | Include heading | Technical documentation |
| Recursive | Variable, decreasing | Variable | Long documents |

Recommended Chunking Approach

from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

# split_by_headings, token_count, and split_with_overlap are assumed
# helpers: heading-aware splitting into (heading, level, content)
# sections, tokenizer-based counting, and overlapping sliding-window
# splitting, respectively.

def chunk_document(text: str, max_tokens: int = 500, overlap: int = 50):
    """
    Chunk a document with the following strategy:
    1. Split by headings first (preserve document structure)
    2. If a section exceeds max_tokens, split by paragraphs
    3. If a paragraph exceeds max_tokens, split by sentences
    4. Add overlap between chunks for context preservation
    """
    sections = split_by_headings(text)
    chunks = []

    for section in sections:
        if token_count(section.content) <= max_tokens:
            chunks.append(Chunk(
                text=section.content,
                metadata={
                    "heading": section.heading,
                    "level": section.level
                }
            ))
        else:
            # Recursively split oversized sections with overlap
            sub_chunks = split_with_overlap(
                section.content,
                max_tokens=max_tokens,
                overlap=overlap
            )
            for sub in sub_chunks:
                chunks.append(Chunk(
                    text=sub,
                    metadata={
                        "heading": section.heading,
                        "level": section.level,
                        "is_partial": True
                    }
                ))

    return chunks

Metadata Enrichment

Rich metadata improves retrieval precision:

chunk_metadata = {
    "source": "docs/api/authentication.md",
    "title": "API Authentication Guide",
    "section": "OAuth 2.0 Configuration",
    "document_type": "technical_documentation",
    "last_updated": "2026-02-15",
    "author": "Engineering Team",
    "tags": ["api", "authentication", "oauth", "security"],
    "version": "2.1",
    "word_count": 450,
    "chunk_index": 3,
    "total_chunks": 12
}

Metadata enables filtered retrieval:

# Only search recent documentation
query(text="oauth configuration",
      filter={"last_updated": {"$gte": "2026-01-01"}})

# Only search API documentation
query(text="rate limiting",
      filter={"document_type": "technical_documentation",
              "tags": {"$contains": "api"}})

Embedding Models and Strategies

Choosing an Embedding Model

| Model | Dimensions | Performance | Cost | Best For |
| --- | --- | --- | --- | --- |
| OpenAI text-embedding-3-large | 3072 | Excellent | Paid API | Production RAG |
| OpenAI text-embedding-3-small | 1536 | Good | Affordable | General use |
| Cohere embed-v3 | 1024 | Excellent | Paid API | Multilingual |
| all-MiniLM-L6-v2 | 384 | Good | Free (local) | Local/offline |
| BGE-large-en | 1024 | Very Good | Free (local) | Self-hosted production |
| Voyage AI voyage-2 | 1024 | Excellent | Paid API | Code and technical |

Embedding Integration

Some vector database MCP servers handle embedding automatically (Chroma, Weaviate with modules). For others, you generate embeddings before insertion:

# Embedding generation as part of the ingestion pipeline
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks
    )
    return [item.embedding for item in response.data]

# Then upsert via the MCP vector database server, e.g.:
# pinecone.upsert(vectors=[(id, embedding, metadata), ...])

Performance Optimization

Retrieval Quality

| Technique | Description | Impact |
| --- | --- | --- |
| Query expansion | Generate multiple search queries from the user's question | Higher recall |
| Reranking | Re-score retrieved results with a cross-encoder | Higher precision |
| Contextual retrieval | Include surrounding context in chunks | Better coherence |
| Hypothetical documents | Generate a hypothetical answer and embed it | Better retrieval for abstract queries |
| Metadata filtering | Pre-filter by date, source, or category | Faster, more relevant |

Query Expansion Example

User question: "How do I reset my password?"

Expanded queries:
1. "password reset process"
2. "forgot password recovery"
3. "account access recovery steps"
4. "change password instructions"

Each query retrieves different relevant chunks,
increasing the chance of finding the best answer.
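
Merging the result sets from expanded queries needs deduplication, since the same chunk often matches several queries. A minimal sketch, assuming each result is a `(chunk_id, relevance_score)` pair with higher scores better:

```python
def merge_results(result_sets):
    """result_sets: one list per expanded query. Dedupes chunks that
    matched several queries, keeping each chunk's best score, and
    returns chunk ids ranked by that score."""
    best = {}
    for results in result_sets:
        for chunk_id, score in results:
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score
    return sorted(best, key=best.get, reverse=True)
```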

Latency Optimization

| Strategy | Description |
| --- | --- |
| Fewer chunks | Retrieve 3-5 chunks instead of 10+ |
| Smaller embeddings | Use 384 or 1024 dimensions instead of 3072 |
| Local vector DB | Run Chroma or Qdrant locally for minimal latency |
| Caching | Cache embeddings and frequent query results |
| Approximate search | Use ANN (Approximate Nearest Neighbors) with HNSW indices |
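
Caching is the cheapest of these wins. A sketch of an in-memory embedding cache keyed by content hash; `embed_fn` is injected so the cache stays provider-agnostic, and the class itself is illustrative rather than any library's API:

```python
import hashlib

class EmbeddingCache:
    """In-memory embedding cache keyed by content hash. `embed_fn` is
    whatever embedding call you use (OpenAI, a local model)."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._cache = {}
        self.hits = 0

    def embed(self, text):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self._cache:
            self.hits += 1  # unchanged text: skip the embedding call
        else:
            self._cache[key] = self.embed_fn(text)
        return self._cache[key]
```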

RAG Evaluation

Building an Evaluation Dataset

Create a golden dataset of questions with expected answers and source documents:

[
  {
    "question": "What is the maximum file upload size?",
    "expected_answer": "The maximum file upload size is 100MB for free plans and 1GB for enterprise plans.",
    "expected_sources": ["docs/api/uploads.md"],
    "category": "factual"
  },
  {
    "question": "How do I configure SSO with Okta?",
    "expected_answer": "Configure SSO with Okta by...",
    "expected_sources": ["docs/admin/sso.md", "docs/admin/okta-setup.md"],
    "category": "procedural"
  }
]

Evaluation Metrics

| Metric | Measures | Target |
| --- | --- | --- |
| Retrieval Recall@5 | % of relevant docs in top 5 results | > 90% |
| Retrieval Precision@5 | % of top 5 results that are relevant | > 70% |
| Answer Accuracy | Does the answer match the expected answer? | > 85% |
| Groundedness | Is the answer supported by retrieved docs? | > 95% |
| Answer Completeness | Does the answer cover all aspects of the question? | > 80% |
| Latency (p95) | Time from query to response | < 5s |
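
The two retrieval metrics reduce to a few lines once you have the retrieved and expected source lists. A sketch, matching at the document level by source path as in the golden dataset format below:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of expected sources that appear in the top k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top k results that are expected sources."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in set(relevant)) / len(top)
```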

Automated Evaluation Pipeline

# generate_answer, judge_accuracy, judge_groundedness, and
# aggregate_metrics are assumed helpers: an answer-generation call,
# LLM-as-judge scoring, and simple metric averaging.
async def evaluate_rag(test_cases, mcp_client):
    results = []
    for case in test_cases:
        # Run the RAG pipeline
        retrieved = await mcp_client.call_tool(
            "query",
            {"query_text": case["question"], "n_results": 5}
        )

        # Check retrieval quality
        retrieved_sources = [r["metadata"]["source"] for r in retrieved]
        recall = len(
            set(retrieved_sources) & set(case["expected_sources"])
        ) / len(case["expected_sources"])

        # Generate answer
        answer = await generate_answer(case["question"], retrieved)

        # Check answer quality (using LLM-as-judge)
        accuracy = await judge_accuracy(
            answer, case["expected_answer"]
        )
        groundedness = await judge_groundedness(
            answer, retrieved
        )

        results.append({
            "question": case["question"],
            "retrieval_recall": recall,
            "answer_accuracy": accuracy,
            "groundedness": groundedness
        })

    return aggregate_metrics(results)

Production RAG Best Practices

1. Keep Documents Fresh

Stale embeddings lead to outdated answers:

  • Schedule regular re-indexing (daily or weekly)
  • Implement change detection for source documents
  • Version your embeddings (track which model and chunk settings were used)
  • Maintain a refresh log to track when each document was last indexed

2. Handle "I Don't Know" Gracefully

When the knowledge base does not contain an answer:

  • Set a similarity threshold (e.g., cosine similarity > 0.7)
  • If no chunks meet the threshold, respond honestly: "I could not find information about this in the knowledge base"
  • Suggest alternative queries or direct the user to the right resource
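
A similarity-threshold guard can be a one-liner in the answer pipeline. A sketch, assuming results arrive as `(chunk_text, cosine_similarity)` pairs and using the 0.7 cutoff from above (tune it for your embedding model):

```python
NO_ANSWER = "I could not find information about this in the knowledge base."

def grounded_context(results, threshold=0.7):
    """results: (chunk_text, cosine_similarity) pairs. Returns the
    chunks above the threshold, or None to signal that the honest
    NO_ANSWER reply should be used instead of letting the model guess."""
    kept = [text for text, sim in results if sim >= threshold]
    return kept or None
```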

3. Provide Source Citations

Always cite the source documents in RAG responses:

Based on our API documentation (source: docs/api/auth.md,
updated 2026-02-10), the OAuth 2.0 flow requires:
1. Register your application at...
2. Configure redirect URIs...

4. Monitor and Iterate

Track these metrics in production:

  • Most common queries with no results (gaps in knowledge base)
  • User feedback on answer quality (thumbs up/down)
  • Retrieval latency distribution
  • Embedding storage growth over time

RAG Implementation Patterns

Pattern: Conversational RAG

Maintain conversation context across multiple retrieval queries:

Turn 1:
  User: "What's our API rate limit policy?"
  → Retrieve: API documentation chunks
  → Answer: "The rate limit is 1000 requests per minute..."

Turn 2:
  User: "How do enterprise customers get higher limits?"
  → Context: Previous turn was about rate limits
  → Retrieve: Enterprise plan + rate limit documentation
  → Answer: "Enterprise customers can request increased limits..."

Turn 3:
  User: "What's the process to request that?"
  → Context: Enterprise rate limit increases
  → Retrieve: Process documentation for limit increases
  → Answer: "To request a rate limit increase, contact your account manager..."

Each turn builds on the previous, and the retrieval query incorporates conversation context for more accurate results.
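
A crude but serviceable way to carry conversation context into retrieval is to fold recent turns into the query. A sketch; in practice, asking the model to rewrite the question as a standalone query works better:

```python
def contextualize_query(history, question):
    """Folds the last two user turns into the retrieval query so that
    follow-ups like "What's the process to request that?" retrieve
    against the real topic rather than the bare pronoun."""
    recent = " ".join(history[-2:])
    return f"{recent} {question}".strip()
```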

Pattern: Multi-Hop RAG

Some questions require chaining multiple retrievals:

User: "Who approved the change that caused the production outage last week?"

Multi-hop retrieval:
1. First hop: Search for "production outage last week"
   → Found: Incident report mentioning PR #456

2. Second hop: Search for "PR #456 approval"
   → Found: PR review with approving reviewer

3. Third hop: Search for reviewer's role and authority
   → Found: Reviewer is team lead, authorized approver

Answer: "PR #456, which caused the outage, was approved by
         Jane Smith (Engineering Team Lead) on February 18th.
         The change modified the database connection pool
         configuration."

Pattern: Summarization RAG

For questions requiring synthesis across many documents:

User: "Summarize all customer feedback about our onboarding process"

Workflow:
1. Retrieve all chunks tagged "customer_feedback" + "onboarding"
   (may return 50+ chunks)
2. Group by theme:
   - Documentation quality (15 mentions)
   - Setup complexity (12 mentions)
   - Time to first value (8 mentions)
   - Missing features (6 mentions)
3. Summarize each theme with representative quotes
4. Provide overall sentiment analysis

Common Pitfalls and Solutions

Pitfall 1: Irrelevant Retrieval

Problem: Vector search returns chunks that are semantically similar but not actually relevant.

Solution:

  • Add metadata filtering to narrow the search scope
  • Use reranking models to improve precision
  • Improve chunk quality (better boundaries, more context)
  • Fine-tune embedding models on your domain data

Pitfall 2: Lost Context at Chunk Boundaries

Problem: The answer spans two chunks, and neither chunk alone is sufficient.

Solution:

  • Use overlapping chunks (10-20% overlap)
  • Retrieve surrounding chunks when a match is found
  • Use larger chunk sizes for content that flows naturally
  • Include section headings in every chunk for context

Pitfall 3: Stale Embeddings

Problem: The knowledge base has been updated but embeddings are outdated.

Solution:

  • Implement change detection with file modification timestamps
  • Re-embed changed documents on a schedule (daily or on-change)
  • Version your embeddings so you can track what was indexed when
  • Display last-updated timestamps in RAG responses
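
Change detection via content hashes takes only a few lines. A sketch that keeps a JSON manifest of last-indexed hashes; the manifest filename and the `.md`-only scan are illustrative choices, not a standard:

```python
import hashlib
import json
import pathlib

def changed_files(root, manifest_path="index-manifest.json"):
    """Returns the .md files under `root` whose content changed since
    the last run, and updates the manifest for the next run."""
    manifest_file = pathlib.Path(manifest_path)
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    stale = []
    for path in sorted(pathlib.Path(root).rglob("*.md")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if manifest.get(str(path)) != digest:
            stale.append(str(path))  # needs re-chunking and re-embedding
            manifest[str(path)] = digest
    manifest_file.write_text(json.dumps(manifest))
    return stale
```

Run this on a schedule and feed the returned paths into the ingestion pipeline, upserting by document-derived chunk ids so old embeddings are replaced.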

Pitfall 4: Hallucination Despite Retrieval

Problem: The AI generates plausible-sounding information that is not in the retrieved context.

Solution:

  • Instruct the model to only answer from provided context
  • Require citations for every factual claim
  • Use lower temperature settings for factual responses
  • Implement automated groundedness checks

Pitfall 5: Context Window Overflow

Problem: Too many retrieved chunks overflow the AI's context window.

Solution:

  • Retrieve fewer chunks (3-5 instead of 10+)
  • Use more precise queries to improve retrieval quality
  • Summarize retrieved chunks before adding to context
  • Implement a two-stage retrieval: broad first, then narrow
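
A simple token-budget trim implements the "retrieve broad, then narrow" idea. A sketch, approximating token counts by word counts (swap in a real tokenizer for production):

```python
def fit_to_budget(chunks, max_tokens=4000):
    """chunks: (text, relevance_score) pairs. Greedily keeps the
    best-scoring chunks that fit the context budget."""
    kept, used = [], 0
    for text, score in sorted(chunks, key=lambda c: c[1], reverse=True):
        cost = len(text.split())  # crude token estimate
        if used + cost <= max_tokens:
            kept.append(text)
            used += cost
    return kept
```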

Cost Optimization for RAG

| Cost Component | Optimization Strategy |
| --- | --- |
| Embedding generation | Batch embeddings, cache results, use smaller models for low-priority content |
| Vector DB storage | Quantize vectors, use lower dimensions, archive old data |
| Vector DB queries | Optimize query count, cache frequent queries, use metadata filters |
| AI model tokens | Reduce chunk count, summarize context, use smaller models for simple queries |
| Infrastructure | Right-size vector DB instances, use spot instances for embedding jobs |

Cost Breakdown Example

For a RAG system processing 1,000 queries per day:

| Component | Monthly Cost Estimate |
| --- | --- |
| Embedding generation (initial) | $50-200 (one-time per corpus update) |
| Vector DB hosting | $50-200 (Pinecone starter or self-hosted) |
| AI model queries | $100-500 (depends on model and context length) |
| Compute (MCP servers) | $20-100 (small instances or serverless) |
| Total | $220-1,000/month |

Self-hosted options (Chroma, Qdrant) can reduce vector DB costs to near-zero for smaller deployments.

RAG for Different Content Types

Different types of organizational content require different RAG strategies. Here is a guide to optimizing your RAG pipeline for common content categories.

Technical Documentation RAG

Technical documentation has unique characteristics that require specialized handling:

  • Code blocks: Preserve code blocks as atomic chunks rather than splitting them
  • Cross-references: Maintain links between related sections to enable multi-hop retrieval
  • Version sensitivity: Include version numbers in metadata to avoid returning outdated instructions
  • API references: Chunk by endpoint or function, including all parameters and examples in each chunk

Legal and Compliance Document RAG

Legal documents require high precision and complete citation:

| Requirement | RAG Implementation |
| --- | --- |
| Exact quotation | Store original text alongside chunks, never paraphrase in retrieval |
| Section references | Include clause numbers, section headers, and page numbers in metadata |
| Temporal accuracy | Track effective dates and amendments, filter by applicable date |
| Jurisdictional scope | Tag documents by jurisdiction, filter queries by relevant jurisdiction |
| Completeness | Retrieve entire relevant sections rather than individual paragraphs |

Customer Support Knowledge Base RAG

Support content benefits from retrieval strategies optimized for resolution:

Optimized support RAG metadata:
{
  "source": "kb/billing/refund-process.md",
  "title": "How to Process a Refund",
  "category": "billing",
  "product": "SaaS Pro",
  "resolution_type": "self-service",
  "avg_resolution_time": "5 minutes",
  "related_articles": ["kb/billing/credits.md", "kb/billing/invoices.md"],
  "last_verified": "2026-02-01",
  "success_rate": 0.92
}

By including resolution metadata, the RAG system can prioritize articles with high success rates and flag articles that may need updating based on declining effectiveness.

Choosing the Right RAG Architecture

| Content Volume | Update Frequency | Query Volume | Recommended Architecture |
| --- | --- | --- | --- |
| < 1,000 docs | Weekly | Low (< 100/day) | Local Chroma + filesystem MCP |
| 1K - 10K docs | Daily | Medium (100-1K/day) | Qdrant self-hosted + scheduled re-indexing |
| 10K - 100K docs | Continuous | High (1K+/day) | Pinecone managed + event-driven indexing |
| 100K+ docs | Continuous | Very high | Distributed vector DB + dedicated embedding service |

Matching your architecture to your actual scale prevents both over-engineering and under-provisioning, keeping cost, performance, and maintenance burden in balance. As your document corpus grows, revisit these decisions periodically so your RAG infrastructure continues to meet performance and reliability expectations.


Frequently Asked Questions

What is RAG and how does MCP enable it?

Retrieval-Augmented Generation (RAG) is a technique where an AI retrieves relevant documents from a knowledge base before generating a response, grounding its answers in real data. MCP enables RAG by providing standardized connections to vector databases (for similarity search), document servers (for source documents), and databases (for structured data). Instead of building custom retrieval pipelines, you connect MCP servers and the AI orchestrates the retrieval and generation automatically.

Which MCP servers do I need for a basic RAG setup?

A basic RAG setup requires two MCP servers: (1) a vector database server (Pinecone, Chroma, Qdrant, or Weaviate MCP) for storing and searching embeddings, and (2) a document source server (filesystem MCP, Google Drive MCP, or Notion MCP) for accessing the original documents. Optionally, add a database MCP server for structured data retrieval to complement the vector search.

How do I ingest documents into a vector database through MCP?

The ingestion workflow uses multiple MCP servers: (1) filesystem or document server reads the source files, (2) the AI or a processing pipeline chunks the documents into passages, (3) an embedding model generates vectors for each chunk, and (4) the vector database server stores the embeddings with metadata. Some vector database MCP servers (like Chroma) handle embedding generation automatically.

What is the difference between RAG with MCP and traditional RAG pipelines?

Traditional RAG pipelines are coded as fixed applications (using LangChain, LlamaIndex, etc.). MCP-based RAG is dynamic — the AI decides at runtime which documents to retrieve, which database to query, and how to combine results. MCP RAG is more flexible (the AI adapts its retrieval strategy per query) but traditional pipelines offer more control over the exact retrieval and ranking logic.

Can I use multiple data sources in a single RAG query?

Yes. This is a key advantage of MCP-based RAG. The AI can query a vector database for semantic search, a SQL database for structured data, and a filesystem for recent documents — all in the same workflow. The AI determines which sources to consult based on the query, and synthesizes information from all sources into a single response.

How do I handle document updates in MCP-based RAG?

Implement an update pipeline: (1) monitor source documents for changes (filesystem watcher or webhook), (2) re-chunk and re-embed changed documents, (3) upsert new embeddings into the vector database (replacing old versions by document ID). Some MCP server setups can automate this by periodically scanning for changes. For real-time updates, consider event-driven architectures with webhooks.

What chunk sizes work best for RAG with MCP?

Optimal chunk sizes depend on your content and use case. General guidelines: 200-500 tokens for Q&A over factual content, 500-1000 tokens for detailed technical documentation, 1000-2000 tokens for long-form analysis. Include overlap between chunks (10-20%) to preserve context. Start with 500 tokens and adjust based on retrieval quality.

How do I evaluate the quality of my MCP RAG system?

Evaluate RAG quality across three dimensions: (1) Retrieval quality — are the right documents being found? Measure with precision, recall, and MRR (Mean Reciprocal Rank), (2) Answer quality — is the AI generating correct answers from retrieved context? Use human evaluation or automated metrics, (3) Groundedness — does the answer cite the retrieved sources? Check for hallucinations beyond the retrieved context.
