LLM Fine-Tuning vs RAG: When to Use Which

Posted on Sat 18 April 2026 in GenAI


A practical decision framework for teams building with LLMs — with real trade-offs, cost analysis, and when to combine both


The Core Question

You're building an AI product. Your LLM doesn't know your data, your domain, or your tone. How do you fix that?

Two approaches dominate:

  • RAG (Retrieval-Augmented Generation): Give the model relevant information at query time by retrieving it from a knowledge base.
  • Fine-Tuning: Re-train the model on your data so the knowledge is baked into the weights.

Both work. Both have real trade-offs. Picking the wrong one costs months and thousands of dollars. This guide gives you a clear framework for deciding.


What Is RAG?

RAG keeps the base model frozen and dynamically injects relevant context at inference time.

User Query
    ↓
[Embed query] → [Search vector DB] → [Retrieve top-k chunks]
    ↓
[Augmented prompt: retrieved chunks + original query]
    ↓
LLM generates answer grounded in retrieved context

The pipeline:

from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone(api_key="YOUR_KEY")
index = pc.Index("knowledge-base")

def rag_query(user_question: str) -> str:
    # 1. Embed the question
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=user_question
    ).data[0].embedding

    # 2. Retrieve relevant chunks
    results = index.query(vector=embedding, top_k=5, include_metadata=True)
    context = "\n\n".join([r.metadata["text"] for r in results.matches])

    # 3. Generate grounded answer
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context. If the answer isn't in the context, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_question}"}
        ]
    )
    return response.choices[0].message.content

What Is Fine-Tuning?

Fine-tuning continues training a pre-trained model on your dataset, updating its weights to encode new knowledge, style, or behavior.

Base Model (frozen knowledge)
    ↓
[Your training data: (prompt, ideal_response) pairs]
    ↓
[Gradient updates via supervised learning]
    ↓
Fine-Tuned Model (knowledge baked into weights)

Training data format (OpenAI JSONL):

{"messages": [{"role": "system", "content": "You are a support agent for Acme SaaS."}, {"role": "user", "content": "How do I reset my API key?"}, {"role": "assistant", "content": "To reset your API key: go to Settings → API → Regenerate Key. Your old key is immediately invalidated."}]}
{"messages": [{"role": "system", "content": "You are a support agent for Acme SaaS."}, {"role": "user", "content": "What's the rate limit on the free plan?"}, {"role": "assistant", "content": "Free plan: 100 requests/minute, 10,000 requests/month. Upgrade to Pro for 1,000 req/min."}]}

Launching a fine-tune (OpenAI):

from openai import OpenAI

client = OpenAI()

# Upload training file
with open("training_data.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

# Start fine-tune job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18"
)

print(f"Fine-tune job started: {job.id}")
# Monitor: client.fine_tuning.jobs.retrieve(job.id)

Head-to-Head Comparison

| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Knowledge updates | Real-time — just update the DB | Requires retraining (hours/days) |
| Data freshness | Always current | Stale until retrained |
| Setup complexity | Medium (pipeline + vector DB) | High (data prep + training loop) |
| Cost to update | Low (upsert new docs) | High (full training run) |
| Inference cost | Higher (embedding + retrieval + generation) | Lower (just generation) |
| Handles new facts | ✅ Excellent | ❌ Needs retraining |
| Changes model behavior/style | ❌ Limited | ✅ Excellent |
| Reduces hallucination | ✅ Strong (grounded in retrieved text) | ⚠️ Moderate |
| Data requirements | Documents/chunks | 50–1000+ (prompt, response) pairs |
| Transparency | High (can cite sources) | Low (black box) |
| Privacy | Data stays in your DB | Data sent to training provider |
| Time to production | Days | Weeks |

When to Choose RAG

RAG is the right default for most teams. Choose it when:

✅ Your knowledge changes frequently

News, product documentation, pricing, inventory, policy — anything that updates weekly, daily, or in real time. Retraining a model every time your docs change is impractical. RAG lets you update your knowledge base and the model immediately reflects it.

Good RAG use cases:
- Internal company knowledge base assistant
- Customer support bot with evolving product docs
- Legal document Q&A (regulations change)
- E-commerce catalog search & Q&A
- News summarization / research assistant

✅ You need source citations

RAG retrieves specific chunks — you always know which document the answer came from. This is essential for compliance, legal, and medical contexts where "the AI told me" isn't sufficient.

✅ You have large volumes of long-tail knowledge

A model can't memorize 50,000 support articles. RAG surfaces the right 3 at query time. Fine-tuning on 50,000 articles would require enormous training data and still wouldn't guarantee retrieval of the right fact.

✅ You're prototyping or iterating fast

Stand up a RAG pipeline in a day. Fine-tuning takes weeks of data preparation, training, and evaluation. Ship with RAG, decide later if fine-tuning adds enough value.

✅ Reducing hallucinations is the priority

By forcing the model to answer from retrieved context, RAG significantly reduces hallucinations on factual questions. It's not perfect, but it's the most reliable grounding technique available today.


When to Choose Fine-Tuning

Fine-tuning earns its cost when RAG can't solve the problem.

✅ You need to change how the model behaves, not just what it knows

RAG adds context. Fine-tuning changes behavior. If you need the model to consistently write in your brand's voice, follow a specific output schema every time, or reason like a domain expert — fine-tuning is the lever.

Good fine-tuning use cases:
- Consistent brand/tone across all outputs
- Domain-specific reasoning (medical diagnosis, legal analysis)
- Structured output compliance (always return valid JSON schema)
- Code generation in your internal framework/style
- Language localization (dialect, formality level)

✅ You have a well-defined, stable task

Fine-tuning excels at narrow, repeated tasks with clear right answers. Classify this support ticket. Extract these fields from this document. Convert this natural language query to SQL.
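
Tasks like these also make evaluation cheap: each input has a single right answer, so a plain exact-match score is often enough to compare a fine-tuned model against the base. A sketch (the helper name and normalization rules are illustrative assumptions):

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answer,
    after trimming whitespace and normalizing case."""
    assert len(predictions) == len(references)
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

# Example: grading a ticket-classification task
preds = ["billing", "bug", "billing"]
golds = ["billing", "bug", "feature-request"]
print(exact_match_accuracy(preds, golds))  # 2 of 3 correct
```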

✅ Latency and cost matter at scale

RAG requires an embedding call + vector search + generation. Fine-tuning requires only generation. At very high volume (millions of queries/day), that difference matters. Fine-tuned smaller models can also match GPT-4 quality on narrow tasks at a fraction of the cost.

# Fine-tuned gpt-4o-mini for SQL generation
# vs. RAG + gpt-4o for same task
# Cost difference at 1M queries/day: ~$800/day vs ~$120/day
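
The figures in the comment above are rough assumptions, but the arithmetic behind them is simple to sanity-check: a daily cost at volume is just per-query cost times query count.

```python
# Per-query costs implied by the figures above (assumptions, not vendor quotes):
#   RAG + gpt-4o:            ~$0.0008 per query
#   fine-tuned gpt-4o-mini:  ~$0.00012 per query
QUERIES_PER_DAY = 1_000_000

def daily_cost(per_query_usd: float, queries: int = QUERIES_PER_DAY) -> float:
    return per_query_usd * queries

print(f"${daily_cost(0.0008):,.0f}/day")   # $800/day
print(f"${daily_cost(0.00012):,.0f}/day")  # $120/day
```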

✅ You have high-quality labeled examples (50+)

Fine-tuning requires (prompt, ideal_response) pairs. If you've already logged thousands of correct interactions, or have domain experts who can label examples, that's the signal you need.

✅ The task requires reasoning patterns, not facts

Teaching a model how to think about a problem (legal reasoning, medical differential diagnosis, financial analysis frameworks) is better done through fine-tuning than RAG. You're not injecting facts — you're adjusting the reasoning process.


When to Use Both

The most powerful production systems combine both. This is called Fine-Tuned RAG or Retrieval-Augmented Fine-Tuning (RAFT).

Fine-Tuning handles:          RAG handles:
- Output format               - Current facts
- Domain reasoning style      - Specific document retrieval
- Consistent tone             - Source citation
- Task-specific behavior      - Knowledge updates

Real-world example — Cursor (AI code editor):
- Fine-tuned on code understanding, editing patterns, and diff formats
- RAG over your local codebase for file-specific context

Real-world example — Medical AI assistant:
- Fine-tuned on clinical reasoning patterns and medical note formats
- RAG over current drug databases, clinical guidelines, and patient records

Implementation pattern:

def fine_tuned_rag_query(user_question: str) -> str:
    # Step 1: Retrieve relevant context (RAG)
    context = retrieve_context(user_question)

    # Step 2: Query fine-tuned model with retrieved context
    response = client.chat.completions.create(
        model="ft:gpt-4o-mini:your-org:your-model-id",  # fine-tuned model
        messages=[
            {"role": "system", "content": DOMAIN_SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_question}"}
        ]
    )
    return response.choices[0].message.content

Cost & Complexity Analysis

RAG Cost Profile

| Component | One-time | Ongoing |
|---|---|---|
| Embedding documents | $5–50 (1M tokens) | Per update |
| Vector DB hosting | — | $70–700/mo (Pinecone) or free (self-hosted) |
| Inference (per query) | — | ~$0.002–0.01/query |
| Engineering setup | 2–5 days | Low maintenance |

Fine-Tuning Cost Profile

| Component | One-time | Ongoing |
|---|---|---|
| Data preparation | 1–4 weeks | Per retrain |
| Training run | $50–500 (small model) | Per retrain |
| Evaluation | 1–2 weeks | Per retrain |
| Inference (per query) | — | ~30–50% cheaper than base |
| Engineering setup | 3–8 weeks | Medium maintenance |

Break-even rule of thumb: Fine-tuning starts making financial sense when you have >500K queries/month on a well-defined task, AND the task is stable enough that you won't need frequent retraining.
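
One way to apply the rule of thumb is to compare fixed plus per-query monthly costs at your expected volume. All numbers below are illustrative assumptions drawn loosely from the tables above; substitute your own vendor pricing.

```python
def monthly_cost(fixed_usd: float, per_query_usd: float, queries: int) -> float:
    return fixed_usd + per_query_usd * queries

# Illustrative assumptions (not quotes):
RAG_FIXED = 300.0      # vector DB hosting per month
RAG_PER_QUERY = 0.005  # embedding + retrieval + generation
FT_FIXED = 1300.0      # amortized retraining + evaluation per month
FT_PER_QUERY = 0.003   # generation only, smaller model

for queries in (50_000, 500_000, 2_000_000):
    rag = monthly_cost(RAG_FIXED, RAG_PER_QUERY, queries)
    ft = monthly_cost(FT_FIXED, FT_PER_QUERY, queries)
    cheaper = "fine-tuning" if ft < rag else "RAG"
    print(f"{queries:>9,} queries/mo: RAG ${rag:,.0f} vs FT ${ft:,.0f} -> {cheaper}")
```

With these particular numbers the curves cross at 500K queries/month; below that, RAG's lower fixed cost wins.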


Decision Framework

Is your knowledge dynamic (changes weekly or more)?
  └─ Yes → RAG

Do you need source citations?
  └─ Yes → RAG

Is your primary problem behavior/style/reasoning consistency?
  └─ Yes → Fine-Tuning

Do you have 50+ high-quality labeled (prompt, response) pairs?
  └─ No → RAG (you're not ready for fine-tuning)
  └─ Yes → Fine-Tuning is viable

Is latency/cost critical at >500K queries/month?
  └─ Yes → Consider Fine-Tuning or Fine-Tuned RAG

Are you still iterating on the product?
  └─ Yes → RAG (faster to change)
  └─ No, task is stable → Fine-Tuning

Do you need both domain behavior AND current knowledge?
  └─ Yes → Fine-Tuned RAG
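
The same rules can be encoded as a function if you want the framework executable. This is one possible encoding (names and rule ordering are an assumption, not canonical); the first matching rule wins.

```python
def recommend(
    dynamic_knowledge: bool,        # changes weekly or more
    needs_citations: bool,
    behavior_is_main_problem: bool, # style/schema/reasoning consistency
    labeled_pairs: int,             # high-quality (prompt, response) pairs
    high_volume: bool,              # >500K queries/month, latency-sensitive
    still_iterating: bool,
    needs_behavior_and_facts: bool,
) -> str:
    if needs_behavior_and_facts:
        return "Fine-Tuned RAG"
    if dynamic_knowledge or needs_citations or still_iterating:
        return "RAG"
    if labeled_pairs < 50:
        return "RAG (not ready for fine-tuning)"
    if behavior_is_main_problem or high_volume:
        return "Fine-Tuning"
    return "RAG (safe default)"

print(recommend(True, False, False, 0, False, True, False))    # RAG
print(recommend(False, False, True, 200, True, False, False))  # Fine-Tuning
```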

Default recommendation: Start with RAG. It's faster, cheaper to iterate, and solves 80% of use cases. Add fine-tuning only when you have a stable task, quality training data, and a clear gap that RAG can't close.


Implementation Quickstart

RAG in 30 minutes (Chroma + OpenAI)

pip install chromadb openai

import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("knowledge-base")

def add_documents(docs: list[str]):
    # Batch all documents into a single embeddings call
    response = client.embeddings.create(
        model="text-embedding-3-small", input=docs
    )
    collection.add(
        documents=docs,
        embeddings=[d.embedding for d in response.data],
        ids=[f"doc_{i}" for i in range(len(docs))]
    )

def query(question: str) -> str:
    q_emb = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    results = collection.query(query_embeddings=[q_emb], n_results=3)
    context = "\n".join(results["documents"][0])
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only from the context provided."},
            {"role": "user", "content": f"Context: {context}\n\nQ: {question}"}
        ]
    )
    return resp.choices[0].message.content

Fine-Tuning Checklist

  • [ ] Collect 50–1000 (prompt, ideal_response) examples
  • [ ] Ensure examples cover edge cases, not just easy ones
  • [ ] Format as JSONL with messages array (system, user, assistant)
  • [ ] Hold out 10–20% as validation set
  • [ ] Run fine-tune job (OpenAI, Together AI, or self-hosted with Axolotl)
  • [ ] Evaluate on validation set — compare to base model
  • [ ] A/B test in production with 5–10% traffic split
  • [ ] Set up retraining pipeline for when data drifts
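
The hold-out step in the checklist can be as simple as a seeded shuffle-and-split over the JSONL file. A sketch (the function name and output filenames `train.jsonl`/`val.jsonl` are arbitrary choices):

```python
import random

def split_jsonl(path: str, val_fraction: float = 0.15, seed: int = 42):
    """Shuffle records and hold out a validation slice."""
    with open(path) as f:
        lines = [ln for ln in f if ln.strip()]
    rng = random.Random(seed)  # fixed seed -> reproducible split
    rng.shuffle(lines)
    n_val = max(1, round(len(lines) * val_fraction))
    val, train = lines[:n_val], lines[n_val:]
    with open("train.jsonl", "w") as f:
        f.writelines(train)
    with open("val.jsonl", "w") as f:
        f.writelines(val)
    return len(train), len(val)
```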

Common Mistakes

RAG mistakes:
- Chunks too large — 500–1000 tokens per chunk is usually optimal. Larger chunks dilute relevance.
- No metadata filtering — Always filter by date, category, or source before vector search.
- Skipping re-ranking — Use a cross-encoder to re-rank retrieved chunks before passing to the LLM.
- Ignoring chunking strategy — Sentence-based chunking often beats fixed-size for prose documents.
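
A minimal version of sentence-based chunking: greedily pack whole sentences into chunks up to a word budget, using word count as a crude token proxy (an assumption; real pipelines usually count tokens with the model's tokenizer).

```python
import re

def sentence_chunks(text: str, max_words: int = 150) -> list[str]:
    """Greedy sentence-based chunking: never split mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        words = len(sent.split())
        if current and count + words > max_words:
            # Budget exceeded: close the current chunk, start a new one
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "First sentence here. Second sentence follows. A third one ends it."
print(sentence_chunks(doc, max_words=8))
# → ['First sentence here. Second sentence follows.', 'A third one ends it.']
```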

Fine-tuning mistakes:
- Too little data — Under 50 examples rarely produces meaningful improvement.
- Low-quality examples — 100 excellent examples beat 1,000 mediocre ones. Every time.
- Forgetting catastrophic forgetting — Fine-tuning can degrade general capability. Test broadly, not just on your task.
- No evaluation set — Without held-out validation, you can't tell if fine-tuning actually helped.
- Fine-tuning when prompt engineering would suffice — Try a well-crafted few-shot prompt first. You might not need fine-tuning at all.


Found this useful? ⭐ Star the repo and share it with your team. Have a use case or mistake I missed? Open an issue or submit a PR.