LLM Fine-Tuning vs RAG: When to Use Which
Posted on Sat 18 April 2026 in GenAI
A practical decision framework for teams building with LLMs — with real trade-offs, cost analysis, and when to combine both
Table of Contents
- The Core Question
- What Is RAG?
- What Is Fine-Tuning?
- Head-to-Head Comparison
- When to Choose RAG
- When to Choose Fine-Tuning
- When to Use Both
- Cost & Complexity Analysis
- Decision Framework
- Implementation Quickstart
- Common Mistakes
- Resources
The Core Question
You're building an AI product. Your LLM doesn't know your data, your domain, or your tone. How do you fix that?
Two approaches dominate:
- RAG (Retrieval-Augmented Generation): Give the model relevant information at query time by retrieving it from a knowledge base.
- Fine-Tuning: Re-train the model on your data so the knowledge is baked into the weights.
Both work. Both have real trade-offs. Picking the wrong one costs months and thousands of dollars. This guide gives you a clear framework for deciding.
What Is RAG?
RAG keeps the base model frozen and dynamically injects relevant context at inference time.
```
User Query
    ↓
[Embed query] → [Search vector DB] → [Retrieve top-k chunks]
    ↓
[Augmented prompt: retrieved chunks + original query]
    ↓
LLM generates answer grounded in retrieved context
```
The pipeline:
```python
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone(api_key="YOUR_KEY")
index = pc.Index("knowledge-base")

def rag_query(user_question: str) -> str:
    # 1. Embed the question
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=user_question
    ).data[0].embedding

    # 2. Retrieve relevant chunks
    results = index.query(vector=embedding, top_k=5, include_metadata=True)
    context = "\n\n".join(r.metadata["text"] for r in results.matches)

    # 3. Generate grounded answer
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context. If the answer isn't in the context, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_question}"}
        ]
    )
    return response.choices[0].message.content
```
What Is Fine-Tuning?
Fine-tuning continues training a pre-trained model on your dataset, updating its weights to encode new knowledge, style, or behavior.
```
Base Model (frozen knowledge)
    ↓
[Your training data: (prompt, ideal_response) pairs]
    ↓
[Gradient updates via supervised learning]
    ↓
Fine-Tuned Model (knowledge baked into weights)
```
Training data format (OpenAI JSONL):
```jsonl
{"messages": [{"role": "system", "content": "You are a support agent for Acme SaaS."}, {"role": "user", "content": "How do I reset my API key?"}, {"role": "assistant", "content": "To reset your API key: go to Settings → API → Regenerate Key. Your old key is immediately invalidated."}]}
{"messages": [{"role": "system", "content": "You are a support agent for Acme SaaS."}, {"role": "user", "content": "What's the rate limit on the free plan?"}, {"role": "assistant", "content": "Free plan: 100 requests/minute, 10,000 requests/month. Upgrade to Pro for 1,000 req/min."}]}
```
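A single malformed record can fail an entire training job, so it's worth sanity-checking the file before uploading. Here's a minimal sketch; `validate_jsonl_line` is a hypothetical helper, not part of any SDK:

```python
import json

def validate_jsonl_line(line: str) -> bool:
    """Check one JSONL record against the chat fine-tuning shape."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages")
    # Need at least a user turn and an assistant turn
    if not isinstance(messages, list) or len(messages) < 2:
        return False
    for msg in messages:
        if msg.get("role") not in {"system", "user", "assistant"} or "content" not in msg:
            return False
    # The final message is the completion the model learns to produce
    return messages[-1]["role"] == "assistant"

good = '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]}'
bad = '{"messages": [{"role": "user", "content": "hi"}]}'
```

Run it over every line of `training_data.jsonl` and fix failures before you pay for a training run.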
Launching a fine-tune (OpenAI):
```python
from openai import OpenAI

client = OpenAI()

# Upload training file
with open("training_data.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

# Start fine-tune job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18"
)
print(f"Fine-tune job started: {job.id}")

# Monitor: client.fine_tuning.jobs.retrieve(job.id)
```
Head-to-Head Comparison
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Knowledge updates | Real-time — just update the DB | Requires retraining (hours/days) |
| Data freshness | Always current | Stale until retrained |
| Setup complexity | Medium (pipeline + vector DB) | High (data prep + training loop) |
| Cost to update | Low (upsert new docs) | High (full training run) |
| Inference cost | Higher (embedding + retrieval + generation) | Lower (just generation) |
| Handles new facts | ✅ Excellent | ❌ Needs retraining |
| Changes model behavior/style | ❌ Limited | ✅ Excellent |
| Reduces hallucination | ✅ Strong (grounded in retrieved text) | ⚠️ Moderate |
| Data requirements | Documents/chunks | 50–1000+ (prompt, response) pairs |
| Transparency | High (can cite sources) | Low (black box) |
| Privacy | Data stays in your DB | Data sent to training provider |
| Time to production | Days | Weeks |
When to Choose RAG
RAG is the right default for most teams. Choose it when:
✅ Your knowledge changes frequently
News, product documentation, pricing, inventory, policy — anything that updates weekly, daily, or in real time. Retraining a model every time your docs change is impractical. RAG lets you update your knowledge base and the model immediately reflects it.
Good RAG use cases:
- Internal company knowledge base assistant
- Customer support bot with evolving product docs
- Legal document Q&A (regulations change)
- E-commerce catalog search & Q&A
- News summarization / research assistant
✅ You need source citations
RAG retrieves specific chunks — you always know which document the answer came from. This is essential for compliance, legal, and medical contexts where "the AI told me" isn't sufficient.
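One way to make those citations concrete is to carry source metadata through retrieval and label each chunk in the prompt so the model can reference it. This is a hypothetical sketch: `retrieved` stands in for the output of a real vector-DB query, and the filenames are invented:

```python
# Example retrieval output: each chunk keeps a pointer to its source document
retrieved = [
    {"text": "Refunds are processed within 5 business days.", "source": "refund-policy.md"},
    {"text": "Pro plan includes priority support.", "source": "pricing.md"},
]

def build_cited_context(chunks: list[dict]) -> str:
    # Number each chunk so the LLM can cite [1], [2], ... in its answer
    return "\n\n".join(
        f"[{i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, 1)
    )
```

Pass `build_cited_context(retrieved)` as the context block and instruct the model to cite chunk numbers; you can then map `[1]` back to `refund-policy.md` in your UI.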
✅ You have large volumes of long-tail knowledge
A model can't memorize 50,000 support articles. RAG surfaces the right 3 at query time. Fine-tuning on 50,000 articles would require enormous training data and still wouldn't guarantee retrieval of the right fact.
✅ You're prototyping or iterating fast
Stand up a RAG pipeline in a day. Fine-tuning takes weeks of data preparation, training, and evaluation. Ship with RAG, decide later if fine-tuning adds enough value.
✅ Reducing hallucinations is the priority
By forcing the model to answer from retrieved context, RAG significantly reduces hallucinations on factual questions. It's not perfect, but it's the most reliable grounding technique available today.
When to Choose Fine-Tuning
Fine-tuning earns its cost when RAG can't solve the problem.
✅ You need to change how the model behaves, not just what it knows
RAG adds context. Fine-tuning changes behavior. If you need the model to consistently write in your brand's voice, follow a specific output schema every time, or reason like a domain expert — fine-tuning is the lever.
Good fine-tuning use cases:
- Consistent brand/tone across all outputs
- Domain-specific reasoning (medical diagnosis, legal analysis)
- Structured output compliance (always return valid JSON schema)
- Code generation in your internal framework/style
- Language localization (dialect, formality level)
✅ You have a well-defined, stable task
Fine-tuning excels at narrow, repeated tasks with clear right answers. Classify this support ticket. Extract these fields from this document. Convert this natural language query to SQL.
✅ Latency and cost matter at scale
RAG requires an embedding call + vector search + generation. Fine-tuning requires only generation. At very high volume (millions of queries/day), that difference matters. Fine-tuned smaller models can also match GPT-4 quality on narrow tasks at a fraction of the cost.
```python
# Fine-tuned gpt-4o-mini for SQL generation
# vs. RAG + gpt-4o for same task
# Cost difference at 1M queries/day: ~$800/day vs ~$120/day
```
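A back-of-envelope model makes the gap easy to reproduce. The token counts and per-token price below are illustrative assumptions, not quoted rates, chosen so the arithmetic lands near the figures above; plug in your provider's current pricing:

```python
def daily_cost(queries_per_day: int, tokens_per_query: int, price_per_1m_tokens: float) -> float:
    """Rough daily spend: total tokens processed times the per-token price."""
    return queries_per_day * tokens_per_query / 1_000_000 * price_per_1m_tokens

# Assumption: the RAG path sends ~2,000 tokens/query (long retrieved context),
# the fine-tuned model ~300 tokens/query (short prompt, no injected context).
rag_daily = daily_cost(1_000_000, tokens_per_query=2_000, price_per_1m_tokens=0.40)  # 800.0
ft_daily = daily_cost(1_000_000, tokens_per_query=300, price_per_1m_tokens=0.40)     # 120.0
```

The driver here is context length: retrieval multiplies the tokens every single query pays for.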
✅ You have high-quality labeled examples (50+)
Fine-tuning requires (prompt, ideal_response) pairs. If you've already logged thousands of correct interactions, or have domain experts who can label examples, that's the signal you need.
✅ The task requires reasoning patterns, not facts
Teaching a model how to think about a problem (legal reasoning, medical differential diagnosis, financial analysis frameworks) is better done through fine-tuning than RAG. You're not injecting facts — you're adjusting the reasoning process.
When to Use Both
The most powerful production systems combine both. This is called Fine-Tuned RAG or Retrieval-Augmented Fine-Tuning (RAFT).
| Fine-tuning handles | RAG handles |
|---|---|
| Output format | Current facts |
| Domain reasoning style | Specific document retrieval |
| Consistent tone | Source citation |
| Task-specific behavior | Knowledge updates |
Real-world example, Cursor (AI code editor):
- Fine-tuned on code understanding, editing patterns, and diff formats
- RAG over your local codebase for file-specific context

Real-world example, medical AI assistant:
- Fine-tuned on clinical reasoning patterns and medical note formats
- RAG over current drug databases, clinical guidelines, and patient records
Implementation pattern:
```python
def fine_tuned_rag_query(user_question: str) -> str:
    # Step 1: Retrieve relevant context (RAG) -- retrieve_context is your
    # existing retrieval pipeline (embed, search, join top-k chunks)
    context = retrieve_context(user_question)

    # Step 2: Query the fine-tuned model with the retrieved context
    response = client.chat.completions.create(
        model="ft:gpt-4o-mini:your-org:your-model-id",  # fine-tuned model
        messages=[
            {"role": "system", "content": DOMAIN_SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_question}"}
        ]
    )
    return response.choices[0].message.content
```
Cost & Complexity Analysis
RAG Cost Profile
| Component | One-time | Ongoing |
|---|---|---|
| Embedding documents | $5–50 (1M tokens) | Per update |
| Vector DB hosting | — | $70–700/mo (Pinecone) or free (self-hosted) |
| Inference (per query) | — | ~$0.002–0.01/query |
| Engineering setup | 2–5 days | Low maintenance |
Fine-Tuning Cost Profile
| Component | One-time | Ongoing |
|---|---|---|
| Data preparation | 1–4 weeks | Per retrain |
| Training run | $50–500 (small model) | Per retrain |
| Evaluation | 1–2 weeks | Per retrain |
| Inference (per query) | — | ~30–50% cheaper than base |
| Engineering setup | 3–8 weeks | Medium maintenance |
Break-even rule of thumb: Fine-tuning starts making financial sense when you have >500K queries/month on a well-defined task, AND the task is stable enough that you won't need frequent retraining.
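That rule of thumb is simple enough to encode directly; the threshold below is the post's heuristic, not a hard number:

```python
def fine_tuning_makes_sense(queries_per_month: int, task_is_stable: bool) -> bool:
    """Break-even heuristic: high volume AND a stable task, per the rule above."""
    return queries_per_month > 500_000 and task_is_stable
```

Either condition failing sends you back to RAG: low volume means the training investment never amortizes, and an unstable task means constant retraining erases the inference savings.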
Decision Framework
```
Is your knowledge dynamic (changes weekly or more)?
└─ Yes → RAG

Do you need source citations?
└─ Yes → RAG

Is your primary problem behavior/style/reasoning consistency?
└─ Yes → Fine-Tuning

Do you have 50+ high-quality labeled (prompt, response) pairs?
└─ No → RAG (you're not ready for fine-tuning)
└─ Yes → Fine-Tuning is viable

Is latency/cost critical at >500K queries/month?
└─ Yes → Consider Fine-Tuning or Fine-Tuned RAG

Are you still iterating on the product?
└─ Yes → RAG (faster to change)
└─ No, task is stable → Fine-Tuning

Do you need both domain behavior AND current knowledge?
└─ Yes → Fine-Tuned RAG
```
Default recommendation: Start with RAG. It's faster, cheaper to iterate, and solves 80% of use cases. Add fine-tuning only when you have a stable task, quality training data, and a clear gap that RAG can't close.
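The decision tree above can be condensed into a helper function. The field names and branch ordering are my interpretation of the framework; treat the output as a starting recommendation, not a verdict:

```python
def recommend(dynamic_knowledge: bool, needs_citations: bool,
              behavior_problem: bool, labeled_examples: int,
              high_volume: bool, still_iterating: bool) -> str:
    """Condensed decision tree: returns 'RAG', 'Fine-Tuning', or 'Fine-Tuned RAG'."""
    needs_rag = dynamic_knowledge or needs_citations or still_iterating
    tuning_viable = behavior_problem and labeled_examples >= 50
    if needs_rag:
        # Need current knowledge AND domain behavior -> combine both
        return "Fine-Tuned RAG" if tuning_viable else "RAG"
    if tuning_viable or (high_volume and labeled_examples >= 50):
        return "Fine-Tuning"
    return "RAG"  # the safe default
```

Example: a fast-iterating docs bot with no labeled data comes out as RAG; a stable text-to-SQL task with hundreds of labeled pairs comes out as Fine-Tuning.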
Implementation Quickstart
RAG in 30 minutes (Chroma + OpenAI)
```bash
pip install chromadb openai
```

```python
import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("knowledge-base")

def add_documents(docs: list[str]):
    embeddings = [
        client.embeddings.create(model="text-embedding-3-small", input=d).data[0].embedding
        for d in docs
    ]
    collection.add(
        documents=docs,
        embeddings=embeddings,
        ids=[f"doc_{i}" for i in range(len(docs))]
    )

def query(question: str) -> str:
    q_emb = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    results = collection.query(query_embeddings=[q_emb], n_results=3)
    context = "\n".join(results["documents"][0])
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only from the context provided."},
            {"role": "user", "content": f"Context: {context}\n\nQ: {question}"}
        ]
    )
    return resp.choices[0].message.content
```
Fine-Tuning Checklist
- [ ] Collect 50–1000 (prompt, ideal_response) examples
- [ ] Ensure examples cover edge cases, not just easy ones
- [ ] Format as JSONL with a `messages` array (system, user, assistant)
- [ ] Hold out 10–20% as validation set
- [ ] Run fine-tune job (OpenAI, Together AI, or self-hosted with Axolotl)
- [ ] Evaluate on validation set — compare to base model
- [ ] A/B test in production with 5–10% traffic split
- [ ] Set up retraining pipeline for when data drifts
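The validation-split step from the checklist can be sketched in a few lines. This is a minimal version; a production split might also stratify by task type or deduplicate near-identical examples:

```python
import random

def split_dataset(examples: list, holdout_frac: float = 0.15, seed: int = 42):
    """Shuffle once with a fixed seed, then hold out ~15% for validation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * holdout_frac))
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)

train, val = split_dataset(list(range(100)))
```

Fix the seed so retrains are comparable: evaluating against a different holdout each run makes before/after numbers meaningless.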
Common Mistakes
RAG mistakes:
- Chunks too large: 500–1000 tokens per chunk is usually optimal. Larger chunks dilute relevance.
- No metadata filtering: always filter by date, category, or source before vector search.
- Skipping re-ranking: use a cross-encoder to re-rank retrieved chunks before passing them to the LLM.
- Ignoring chunking strategy: sentence-based chunking often beats fixed-size for prose documents.
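Sentence-based chunking is simple to sketch: pack whole sentences until a size budget is hit, so no chunk ever cuts a sentence in half. This version counts words as a stand-in for tokens; a real pipeline would use a tokenizer:

```python
import re

def chunk_by_sentence(text: str, max_words: int = 150) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for s in sentences:
        words = len(s.split())
        # Flush the current chunk if adding this sentence would exceed the budget
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(s)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because boundaries always fall between sentences, each chunk stays self-contained, which tends to improve both embedding quality and the relevance of retrieved context.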
Fine-tuning mistakes:
- Too little data: under 50 examples rarely produces meaningful improvement.
- Low-quality examples: 100 excellent examples beat 1,000 mediocre ones. Every time.
- Forgetting catastrophic forgetting: fine-tuning can degrade general capability. Test broadly, not just on your task.
- No evaluation set: without held-out validation, you can't tell if fine-tuning actually helped.
- Fine-tuning when prompt engineering would suffice: try a well-crafted few-shot prompt first. You might not need fine-tuning at all.
Resources
- OpenAI Fine-Tuning Guide
- LangChain RAG Documentation
- RAFT Paper — RAG + Fine-Tuning combined
- Axolotl — Open-source fine-tuning framework
- LlamaIndex — Production RAG framework
- Ragas — Evaluate your RAG pipeline
- Together AI — Affordable fine-tuning API
Found this useful? ⭐ Star the repo and share it with your team. Have a use case or mistake I missed? Open an issue or submit a PR.