Introduction

I’ve found Retrieval-Augmented Generation (RAG) to be a game-changer for making my local LLMs feel smarter—especially when dealing with niche documents or domain-specific corpora. Instead of expecting the base model to memorize every detail, I let it retrieve the most relevant passages on the fly and generate answers grounded in that context. Here’s how I set up my own RAG pipeline end-to-end.

Why I Use RAG

  • Updated Knowledge: I can swap in new documents or data anytime, without retraining the model.
  • Smaller Models Work: Even smaller LLMs shine when they have up-to-date context at generation time.
  • Explainability: I can see which snippets the model uses to answer, which helps build trust.

Pipeline Components

  1. Document Ingestion & Chunking – Load data (PDFs, markdown, etc.) and break into chunks.
  2. Embedding Storage – Generate embeddings and store them in a vector DB (pgvector, FAISS, etc.).
  3. Retriever – Given a query, find top-k relevant chunks.
  4. LLM Generator – Pass query + retrieved chunks to the local LLM for response.

1. Document Ingestion & Chunking

I usually start by writing a simple script to walk a directory of files, extract text (e.g., with pdfplumber for PDFs, markdown package for MD), and then chunk each file.

import os
from pdfplumber import open as pdf_open
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def extract_text_from_pdf(path):
    text = []
    with pdf_open(path) as pdf:
        for page in pdf.pages:
            # extract_text() can return None on image-only pages
            text.append(page.extract_text() or "")
    return "\n".join(text)
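
# chunk_text: a minimal sketch that splits text into overlapping windows measured
# in tokens of the embedding model's tokenizer loaded above; the window size and
# overlap are assumptions I tune per corpus.
def chunk_text(text, max_tokens=256, overlap=32):
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    start = 0
    while start < len(token_ids):
        window = token_ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
        start += max_tokens - overlap
    return chunks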

# Walk directory and chunk
all_chunks = []
for root, dirs, files in os.walk('docs/'):
    for fname in files:
        if fname.endswith('.pdf'):
            text = extract_text_from_pdf(os.path.join(root, fname))
        elif fname.endswith('.md'):
            with open(os.path.join(root, fname), encoding="utf-8") as f:
                text = f.read()
        else:
            continue
        chunks = chunk_text(text)
        all_chunks.extend(chunks)
print(f"Total chunks from ingestion: {len(all_chunks)}")

I keep chunks in a list for quick testing, then batch-insert to my vector store.


2. Embedding Storage

Why: Keeping chunks and embeddings in a dedicated store means I can rebuild the embeddings whenever I swap embedding models, without redoing ingestion.
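
The snippet below calls an embed_chunks helper and uses an open database cursor, neither of which I showed above. Here is a minimal sketch, assuming a sentence-transformers model and a pgvector-enabled Postgres database (the model name, connection string, and docs table schema are placeholders for my setup):

from sentence_transformers import SentenceTransformer
import psycopg2
from pgvector.psycopg2 import register_vector

# Same model family as the tokenizer used for chunking
embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def embed_chunks(chunks, batch_size=64):
    # Returns a (num_chunks, embedding_dim) numpy array
    return embed_model.encode(chunks, batch_size=batch_size, show_progress_bar=True)

# Placeholder connection; docs is assumed to be (content TEXT, embedding VECTOR(384))
conn = psycopg2.connect("dbname=rag user=rag")
register_vector(conn)  # lets numpy arrays be passed directly as pgvector values
cur = conn.cursor()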

# Continuing from the ingestion step above
embeddings = embed_chunks(all_chunks)

# For pgvector + psycopg2
for chunk, emb in zip(all_chunks, embeddings.tolist()):
    cur.execute(
        "INSERT INTO docs (content, embedding) VALUES (%s, %s);",
        (chunk, emb)
    )
conn.commit()
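
Since I batch-insert in practice, a single batched call with psycopg2's execute_values is noticeably faster than the loop above for large corpora; a sketch against the same docs table:

from psycopg2.extras import execute_values

# Insert all (content, embedding) pairs in batched statements instead of one INSERT per chunk
execute_values(
    cur,
    "INSERT INTO docs (content, embedding) VALUES %s",
    list(zip(all_chunks, embeddings.tolist()))
)
conn.commit()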

If I switch to FAISS, I’d do something like:

import faiss
import numpy as np

d = embeddings.shape[1]
index = faiss.IndexFlatL2(d)
index.add(np.asarray(embeddings, dtype='float32'))  # FAISS expects float32 input
# Save index
faiss.write_index(index, 'faiss.index')
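
For the query side of the FAISS variant, a rough sketch (it assumes all_chunks, or some ID-to-text mapping, is persisted alongside the index, and reuses the embed_model from the sketch above):

# Load the index and search; ids are row positions into all_chunks
index = faiss.read_index('faiss.index')
q_vec = np.asarray(
    embed_model.encode(["What are the benefits of hosting locally?"]), dtype='float32'
)
distances, ids = index.search(q_vec, 3)
faiss_ctx = [all_chunks[i] for i in ids[0]]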

3. Retriever

Why: I want a quick way to go from a user query to context chunks. My trick: cache the top-k chunk IDs in a tiny SQLite LRU so repeated queries skip the vector search (see the sketch at the end of this section).
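
retrieve below leans on a get_embedding helper I haven't shown; a one-line sketch that reuses the embed_model from section 2 and returns a single vector rather than a batch:

def get_embedding(text):
    # Encode one string; returns a 1-D numpy array matching the stored vectors
    return embed_model.encode([text])[0]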

def retrieve(query, k=5):
    q_emb = get_embedding(query)
    # For pgvector
    cur.execute(
        "SELECT content FROM docs ORDER BY embedding <-> %s::vector LIMIT %s;",
        (q_emb, k)
    )
    results = [row[0] for row in cur.fetchall()]
    return results

# Example
ctx = retrieve("What are the benefits of hosting locally?", k=3)
print("Retrieved contexts:", ctx)

I found k=3 works well, but sometimes k=5 gives extra context at a small latency cost.
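
As for the SQLite LRU trick mentioned at the top of this section, here is a rough sketch: it caches retrieved contexts keyed by (k, query) and evicts the least recently used rows past a fixed size. The table layout, the size limit, and caching whole contexts rather than chunk IDs are simplifications on my part.

import json
import sqlite3

cache_db = sqlite3.connect('retrieval_cache.db')
cache_db.execute(
    "CREATE TABLE IF NOT EXISTS cache (query_key TEXT PRIMARY KEY, contexts TEXT, used_at REAL)"
)

def cached_retrieve(query, k=5, max_rows=256):
    query_key = f"{k}:{query}"
    row = cache_db.execute(
        "SELECT contexts FROM cache WHERE query_key = ?", (query_key,)
    ).fetchone()
    if row:
        # Cache hit: bump recency and return the stored contexts
        cache_db.execute(
            "UPDATE cache SET used_at = julianday('now') WHERE query_key = ?", (query_key,)
        )
        cache_db.commit()
        return json.loads(row[0])
    results = retrieve(query, k)
    cache_db.execute(
        "INSERT OR REPLACE INTO cache VALUES (?, ?, julianday('now'))",
        (query_key, json.dumps(results))
    )
    # Evict the least recently used rows once the cache grows past max_rows
    cache_db.execute(
        "DELETE FROM cache WHERE query_key NOT IN "
        "(SELECT query_key FROM cache ORDER BY used_at DESC LIMIT ?)",
        (max_rows,)
    )
    cache_db.commit()
    return results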


4. LLM Generator

Finally, I merge the query with retrieved contexts in a prompt template and feed it to my local LLM.

from transformers import pipeline

gen = pipeline('text-generation', model='gpt2', device=0)

prompt_template = '''
You are an AI assistant with access to the following context:

Context:
{context}

Question: {query}
Answer in a concise, factual manner.
'''  

def generate_answer(query, context_chunks):
    context = "\n\n".join(context_chunks)
    prompt = prompt_template.format(context=context, query=query)
    # max_new_tokens bounds the answer length even when the prompt is long;
    # return_full_text=False keeps the echoed prompt out of the output
    result = gen(prompt, max_new_tokens=200, do_sample=False, return_full_text=False)
    return result[0]['generated_text'].strip()

answer = generate_answer("How does pgvector integrate with Postgres?", ctx)
print(answer)

I’ve experimented with few-shot examples in the prompt, but keeping the prompt example-free reduces prompt-engineering overhead.


Wrapping Up

My RAG pipeline now lives as a single CLI script. I can drop new docs into docs/, rerun ingestion, and my local assistant immediately knows about them—no retraining needed. Next, I’ll look at combining RAG with a simple Flask web interface so I can interact with it in the browser.