Document Summarization Pipelines with Local LLMs
Introduction
I’m putting together this pipeline to help me quickly turn any long document, whether it’s a PDF report, a collection of articles, or internal meeting notes, into a concise summary I can skim in seconds. Instead of relying on external APIs (which can be slow, costly, or raise privacy concerns), I want everything to run locally on my own machine. This way I know exactly where my data lives, how fast summaries are generated, and I don’t have to worry about hitting API limits or leaking confidential info.
Why This Approach Works for Me
- Privacy First: All text and embeddings stay on disk or in my local database.
- Predictable Latency: No network hops; summaries come back in seconds.
- Cost Control: I pay only for electricity and hardware, not per-token API fees.
Overview of the Pipeline
I’ve broken the summarization flow into four main steps. Treat this like a recipe: follow each step in sequence, and you’ll end up with a neat summary at the end:
- Text Chunking – Break up big blobs of text into bite-sized pieces that fit into the LLM’s context window.
- Embedding Generation – Turn each chunk into a numerical vector that captures its meaning.
- Retrieval – Find the chunks most relevant to the summary task by doing a nearest-neighbor lookup in a vector store.
- Generation – Feed those top chunks into my local summarization model and stitch together the final summary.
I’ll go through each step with code snippets and tips.
1. Text Chunking
Why: LLMs can only process so many tokens at once. Trying to summarize a 10,000‑word document in one go will either get cut off or produce garbage.
My Strategy: I split the text into overlapping windows, say 1,000 tokens each, with a 200‑token overlap so I don’t lose context at chunk boundaries.
# Personal chunking function (relies on the Hugging Face tokenizer loaded in step 2 below)
def chunk_text(text, max_tokens=1000, overlap=200):
    tokens = tokenizer.tokenize(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens - overlap):
        window = tokens[i : i + max_tokens]
        window_text = tokenizer.convert_tokens_to_string(window)
        chunks.append(window_text)
    return chunks

# Example usage:
raw_text = open('long_report.txt', encoding='utf-8').read()
my_chunks = chunk_text(raw_text)
print(f"Created {len(my_chunks)} chunks.")
A quick sanity check I do: print the first two chunks to verify they overlap correctly.
for i, c in enumerate(my_chunks[:2]):
    print(f"--- Chunk {i+1} ---\n{c[:200]}...\n")
2. Embedding Generation
Why: Embeddings let me compare textual similarity in vector space.
My Tool: I use the all-MiniLM-L6-v2 model from sentence-transformers because it’s fast and reasonably accurate.
from transformers import AutoTokenizer, AutoModel
import torch
# Load tokenizer and model once
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model.eval()  # inference only; turns off dropout
# Function to embed a list of chunks
def embed_chunks(chunks):
    inputs = tokenizer(chunks, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)  # simple mean pooling over token positions
    return embeddings
# Generate embeddings for my chunks
embs = embed_chunks(my_chunks)
print("Embeddings shape:", embs.shape)
I usually check embs.shape to confirm I have one vector per chunk. If something feels off, I’ll visualize a few with t-SNE later.
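When I do want that visual check, a rough 2-D projection is enough. Here’s a quick sketch using scikit-learn’s t-SNE and matplotlib (both are extra dependencies, and the perplexity value is just a placeholder that has to stay below the number of chunks):
# Quick t-SNE sanity check: semantically similar chunks should land near each other.
# scikit-learn and matplotlib are assumed to be installed; parameters are illustrative.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

points = TSNE(
    n_components=2,
    perplexity=min(5, len(my_chunks) - 1),  # t-SNE requires perplexity < number of samples
    init='random',
    random_state=42,
).fit_transform(embs.numpy())

plt.scatter(points[:, 0], points[:, 1])
for idx, (x, y) in enumerate(points):
    plt.annotate(str(idx), (x, y))  # label each point with its chunk index
plt.title("t-SNE of chunk embeddings")
plt.show()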
3. Retrieval via Vector Search
Why: I don’t need all chunks, just the ones most relevant to summarizing.
My Setup: I keep embeddings in a PostgreSQL table using the pgvector extension. That way, I can run SQL queries and keep everything in one place.
-- Enable the extension (one-time setup)
CREATE EXTENSION IF NOT EXISTS vector;
-- Table schema
CREATE TABLE IF NOT EXISTS docs (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding VECTOR(384)
);
-- Insert embeddings (from Python side or psql)
INSERT INTO docs (content, embedding)
VALUES
    ('first chunk...', '[0.12, -0.45, ...]'::vector),
    ('second chunk...', '[0.03, 0.27, ...]'::vector);
-- Retrieve top 3 nearest chunks for a query embedding
SELECT content
FROM docs
ORDER BY embedding <-> '[0.12, -0.45, ...]'::vector
LIMIT 3;
From Python I call this via psycopg2 or asyncpg, passing my query vector to fetch back the 3 most similar text chunks.
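Here’s a rough sketch of what that looks like on the Python side with psycopg2; the DSN, the store_chunks() helper, and the get_top_chunks() helper (which the generation step below leans on) are my own placeholder names, so adapt them to your setup:
# Hypothetical helpers for the pgvector round-trip; assumes psycopg2 is installed
# and a local database reachable via the DSN below.
import psycopg2

DSN = "dbname=docs_db user=me"  # placeholder connection string

def to_vector_literal(embedding):
    # pgvector accepts a bracketed, comma-separated string literal
    return "[" + ",".join(f"{x:.6f}" for x in embedding.tolist()) + "]"

def store_chunks(chunks, embeddings):
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        for content, emb in zip(chunks, embeddings):
            cur.execute(
                "INSERT INTO docs (content, embedding) VALUES (%s, %s::vector)",
                (content, to_vector_literal(emb)),
            )

def get_top_chunks(query_embedding, k=3):
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM docs ORDER BY embedding <-> %s::vector LIMIT %s",
            (to_vector_literal(query_embedding), k),
        )
        return [row[0] for row in cur.fetchall()]
As for the query embedding itself, one simple option is to run a short description of what I’m after (e.g. "key points and conclusions") through the same embed_chunks() function, so the query and the chunks live in the same vector space.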
4. Generation with Local LLM
Why: Now that I have the most relevant context, I want the model to weave them into a coherent summary.
My Model: I usually go with facebook/bart-large-cnn locally; it’s solid for summarization out of the box.
from transformers import pipeline
# Initialize summarizer once
generator = pipeline(
    'summarization',
    model='facebook/bart-large-cnn',
    device=0  # GPU; use device=-1 (or omit) to run on CPU
)
# Summarize retrieved chunks
def summarize_chunks(chunks):
    # Note: BART's input window is roughly 1,024 tokens, so very long joined prompts may need trimming
    prompt = "\n\n".join(chunks)
    summary = generator(
        prompt,
        max_length=250,  # tweak as needed
        min_length=50,
        do_sample=False
    )
    return summary[0]['summary_text']
# Pull it all together
top_chunks = get_top_chunks(my_query_embedding, k=3)
final_summary = summarize_chunks(top_chunks)
print("My Summary:\n", final_summary)
A quick trick I use: if the summary feels too short or skips details, I increase max_length or feed 5 chunks instead of 3.
Wrapping Up
Putting it all together, my summarization workflow is:
- Read document → chunk_text()
- chunks → embed_chunks()
- embeddings → store/query via pgvector
- top chunks → summarize_chunks()
This setup lives in a single Python file for me, so I can run python summarize.py report.pdf
and instantly get a summary printed or saved. Over time, I’ll tweak chunk sizes, embedding models, and summarization parameters to fit different document types. For now, this feels like a solid foundation for private, fast, and free (from API fees) summaries.
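For reference, here’s roughly what that single summarize.py entry point could look like, reusing the functions defined above; the pypdf-based PDF extraction and the query string are assumptions on my part rather than fixed parts of the pipeline:
# Rough end-to-end sketch of summarize.py; pypdf for PDF text extraction and the
# query string below are illustrative choices, not requirements.
import sys
from pypdf import PdfReader

def load_text(path):
    # Plain-text files are read directly; PDFs go through pypdf
    if path.lower().endswith('.pdf'):
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    return open(path, encoding='utf-8').read()

if __name__ == "__main__":
    path = sys.argv[1]
    raw_text = load_text(path)

    my_chunks = chunk_text(raw_text)              # step 1: chunk
    embs = embed_chunks(my_chunks)                # step 2: embed
    store_chunks(my_chunks, embs)                 # step 3a: store in pgvector

    # step 3b: retrieve, using a short query embedded in the same vector space
    query_emb = embed_chunks(["key points and conclusions of this document"])[0]
    top_chunks = get_top_chunks(query_emb, k=3)

    final_summary = summarize_chunks(top_chunks)  # step 4: generate
    print("My Summary:\n", final_summary)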