Private Knowledge Management Systems with Local LLMs
Introduction
I’ve got a ton of scattered notes—scientific journals, personal research notes, bookmarked articles—and it’s a hassle to dig through folders or old notebooks. I decided to build a private knowledge management (PKM) system backed by local LLMs and vector search so I can query my own knowledge base like a chat. Everything stays on my laptop, and updates are instant.
Why a Local PKM System
- Ownership & Privacy: My thoughts and confidential project notes never leave my machine.
- Instant Recall: No more losing track of where I saved that snippet on Docker configs.
- Conversational Access: I can ask natural-language questions and get concise answers or pointers.
Pipeline Overview
- Note Ingestion & Preprocessing – Pull in text notes, markdown files, PDFs, and clean them.
- Chunking & Embedding – Break notes into chunks and generate embeddings locally.
- Storage in Vector Store – Use pgvector or FAISS to index embeddings.
- Query Interface – Build a simple CLI or web UI that embeds queries, retrieves top chunks, and uses an LLM to synthesize answers.
I’ll outline each of these steps with code snippets and my personal tips.
1. Note Ingestion & Preprocessing
Why: Raw notes come in different formats, so the first step is normalizing everything to plain text before chunking.
import os
from pdfplumber import open as pdf_open

def extract_text(path):
    if path.endswith('.md'):
        with open(path, encoding='utf-8') as f:
            return f.read()
    elif path.endswith('.pdf'):
        with pdf_open(path) as pdf:
            # extract_text() returns None for image-only pages, so fall back to ''
            return "\n".join(page.extract_text() or '' for page in pdf.pages)
    return ''
# Walk through my notes directory
docs = []
for root, dirs, files in os.walk('my_notes/'):
    for file in files:
        text = extract_text(os.path.join(root, file))
        if text:
            docs.append((file, text))

print(f"Ingested {len(docs)} documents.")
I also run simple regex rules to strip out TODOs or personal reminders I don’t want in the KM system.
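As a rough illustration of those rules, something like the helper below drops TODO lines and anything tagged #private before chunking; the exact patterns are placeholders here and depend entirely on your own note conventions.

import re

def strip_private(text):
    # Example patterns only: drop checklist/TODO lines and lines tagged #private
    cleaned = []
    for line in text.splitlines():
        if re.match(r'\s*(?:- \[ \]|TODO[:\s])', line):
            continue
        if '#private' in line:
            continue
        cleaned.append(line)
    return "\n".join(cleaned)

docs = [(name, strip_private(text)) for name, text in docs]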
2. Chunking & Embedding
Why: Chunking helps fit content into context windows and makes retrieval precise.
from transformers import AutoTokenizer, AutoModel
import torch

# Reuse my embedding model
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

def chunk_and_embed(text, max_tokens=500, overlap=100):
    tokens = tokenizer.tokenize(text)
    embeds = []
    contents = []
    for i in range(0, len(tokens), max_tokens - overlap):
        chunk = tokenizer.convert_tokens_to_string(tokens[i:i + max_tokens])
        inputs = tokenizer(chunk, return_tensors='pt', truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        # Mean-pool the token embeddings into a single 384-dim vector
        vec = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
        embeds.append(vec)
        contents.append(chunk)
    return contents, embeds
# Example for first doc
chunks, embeddings = chunk_and_embed(docs[0][1])
print(f"First doc split into {len(chunks)} chunks.")
I tag each chunk with its source filename so I can trace answers back.
3. Storage in Vector Store
Why: Fast similarity search is key for retrieval.
Using pgvector (Postgres):
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS notes (
    id SERIAL PRIMARY KEY,
    source TEXT,
    chunk TEXT,
    embedding VECTOR(384)
);
Insert via Python:
import psycopg2

conn = psycopg2.connect(...)
cur = conn.cursor()
for src, text in docs:
    chunks, embs = chunk_and_embed(text)
    for chunk, emb in zip(chunks, embs):
        cur.execute(
            "INSERT INTO notes (source, chunk, embedding) VALUES (%s, %s, %s::vector)",
            (src, chunk, str(emb))  # pgvector parses the '[x, y, ...]' string form
        )
conn.commit()
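One pgvector detail worth noting before moving on: once the table holds tens of thousands of rows, the <-> scan slows down, and an approximate ivfflat index helps. A minimal sketch, run on the same connection (the lists = 100 setting is just a starting point, not a tuned value):

# Approximate nearest-neighbour index for the <-> (L2) operator
cur.execute(
    "CREATE INDEX IF NOT EXISTS notes_embedding_idx "
    "ON notes USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);"
)
conn.commit()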
For larger scales I switch to FAISS, but pgvector works great for thousands of chunks.
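For reference, the FAISS variant is only a few lines. This is a rough sketch with an exact (flat) L2 index; all_embeddings and q_emb are placeholder names for the stacked chunk vectors and the embedded question:

import faiss
import numpy as np

# Stack every chunk embedding into a (num_chunks, 384) float32 matrix
matrix = np.array(all_embeddings, dtype='float32')
index = faiss.IndexFlatL2(384)
index.add(matrix)

# Retrieve the 5 nearest chunks for an embedded question
distances, ids = index.search(np.array([q_emb], dtype='float32'), 5)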
4. Query Interface
Why: I want to ask questions in natural language and get cohesive answers.
from transformers import pipeline

# BART is an encoder-decoder model, so it goes through the text2text pipeline
gen = pipeline('text2text-generation', model='facebook/bart-large-cnn', device=0)

def query_kb(question, k=5):
    q_emb = chunk_and_embed(question)[1][0]  # reuse embed function; a short question is a single chunk
    cur.execute(
        "SELECT chunk FROM notes ORDER BY embedding <-> %s::vector LIMIT %s;",
        (str(q_emb), k)
    )
    ctx_chunks = [r[0] for r in cur.fetchall()]
    context = "\n\n".join(ctx_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer concisely:"
    return gen(prompt, max_length=200)[0]['generated_text']
# Example
print(query_kb("How do I configure SSH agent forwarding?"))
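Since each chunk is stored with its source filename, a small variant of the same query can also report where an answer came from; a sketch reusing the same table and helpers:

def query_kb_with_sources(question, k=5):
    # Same retrieval as query_kb, but also return which files the chunks came from
    q_emb = chunk_and_embed(question)[1][0]
    cur.execute(
        "SELECT source, chunk FROM notes ORDER BY embedding <-> %s::vector LIMIT %s;",
        (str(q_emb), k)
    )
    rows = cur.fetchall()
    sources = sorted({src for src, _ in rows})
    context = "\n\n".join(chunk for _, chunk in rows)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer concisely:"
    answer = gen(prompt, max_length=200)[0]['generated_text']
    return answer, sources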
I wrap this in a CLI tool with argparse, so kb ask "..." feels like chatting.
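The wrapper itself is tiny; a minimal sketch (the kb command name assumes the script is installed as a console entry point, which is my own setup):

import argparse

def main():
    parser = argparse.ArgumentParser(prog='kb', description='Query my personal knowledge base')
    sub = parser.add_subparsers(dest='command', required=True)
    ask = sub.add_parser('ask', help='Ask a natural-language question')
    ask.add_argument('question')
    ask.add_argument('-k', type=int, default=5, help='number of chunks to retrieve')
    args = parser.parse_args()

    if args.command == 'ask':
        print(query_kb(args.question, k=args.k))

if __name__ == '__main__':
    main()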
Wrapping Up
With this PKM setup, I can finally search my own brain reliably—seeing source references and getting synthesized answers. My next step is to add document versioning so I can track how my notes evolve over time.