Multi-Modal Vision-Language Search with Local Models
Introduction
I’ve got folders full of screenshots, design mockups, and asset images. Hunting for a specific image by filename or folder structure is a pain. By combining a local image encoder with my text LLM and pgvector, I can search my image library with natural language (or even do reverse image search), all on my laptop.
Why Multi-Modal Search?
- No Cloud Needed: All image and text embeddings stay local.
- Flexible Queries: Search by description, color, or even concept (“show me the mobile UI screens”).
- Unified DB: Keep image metadata and embeddings in Postgres alongside docs.
Pipeline Overview
- Image Embedding – Use a local CLIP (or similar) model to embed images.
- Text Embedding – Use the matching CLIP text encoder (rather than a text-only model like all-MiniLM-L6-v2) to embed queries in the same space as the images.
- Storage – Store both image and text embeddings in pgvector.
- Querying – Compute similarity between the query embedding and the image embeddings, retrieve the top matches.
I’ll show how I wired these pieces together.
1. Generating Image Embeddings
I picked OpenAI’s CLIP (openai/clip-vit-base-patch32) from Hugging Face, since it’s versatile for both images and text.
pip install transformers pillow torchvision
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
import torch
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
def embed_image(path):
    # Load the image and run it through CLIP's vision encoder.
    img = Image.open(path).convert('RGB')
    inputs = processor(images=img, return_tensors='pt')
    with torch.no_grad():
        outputs = model.get_image_features(**inputs)
    # Return a plain Python list (512 floats) for storage in pgvector.
    return outputs.squeeze().tolist()
# Example
img_emb = embed_image('screenshots/ui_home.png')
print(f"Image embedding length: {len(img_emb)}")
I batch-process my images/ folder and save the embeddings.
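A minimal sketch of that pass, assuming everything lives under images/ (the extension filter is just my convention); the resulting image_paths list feeds the database inserts in section 3:
from pathlib import Path

# Collect every supported image under images/ (recursively).
image_paths = [
    str(p) for p in Path('images').rglob('*')
    if p.suffix.lower() in {'.png', '.jpg', '.jpeg', '.webp'}
]
print(f"Found {len(image_paths)} images to embed")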
2. Generating Text Embeddings
For a text query or caption, I use the corresponding CLIP text encoder so the search space stays consistent.
# CLIPProcessor already bundles the CLIP tokenizer, so the same processor
# object from above handles text as well as images; no separate tokenizer needed.
def embed_text(text):
    # Tokenize the query and run it through CLIP's text encoder.
    inputs = processor(text=[text], return_tensors='pt', padding=True)
    with torch.no_grad():
        outputs = model.get_text_features(**inputs)
    return outputs.squeeze().tolist()
# Example
txt_emb = embed_text("a mobile app home screen with blue header")
print(f"Text embedding length: {len(txt_emb)}")
Both embeddings come out at length 512, and since CLIP trains its image and text encoders into a shared space, I know they’re directly comparable.
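A quick spot check I like before involving the database (not part of the pipeline itself): compute the cosine similarity between the example image and text embeddings; a matching caption should score noticeably higher than an unrelated one.
import math

# Spot check: cosine similarity between the example image and text embeddings.
dot = sum(a * b for a, b in zip(img_emb, txt_emb))
norm = math.sqrt(sum(a * a for a in img_emb)) * math.sqrt(sum(b * b for b in txt_emb))
print(f"Cosine similarity: {dot / norm:.3f}")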
3. Storing Embeddings in Postgres
I extend my documents table to store images:
ALTER TABLE documents
ADD COLUMN img_path TEXT,
ADD COLUMN img_embedding VECTOR(512);
Batch insert via Python:
for path in image_paths:
    emb = embed_image(path)
    # str() turns the Python list into the '[x, y, ...]' literal that the
    # vector column accepts (pgvector's register_vector helper is an
    # alternative that lets you pass lists directly).
    cur.execute(
        "INSERT INTO documents (img_path, img_embedding) VALUES (%s, %s);",
        (path, str(emb)),
    )
conn.commit()
I keep content NULL for image rows, and vice versa for text-only docs.
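One optional extra once the table fills up: an approximate index so the <-> search doesn’t scan every row. A minimal sketch using pgvector’s ivfflat index (the index name and lists value are just my picks; tune them to the collection size):
# Optional: IVFFlat index for faster nearest-neighbor search.
# vector_l2_ops matches the <-> (L2 distance) operator used below.
cur.execute(
    "CREATE INDEX IF NOT EXISTS documents_img_embedding_idx "
    "ON documents USING ivfflat (img_embedding vector_l2_ops) "
    "WITH (lists = 100);"
)
conn.commit()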
4. Querying the Multi-Modal Store
Text Query: Embed text, then query image embeddings:
-- SQL
SELECT img_path
FROM documents
WHERE img_embedding IS NOT NULL
ORDER BY img_embedding <-> '[...query emb...]'::vector
LIMIT 5;
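From Python, I run the same thing by passing the embedding as a parameter (a sketch reusing the txt_emb from earlier):
# Text-to-image search: the CLIP text embedding is the query vector.
cur.execute(
    "SELECT img_path FROM documents WHERE img_embedding IS NOT NULL "
    "ORDER BY img_embedding <-> %s::vector LIMIT 5;",
    (str(txt_emb),),
)
print([r[0] for r in cur.fetchall()])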
Image Query (Reverse Search): Embed the input image and query for similar ones:
# Reverse image search: embed the query image and rank stored images by distance.
q_emb = embed_image('query.png')
cur.execute(
    "SELECT img_path FROM documents WHERE img_embedding IS NOT NULL "
    "ORDER BY img_embedding <-> %s::vector LIMIT 5;",
    (str(q_emb),),
)
results = [r[0] for r in cur.fetchall()]
I wrap both in a Python function search_images(query, by='text', k=5) so I can play in the REPL.
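That wrapper is just glue around the pieces above; a sketch of roughly what it looks like (treat it as an outline, not my exact code):
def search_images(query, by='text', k=5):
    # Embed the query with the matching CLIP encoder, then rank images by L2 distance.
    emb = embed_text(query) if by == 'text' else embed_image(query)
    cur.execute(
        "SELECT img_path FROM documents WHERE img_embedding IS NOT NULL "
        "ORDER BY img_embedding <-> %s::vector LIMIT %s;",
        (str(emb), k),
    )
    return [r[0] for r in cur.fetchall()]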
Wrapping Up
Now I just type search_images("dark mode login screen") in my console, and it spits back paths to matching screenshots. Next, I’ll build a tiny Streamlit app so I can browse the images in a web UI.