Multi-Modal Vision-Language Search with Local Models
Introduction
I’ve got folders full of screenshots, design mockups, and asset images. Hunting for a specific image by filename or folder structure is a pain. By combining a local image encoder with my text LLM and pgvector, I can search my image library with natural language (or even do reverse image search), all on my laptop.
Why Multi-Modal Search?
- No Cloud Needed: All image and text embeddings stay local.
- Flexible Queries: Search by description, color, or even concept (“show me the mobile UI screens”).
- Unified DB: Keep image metadata and embeddings in Postgres alongside docs.
Pipeline Overview
- Image Embedding – Use a local CLIP (or similar) model to embed images.
- Text Embedding – Use the matching CLIP text encoder (rather than a text-only model like all-MiniLM-L6-v2) to embed queries in the same space as the images.
- Storage – Store both image and text embeddings in pgvector.
- Querying – Compute similarity between the query embedding and the image embeddings, retrieve the top matches.
I’ll show how I wired these pieces together.
1. Generating Image Embeddings
I picked OpenAI’s CLIP (openai/clip-vit-base-patch32) from Hugging Face, since it’s versatile for both images and text.
pip install transformers pillow torchvision
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
import torch
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
def embed_image(path):
    # Load the image and run it through CLIP's vision encoder.
    img = Image.open(path).convert('RGB')
    inputs = processor(images=img, return_tensors='pt')
    with torch.no_grad():
        outputs = model.get_image_features(**inputs)
    # Return a plain Python list (512 floats) for storage in pgvector.
    return outputs.squeeze().tolist()
# Example
img_emb = embed_image('screenshots/ui_home.png')
print(f"Image embedding length: {len(img_emb)}")
I batch-process my images/ folder and save the embeddings.
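A minimal sketch of that pass, assuming everything lives under images/ (the extension filter is just my convention); the resulting image_paths list feeds the database inserts in section 3:
from pathlib import Path

# Collect every supported image under images/ (recursively).
image_paths = [
    str(p) for p in Path('images').rglob('*')
    if p.suffix.lower() in {'.png', '.jpg', '.jpeg', '.webp'}
]
print(f"Found {len(image_paths)} images to embed")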
2. Generating Text Embeddings
For a text query or caption, I use the corresponding CLIP text encoder so the search space stays consistent.
# CLIPProcessor already bundles the CLIP tokenizer, so the same processor
# object from above handles text as well as images; no separate tokenizer needed.
def embed_text(text):
    # Tokenize the query and run it through CLIP's text encoder.
    inputs = processor(text=[text], return_tensors='pt', padding=True)
    with torch.no_grad():
        outputs = model.get_text_features(**inputs)
    return outputs.squeeze().tolist()
# Example
txt_emb = embed_text("a mobile app home screen with blue header")
print(f"Text embedding length: {len(txt_emb)}")
Both embeddings come out at length 512, and since CLIP trains its image and text encoders into a shared space, I know they’re directly comparable.
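A quick spot check I like before involving the database (not part of the pipeline itself): compute the cosine similarity between the example image and text embeddings; a matching caption should score noticeably higher than an unrelated one.
import math

# Spot check: cosine similarity between the example image and text embeddings.
dot = sum(a * b for a, b in zip(img_emb, txt_emb))
norm = math.sqrt(sum(a * a for a in img_emb)) * math.sqrt(sum(b * b for b in txt_emb))
print(f"Cosine similarity: {dot / norm:.3f}")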
3. Storing Embeddings in Postgres
I extend my documents table to store images:
ALTER TABLE documents
ADD COLUMN img_path TEXT,
ADD COLUMN img_embedding VECTOR(512);
Batch insert via Python:
for path in image_paths:
    emb = embed_image(path)
    # str() turns the Python list into the '[x, y, ...]' literal that the
    # vector column accepts (pgvector's register_vector helper is an
    # alternative that lets you pass lists directly).
    cur.execute(
        "INSERT INTO documents (img_path, img_embedding) VALUES (%s, %s);",
        (path, str(emb)),
    )
conn.commit()
I keep content NULL for image rows, and vice versa for text-only docs.
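One optional extra once the table fills up: an approximate index so the <-> search doesn’t scan every row. A minimal sketch using pgvector’s ivfflat index (the index name and lists value are just my picks; tune them to the collection size):
# Optional: IVFFlat index for faster nearest-neighbor search.
# vector_l2_ops matches the <-> (L2 distance) operator used below.
cur.execute(
    "CREATE INDEX IF NOT EXISTS documents_img_embedding_idx "
    "ON documents USING ivfflat (img_embedding vector_l2_ops) "
    "WITH (lists = 100);"
)
conn.commit()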
4. Querying the Multi-Modal Store
Text Query: Embed text, then query image embeddings:
-- SQL
SELECT img_path
FROM documents
WHERE img_embedding IS NOT NULL
ORDER BY img_embedding <-> '[...query emb...]'::vector
LIMIT 5;
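From Python, I run the same thing by passing the embedding as a parameter (a sketch reusing the txt_emb from earlier):
# Text-to-image search: the CLIP text embedding is the query vector.
cur.execute(
    "SELECT img_path FROM documents WHERE img_embedding IS NOT NULL "
    "ORDER BY img_embedding <-> %s::vector LIMIT 5;",
    (str(txt_emb),),
)
print([r[0] for r in cur.fetchall()])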
Image Query (Reverse Search): Embed the input image and query for similar ones:
# Reverse image search: embed the query image and rank stored images by distance.
q_emb = embed_image('query.png')
cur.execute(
    "SELECT img_path FROM documents WHERE img_embedding IS NOT NULL "
    "ORDER BY img_embedding <-> %s::vector LIMIT 5;",
    (str(q_emb),),
)
results = [r[0] for r in cur.fetchall()]
I wrap both in a Python function search_images(query, by='text', k=5) so I can play in the REPL.
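That wrapper is just glue around the pieces above; a sketch of roughly what it looks like (treat it as an outline, not my exact code):
def search_images(query, by='text', k=5):
    # Embed the query with the matching CLIP encoder, then rank images by L2 distance.
    emb = embed_text(query) if by == 'text' else embed_image(query)
    cur.execute(
        "SELECT img_path FROM documents WHERE img_embedding IS NOT NULL "
        "ORDER BY img_embedding <-> %s::vector LIMIT %s;",
        (str(emb), k),
    )
    return [r[0] for r in cur.fetchall()]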
Wrapping Up
Now I just type search_images("dark mode login screen") in my console, and it spits back paths to matching screenshots. Next, I’ll build a tiny Streamlit app so I can browse the images in a web UI.