Synthetic Data Generation with Local LLMs
Introduction
I sometimes don't have enough labeled data to train or test ML models, whether I'm missing rare edge cases or a balanced class distribution. To fill the gaps, I use a local LLM to generate synthetic data: text samples, structured records, or paired examples for tasks like classification and QA. Here's how I set it up.
Why Generate Synthetic Data Locally
- Data Control: I decide what patterns to augment.
- Privacy: No sensitive patterns leave my machine.
- Cost Efficiency: Generate thousands of samples with no API billing.
Pipeline Overview
1. Define Templates & Prompts – Write prompts that describe the data you need.
2. Generation Loop – Use a local LLM to produce samples in batches.
3. Post-Processing & Validation – Filter or validate outputs (regex checks, simple classifiers).
4. Integration with Training Data – Merge synthetic samples with real data for model training.
I’ll expand on each step with code and my own notes on pitfalls.
1. Defining Templates & Prompts
Depending on data type:
- Text Classification: Prompt to generate examples labeled with a category:
Prompt: "Generate 5 examples of customer support emails categorized as {{ label }}. Output as JSON list of {\"text\":..., \"label\":...}."
- Structured Data: Generate CSV rows:
Prompt: "Produce 10 CSV rows with columns 'age', 'income', 'purchased' where 'purchased' is Yes or No, realistic distributions."
- QA Pairs: Generate question-answer pairs for a given context:
Prompt: "Based on the text: '{{ context }}', generate 5 QA pairs as JSON."
Writing clear instructions upfront saves me debugging cycles.
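To keep templates reusable across labels, I fill them in programmatically. Here's a minimal sketch using plain str.format; the template text and the label set are illustrative, not tied to a particular dataset:

```python
# Illustrative template; doubled braces escape the literal JSON braces.
TEMPLATE = (
    "Generate 5 examples of customer support emails categorized as {label}. "
    'Output as JSON list of {{"text": ..., "label": ...}}.'
)

def build_prompts(labels):
    """Render one prompt per target label."""
    return [TEMPLATE.format(label=label) for label in labels]

prompts = build_prompts(["Billing", "Shipping", "Returns"])
print(prompts[0])
```

For anything more involved (conditionals, loops over few-shot examples), a templating library like Jinja2 is the natural upgrade, which is why I write the placeholders as {{ label }} from the start.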
2. Generation Loop
I batch prompts to generate large volumes.
from transformers import pipeline

generator = pipeline('text-generation', model='gpt2-medium', device=0)

def gen_data(prompt, batch_size=5):
    outputs = generator(
        prompt,
        max_length=200,
        num_return_sequences=batch_size,
        do_sample=True,
        top_p=0.95,
    )
    return [o['generated_text'] for o in outputs]
# Example: classification examples
prompt = "Generate 5 examples of product reviews labeled Positive."
samples = gen_data(prompt, batch_size=5)
print(samples)
I play with sampling parameters (top_p, temperature) to get diverse outputs.
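A quick way to tell whether a parameter setting is producing variety is to measure the share of unique outputs in a batch. This helper is my own sketch, not part of transformers:

```python
def unique_ratio(samples):
    """Fraction of samples that are unique after trivial normalization."""
    if not samples:
        return 0.0
    normalized = {s.strip().lower() for s in samples}
    return len(normalized) / len(samples)

batch = ["Great product!", "great product! ", "Terrible quality."]
print(unique_ratio(batch))  # 2 unique out of 3
```

If the ratio drops near 1/batch_size, I raise temperature or top_p; if outputs turn into gibberish, I lower them.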
3. Post-Processing & Validation
Generated text can be messy. I run quick filters:
import json

# For classification, enforce presence of the label field
def validate_json_samples(samples):
    valid = []
    for s in samples:
        try:
            data = json.loads(s)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and 'label' in data:
            valid.append(data)
    return valid

cleaned = validate_json_samples(samples)
print(f"Kept {len(cleaned)}/{len(samples)} samples.")
For numeric fields, I check with regex or type casts to ensure data integrity.
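As a sketch of those regex and type-cast checks, here's how I'd validate rows for the CSV prompt from step 1 (the column names and sanity bounds are illustrative):

```python
import re

# One row: age (1-3 digits), income (number), purchased (Yes/No).
ROW_RE = re.compile(r'^\d{1,3},\d+(\.\d+)?,(Yes|No)$')

def validate_csv_rows(rows):
    valid = []
    for row in rows:
        row = row.strip()
        if not ROW_RE.match(row):
            continue
        age, income, purchased = row.split(',')
        if 0 < int(age) < 120:  # sanity-check the cast value too
            valid.append({'age': int(age),
                          'income': float(income),
                          'purchased': purchased})
    return valid

rows = ["34,52000,Yes", "not,a,row", "999,100,No"]
print(validate_csv_rows(rows))  # only the first row survives
```

The regex rejects malformed rows outright; the cast-and-range check catches rows that parse but are implausible (an age of 999 matches the pattern yet fails the bound).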
4. Integration with Training Data
I append synthetic records to my real dataset, then shuffle and retrain.
import pandas as pd
real_df = pd.read_csv('real_data.csv')
synth_df = pd.DataFrame(cleaned)
combined = pd.concat([real_df, synth_df])
combined = combined.sample(frac=1).reset_index(drop=True)
combined.to_csv('augmented_data.csv', index=False)
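One habit worth adding here: tag each row with its provenance before concatenating, so synthetic samples can be dropped or down-weighted later if they hurt. A small sketch with toy in-memory DataFrames:

```python
import pandas as pd

# Toy stand-ins for the real and synthetic datasets.
real_df = pd.DataFrame({'text': ['real review'], 'label': ['Positive']})
synth_df = pd.DataFrame({'text': ['synthetic review'], 'label': ['Positive']})

# Record where each row came from before merging.
real_df['source'] = 'real'
synth_df['source'] = 'synthetic'

combined = pd.concat([real_df, synth_df], ignore_index=True)
print(combined['source'].value_counts().to_dict())
```

With the source column in place, ablating the synthetic portion is a one-line filter instead of a re-run of the whole pipeline.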
I monitor model performance on a held-out real validation set to ensure synthetic data helps, not hurts.
Wrapping Up
That’s my go-to synthetic data workflow: template, generate, validate, and integrate. Next up: building a private knowledge management system backed by vector search and local LLMs.