Introduction

I sometimes lack enough labeled data to train or test ML models, whether that means rare edge-case examples or a balanced class distribution. To fill the gaps, I use a local LLM to generate synthetic data: text samples, structured records, or paired examples for tasks like classification and QA. Here’s how I set it up.

Why Generate Synthetic Data Locally

  • Data Control: I decide what patterns to augment.
  • Privacy: No sensitive patterns leave my machine.
  • Cost Efficiency: Generate thousands of samples with no API billing.

Pipeline Overview

  1. Define Templates & Prompts – Write prompts that describe the data you need.
  2. Generation Loop – Use a local LLM to produce samples in batches.
  3. Post-Processing & Validation – Filter or validate outputs (regex checks, simple classifiers).
  4. Integration with Training Data – Merge synthetic samples with real data for model training.

I’ll expand on each step with code and my own notes on pitfalls.


1. Defining Templates & Prompts

Prompts vary by data type:

  • Text Classification: Prompt to generate examples labeled with a category:
Prompt: "Generate 5 examples of customer support emails categorized as {{ label }}. Output as JSON list of {\"text\":..., \"label\":...}."
  • Structured Data: Generate CSV rows:
Prompt: "Produce 10 CSV rows with columns 'age', 'income', 'purchased' where 'purchased' is Yes or No, realistic distributions."
  • QA Pairs: Generate question-answer pairs for a given context:
Prompt: "Based on the text: '{{ context }}', generate 5 QA pairs as JSON." 

Writing clear instructions upfront saves me debugging cycles.
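To fill the {{ label }} and {{ context }} slots in those templates, I use a tiny render helper rather than a full templating library. A minimal sketch (the helper and the label values here are my own illustrations, not part of any specific library):

```python
# Hypothetical template-rendering helper: fills {{ key }} slots in a prompt.
TEMPLATE = (
    'Generate 5 examples of customer support emails categorized as '
    '{{ label }}. Output as JSON list of {"text":..., "label":...}.'
)

def render(template, **slots):
    """Replace each {{ key }} placeholder with its value."""
    for key, value in slots.items():
        template = template.replace('{{ ' + key + ' }}', str(value))
    return template

# One rendered prompt per label; the labels are illustrative.
prompts = [render(TEMPLATE, label=lbl) for lbl in ('billing', 'shipping', 'returns')]
print(prompts[0])
```

Swapping in jinja2 works the same way if the templates grow more complex; plain string replacement is enough for flat slots like these.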


2. Generation Loop

I batch prompts to generate large volumes.

from transformers import pipeline

generator = pipeline('text-generation', model='gpt2-medium', device=0)  # device=0 = first GPU; use device=-1 for CPU

def gen_data(prompt, batch_size=5):
    # Sample batch_size completions for one prompt; nucleus sampling
    # (top_p) keeps outputs varied without going fully random.
    outputs = generator(
        prompt,
        max_length=200,              # token budget for prompt + completion
        num_return_sequences=batch_size,
        do_sample=True,
        top_p=0.95
    )
    return [o['generated_text'] for o in outputs]

# Example: classification examples
prompt = "Generate 5 examples of product reviews labeled Positive."  
samples = gen_data(prompt, batch_size=5)
print(samples)

I play with sampling parameters (top_p, temperature) to get diverse outputs.
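When I want to compare settings side by side rather than tweak them ad hoc, I build a small grid of sampling configurations first. A sketch of that idea (sampling_grid and the specific values are my own; the generator call is stubbed out as a comment since each config dict just passes through as keyword arguments):

```python
from itertools import product

def sampling_grid(temperatures=(0.7, 1.0, 1.3), top_ps=(0.9, 0.95)):
    """Build one kwargs dict per (temperature, top_p) combination."""
    return [
        {'do_sample': True, 'temperature': t, 'top_p': p}
        for t, p in product(temperatures, top_ps)
    ]

for cfg in sampling_grid():
    print(cfg)
    # samples = generator(prompt, num_return_sequences=5, **cfg)
```

Eyeballing a handful of samples per config is usually enough to pick a setting before generating at volume.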


3. Post-Processing & Validation

Generated text can be messy. I run quick filters:

import json

# For classification, parse the JSON and enforce presence of a label field
def validate_json_samples(samples):
    valid = []
    for s in samples:
        try:
            data = json.loads(s)
        except json.JSONDecodeError:
            continue
        # Prompts ask for a JSON list, but models sometimes emit a single object
        records = data if isinstance(data, list) else [data]
        valid.extend(r for r in records if isinstance(r, dict) and 'label' in r)
    return valid

cleaned = validate_json_samples(samples)
print(f"Kept {len(cleaned)}/{len(samples)} samples.")

For numeric fields, I check with regex or type casts to ensure data integrity.
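For the structured-data prompt above, that check might look like this. A minimal sketch (validate_row and the thresholds are my own; the column names come from the CSV prompt earlier):

```python
import re

AGE_RE = re.compile(r'^\d{1,3}$')  # 1-3 digit ages only

def validate_row(row):
    """Reject generated records with malformed numeric or categorical fields."""
    if not AGE_RE.match(str(row.get('age', ''))):
        return False
    try:
        income = float(row['income'])     # type cast doubles as a format check
    except (KeyError, ValueError, TypeError):
        return False
    if income < 0:
        return False
    return row.get('purchased') in ('Yes', 'No')

rows = [
    {'age': '34', 'income': '52000', 'purchased': 'Yes'},
    {'age': 'thirty', 'income': '52000', 'purchased': 'No'},  # bad age
]
print([validate_row(r) for r in rows])  # [True, False]
```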


4. Integration with Training Data

I append synthetic records to my real dataset, then shuffle and retrain.

import pandas as pd

real_df = pd.read_csv('real_data.csv')
synth_df = pd.DataFrame(cleaned)

combined = pd.concat([real_df, synth_df])
combined = combined.sample(frac=1).reset_index(drop=True)
combined.to_csv('augmented_data.csv', index=False)

I monitor model performance on a held-out real validation set to ensure synthetic data helps, not hurts.
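In practice that means training twice and comparing on the same real validation set. A hedged sketch of the comparison (compare_augmentation is my own helper, and train_and_eval is a placeholder for whatever model and metric you use):

```python
def compare_augmentation(train_and_eval, real_train, synth, real_val):
    """Train on real data alone and on real + synthetic, then compare
    scores on the same held-out real validation set."""
    baseline = train_and_eval(real_train, real_val)
    augmented = train_and_eval(real_train + synth, real_val)
    return {
        'baseline': baseline,
        'augmented': augmented,
        'keep_synthetic': augmented >= baseline,  # drop synth data if it hurts
    }
```

If the augmented score drops, I go back and tighten the validation filters or rebalance how much synthetic data I mix in.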


Wrapping Up

That’s my go-to synthetic data workflow: template, generate, validate, and integrate. Next up: building a private knowledge management system backed by vector search and local LLMs.