Projects

A selection of things I’ve built, mostly at the intersection of AI engineering, security, and backend infrastructure. Almost all of them started as a problem I had personally and turned into something useful enough to keep around.


DemoFlow — AI-powered interactive product demos

Live at www.demoflow.dev

Upload screenshots; Claude Vision writes the captions and hotspot positions; share a link, or export a self-contained HTML file that works offline forever with no DemoFlow dependency. Branching flows, not linear slideshows; password-protected demos; per-screen analytics.

Stack: Python · FastAPI · React · TypeScript · Tailwind · Supabase (Postgres + Auth + Storage + RLS) · Stripe (subscriptions, webhooks, customer portal) · Resend · Railway · Vercel · Docker · Claude Haiku 4.5 Vision

What’s interesting about it:

  • Self-contained HTML export with inlined CSS + vanilla JS — XSS-safe, no CDN, works in ten years with no DemoFlow servers running.
  • Full REST API + zero-dependency Node.js CLI, so you can drive it from a CI pipeline.
  • Row-level security in Postgres, API-key auth (SHA-256 hashed), bcrypt passwords, rate limiting per IP and per user.
  • Parallel uploads with a two-pass Pillow image compression pipeline.
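The API-key scheme above can be sketched in a few lines. This is an illustration, not the DemoFlow code: the server stores only a SHA-256 digest of the key, and verification compares digests in constant time. The `df_` prefix is a made-up convention.

```python
import hashlib
import secrets

def issue_api_key() -> tuple[str, str]:
    """Generate a key to hand to the client; persist only its SHA-256
    digest, so a database leak never exposes usable keys."""
    key = "df_" + secrets.token_urlsafe(32)
    digest = hashlib.sha256(key.encode()).hexdigest()
    return key, digest

def verify_api_key(presented: str, stored_digest: str) -> bool:
    """Hash the presented key and compare in constant time to avoid
    timing side-channels."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return secrets.compare_digest(digest, stored_digest)
```

The point of hashing rather than encrypting is that verification never needs the plaintext back, so there is no key-management problem.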

EvalPriv — Self-hosted AI gateway with PII interception

A proxy you drop in front of your LLM traffic. One-line SDK change and every prompt leaving your network gets scanned for PII — emails, phone numbers, SSNs, Luhn-validated credit cards, ISO-13616-validated IBANs — with configurable redact / block / log modes per type.
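The Luhn validation mentioned above is the standard mod-10 checksum that separates real card numbers from random digit runs. A minimal sketch, in Python for illustration (the actual gateway is Ruby):

```python
def luhn_valid(number: str) -> bool:
    """Mod-10 (Luhn) check: double every second digit from the right,
    subtract 9 from any result over 9, and require the sum % 10 == 0."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # shorter than any real card number
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```

Running every 13-to-19-digit candidate through this before raising a PII alert cuts false positives dramatically.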

Stack: Ruby on Rails 8.1 (API-only) · React/Vite (zero UI deps, custom dark theme) · SQLite · Active Record Encryption (AES-256-GCM) · Solid Queue · Docker

Other things it does:

  • Live feed of every request with model, latency, tokens, cost, PII alerts.
  • Quick Eval: send a prompt to N models simultaneously, compare responses side by side.
  • Quality judge: appoint any model as judge, auto-score responses 1–10 with reasoning.
  • Supports OpenAI, Anthropic, Gemini, any OpenAI-compatible endpoint, and Ollama locally.
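Quick Eval’s fan-out is conceptually simple. A hedged sketch (the real service is Rails with Solid Queue jobs; `call` here is a stand-in for whatever provider client you have):

```python
from concurrent.futures import ThreadPoolExecutor

def quick_eval(prompt, models, call):
    """Send the same prompt to N models in parallel and collect the
    responses keyed by model, ready for side-by-side comparison."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {m: pool.submit(call, m, prompt) for m in models}
        return {m: f.result() for m, f in futures.items()}
```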

Why I built it: AI governance is the concern security teams in regulated industries actually raise. I wanted a concrete artefact showing what the controls could look like in practice.


ModelArena — Head-to-head AI model benchmarking

Pairwise benchmarking for LLMs. Every model pair competes on identical prompts; A/B labels are randomly swapped per comparison to kill position bias in the judge. The output is a colour-coded win-rate matrix: who beats whom on each task category.
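The label swap is the core trick: the judge never knows which model is really “A”, so any positional preference it has averages out over many comparisons. A sketch, with `call` and `judge` standing in for the real model and judge clients:

```python
import random

def judge_pair(prompt, model_a, model_b, call, judge, swapped=None):
    """Run one A/B comparison. `swapped` is normally a coin flip; it
    can be forced for testing. The judge returns "A" or "B" for the
    positions it saw, which we map back to the real models."""
    if swapped is None:
        swapped = random.random() < 0.5
    first, second = (model_b, model_a) if swapped else (model_a, model_b)
    resp_first = call(first, prompt)
    resp_second = call(second, prompt)
    verdict = judge(prompt, resp_first, resp_second)  # "A" or "B"
    return first if verdict == "A" else second
```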

Stack: Ruby on Rails 8.1 · React/Vite · SQLite · Active Record Encryption · Solid Queue · Docker

Design choices worth noting:

  • A and B are called in parallel per comparison (~2× faster than sequential).
  • Error disqualification: a model that errors can never “win” — the working response wins automatically.
  • Five built-in task suites (Code Gen, Code Review, Debugging, System Design, Technical Explanation) plus custom prompts added from the UI.
  • Switch prediction: before swapping models in production, see the expected quality impact from historical data.
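Error disqualification reduces to a small decision function. A sketch, assuming responses are dicts with `error` and `text` fields (my assumption for illustration, not the actual schema):

```python
def decide_winner(resp_a, resp_b, judge):
    """A model that errored can never win: the working response takes
    the round automatically; if both failed, there is no winner and
    the judge is never consulted."""
    a_ok = resp_a.get("error") is None
    b_ok = resp_b.get("error") is None
    if a_ok and not b_ok:
        return "A"
    if b_ok and not a_ok:
        return "B"
    if not a_ok and not b_ok:
        return None
    return judge(resp_a["text"], resp_b["text"])
```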

AI-Augmented Zettelkasten — PKM with a Claude co-pilot

My personal knowledge base, Obsidian-backed, with Claude wired in as an integrated thinking partner through a custom MCP server. Deliberate separation of concerns: the AI handles retrieval and consistency at scale; the human handles emergent discovery through the graph view.

Stack: Obsidian · Claude API · MCP server integration · structured markdown schema · custom prompt engineering

What the MCP layer actually does:

  • Gives Claude live vault access for consistency checking across atomic notes.
  • Suggests bidirectional link candidates on new notes.
  • Surfaces cross-domain idea collisions — the moment Zettelkasten was invented for.
  • Encodes a maturity progression model (seedling → evergreen) so notes have a lifecycle.
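The maturity progression is essentially a tiny state machine. A sketch showing only the two stages named above (the real model likely has intermediate stages, elided here):

```python
from enum import Enum

class Maturity(Enum):
    SEEDLING = "seedling"    # new, unrefined note
    EVERGREEN = "evergreen"  # stable, well-linked note

ORDER = [Maturity.SEEDLING, Maturity.EVERGREEN]

def promote(stage: Maturity) -> Maturity:
    """Advance a note one lifecycle stage; evergreen is terminal."""
    i = ORDER.index(stage)
    return ORDER[min(i + 1, len(ORDER) - 1)]
```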

Resume Tailor — AI pipeline for targeted résumés

A local CLI that reads a master profile, fetches a job description from any URL (including JS-rendered ATS platforms), runs a keyword gap analysis, and generates a tailored DOCX + PDF with an ATS keyword coverage report.

Stack: Python · Anthropic Claude API · python-docx · Playwright · httpx

This one writes the résumés I actually send out. The Claude call is the interesting part — it’s a careful prompt that preserves factual accuracy while reshaping emphasis toward the role.
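The keyword gap analysis can be sketched as a bag-of-words diff: terms that appear prominently in the job description but nowhere in the résumé. A naive illustration, not the pipeline itself, which would want better tokenisation and a proper stop-word list:

```python
import re
from collections import Counter

def keyword_gap(job_description: str, resume_text: str, top_n: int = 20):
    """Rank terms frequent in the job description but absent from the
    résumé, most frequent first."""
    stop = {"and", "the", "to", "of", "a", "in", "with", "for", "on", "is", "we", "you", "our"}

    def terms(text):
        # Lowercase tokens of 3+ chars, allowing c++, c#, node.js-style names.
        return [w for w in re.findall(r"[a-z][a-z+#.\-]{2,}", text.lower()) if w not in stop]

    jd = Counter(terms(job_description))
    have = set(terms(resume_text))
    gaps = [(w, n) for w, n in jd.most_common() if w not in have]
    return gaps[:top_n]
```

The output feeds the ATS coverage report: each gap term is a candidate for the tailored résumé, provided the master profile can honestly support it.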


What’s next

I have half-finished projects around agent observability, autonomous security-research pipelines, and a small language for describing AI evaluation rubrics. If any of them mature into something useful, they’ll show up here — and get a write-up on the blog.

Want to talk about any of this? I’m on GitHub and LinkedIn, or just email me.
