37.7568° N
119.5966° W
UPPER YOSEMITE FALLS
ELEV 6526 FT
SHEET 04 / 04
161
Adversarial attacksacross two red-team rounds
0
Private facts leakednothing private in context
0
Fabricationsgrounded or it says it can't
0
Conversations storednothing logged or persisted
04
Case_Study RAG

An AI that talks about me, on the open internet

The_problem

A chatbot about a person is a small product with two big failure modes, and both are worse than having no chatbot at all. It can hallucinate a credential, inventing a job, a number, or a degree that a recruiter then repeats. Or, fed the rich personal notes that would make it genuinely useful, it can leak something private under a determined jailbreak. My design question was how to put a helpful assistant about myself on the open internet without exposing either flank.

Key_decisions

Make the safety structural, by data minimization. Rather than hope a model refuses to leak, the guarantee is built in: the assistant draws only from a hand-curated, public-safe corpus, and the private knowledge base that corpus was filtered from is never wired into it. So there is simply nothing private in context to leak, no matter the prompt. The same choice handles hallucination: it answers from the corpus or it says it doesn't have that detail and points you to me. The same instinct, "build the guarantee into the architecture, not the prompt", runs through SiteProof and Rubrica too.

Store nothing. No conversation content is ever logged or persisted. The only thing written down is an abuse counter keyed by a salted, expiring hash, never a raw IP or session ID, so rate limiting never becomes surveillance.

Hardening

I red-teamed it across two rounds with 161 adversarial attacks: prompt injection, attempts to extract personal details, forced-hallucination and false-premise prompts, and scope abuse, at zero leaks and zero fabrications. Anything that surfaced got fixed and re-tested until the rounds came back clean. I also ran a separate four-agent white-box pentest against the whole deployed site and closed the gaps it found. Forged "the assistant previously said X" turns, reconstructed from client memory, are tagged as untrusted so they can't smuggle in false context.

How_it_works

It's a single Vercel serverless function calling Claude with a prompt-cached corpus, so repeat reads are far cheaper, backed by Upstash Redis only for the salted abuse counters. The reply streams token by token as plain text, with URLs stripped on the way out, and the corpus is never exposed as a fetchable web asset. No framework, no build step. Small surface, few moving parts, which is itself part of the security argument.

Stack
Vercel FunctionsClaude API (prompt-cached)Upstash RedisRAG / data minimization