0.964: ICC(3,1) consistency, vs. ≥ 0.75 (Koo & Li)
0.883: QWK on letter grades, vs. ≥ 0.70 (AES standard)
96%: within-1-point agreement, 808 of 840 items
600+: exams graded, across 380 students
01
Product FERPA

Grading that was slow, inconsistent, and a privacy risk

The_problem

As a Graduate Student Instructor at Haas, I was grading 100+ handwritten exams a term: slow, inconsistent across a stack, and impossible to standardize by hand. The obvious fix, an off-the-shelf AI grader, meant uploading identifiable student work to a third-party tool, which is exactly what FERPA-conscious instructors can't do. The user here is the instructor, and the real constraint wasn't accuracy, it was privacy: any solution had to keep student identities on the instructor's own machine.

Key_decisions

Decompose the rubric instead of reaching for a bigger model. LLMs degrade on holistic grading as a rubric gets more granular; Rubrica splits each rubric into structured per-question partial-credit tiers, which mitigates that degradation (a publishable finding, with a strong negative correlation between granularity and holistic accuracy, r = −0.882). A related discipline fell out of it: models reliably start dropping instructions past roughly 30 directives, so prompts are split into multi-step pipelines rather than one giant prompt.
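A minimal sketch of what a decomposed rubric and the directive split could look like. The class names, field names, and the exact limit of 30 are illustrative assumptions, not Rubrica's actual schema.

```python
from dataclasses import dataclass, field

# "Roughly 30" in the text; treated here as a tunable assumption.
MAX_DIRECTIVES_PER_STEP = 30


@dataclass
class Tier:
    points: float
    description: str


@dataclass
class Criterion:
    """One independently scored rubric item (hypothetical structure)."""
    question: str
    name: str
    tiers: list[Tier]
    common_errors: list[str] = field(default_factory=list)

    def directives(self) -> list[str]:
        # One directive per partial-credit tier, plus one per known error pattern.
        lines = [
            f"{self.question} / {self.name}: award {t.points:g} pts if {t.description}"
            for t in self.tiers
        ]
        lines += [
            f"{self.question} / {self.name}: watch for {e}" for e in self.common_errors
        ]
        return lines


def split_into_steps(directives, limit=MAX_DIRECTIVES_PER_STEP):
    """Chunk a flat directive list so no single prompt exceeds the limit."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [directives[i:i + limit] for i in range(0, len(directives), limit)]
```

With this shape, a rubric that expands to 65 directives becomes three pipeline steps of 30, 30, and 5 directives rather than one oversized prompt.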

Make privacy structural, not promised. Anonymization and name-mapping happen locally; identifying data never leaves the instructor's machine, so the privacy guarantee is in the architecture rather than a vendor's policy. And whatever checks a grade is independent of what produced it: a primary Claude grader, an audit model from a different family that re-checks a stratified sample, and escalation to a stronger cross-family model on genuinely ambiguous items, so no single model's blind spot goes unchecked.
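The stratified audit sample can be sketched as follows. The text does not say what the strata are; this example assumes stratification by question, and the function name, the 20% fraction, and the seed are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_audit_sample(items, stratum_of, fraction=0.2, seed=0):
    """Pick a share of items from every stratum so no stratum goes unaudited.

    `stratum_of` maps an item to its stratum key (assumed here: the question).
    """
    rng = random.Random(seed)  # seeded so an audit run can be reproduced
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)

    sample = []
    for key in sorted(strata):
        group = strata[key]
        # At least one item per stratum, never more than the stratum holds.
        k = min(len(group), max(1, round(len(group) * fraction)))
        sample.extend(rng.sample(group, k))
    return sample
```

Sampling within each stratum, rather than across the whole pool, keeps rarely attempted questions from dropping out of the audit by chance.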

How_it_works

Rubrica reads scanned exams, anonymizes each one with a random ID, and grades against a rubric that has been decomposed into structured per-question partial-credit tiers with common error patterns and expected methods. Each criterion is scored independently by Claude Sonnet 4.6's vision API, then mapped back to the original student locally so identifying data never leaves the instructor's machine.
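The anonymize, grade, and re-identify round trip might look like the sketch below. The function names and payload fields are assumptions for illustration; the point is only that the mapping stays in local memory or storage and the outbound payload carries the random ID alone.

```python
import secrets


def assign_anonymous_ids(student_keys):
    """Return {anon_id: student_key}. This mapping is meant to stay local."""
    mapping = {}
    for key in student_keys:
        anon_id = secrets.token_hex(8)
        while anon_id in mapping:  # vanishingly unlikely, but cheap to guard
            anon_id = secrets.token_hex(8)
        mapping[anon_id] = key
    return mapping


def build_grading_payload(anon_id, page_images):
    """What would be sent to the grading model: no name, no student ID."""
    return {"exam_id": anon_id, "pages": page_images}


def reidentify(scores_by_anon_id, mapping):
    """Join scores back to students locally, after grading has finished."""
    return {mapping[anon_id]: score for anon_id, score in scores_by_anon_id.items()}
```

Because the random ID is not derived from the name, nothing in the payload can be reversed to a student without the local mapping.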

After grading, the system generates class performance diagnostics that surface which concepts the cohort missed, where partial credit clustered, and which rubric items showed the largest cross-model disagreement, useful signal for calibrating the next iteration.
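One way to compute those three signals per rubric item is sketched here. The record fields (`criterion`, `max_points`, `primary`, `audit`) are hypothetical, and `audit` is `None` for items outside the audited sample.

```python
from collections import defaultdict
from statistics import mean


def class_diagnostics(records):
    """Summarize each rubric item; weakest concepts are listed first."""
    by_criterion = defaultdict(list)
    for r in records:
        by_criterion[r["criterion"]].append(r)

    report = []
    for name, rows in by_criterion.items():
        audited = [r for r in rows if r.get("audit") is not None]
        report.append({
            "criterion": name,
            # Average share of available points the cohort earned.
            "mean_fraction": mean(r["primary"] / r["max_points"] for r in rows),
            # Share of answers that landed strictly between zero and full credit.
            "partial_credit_share": sum(
                0 < r["primary"] < r["max_points"] for r in rows
            ) / len(rows),
            # Mean absolute gap between primary and audit scores, if audited.
            "mean_disagreement": (
                mean(abs(r["primary"] - r["audit"]) for r in audited)
                if audited else None
            ),
        })
    report.sort(key=lambda row: row["mean_fraction"])
    return report
```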

Validation

Rubrica has been validated across microeconomics, finance, and data & decisions courses at UC Berkeley Haas. The most recent independent inter-rater audit covered 30 exams (840 scored items) from an undergraduate economics course. Each exam was re-graded by o4-mini as a cross-family reference scorer, so that shared training data and architectural biases are unlikely to inflate agreement. The production model reached an ICC(3,1) of 0.964 and a QWK of 0.883, both well above the established psychometric thresholds of 0.75 and 0.70 respectively (Koo & Li 2016; Williamson et al. 2012). Mean absolute error was 0.15 points per question. The audit also found a small, consistent generosity bias of +0.64 points per exam; a boundary re-grading safeguard limits its effect on letter grades by automatically triggering a second independent grading pass on any exam scoring within ±1.5% of the 90/80/70/60 letter-grade cutoffs.
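For readers unfamiliar with the agreement metrics, here is a small from-scratch sketch of quadratic weighted kappa and within-1 agreement. It is not the audit code itself, and it assumes letter grades have already been mapped to integer categories (for example F=0 through A=4).

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_categories):
    """QWK = 1 - sum(w * observed) / sum(w * expected), w = (i - j)^2 / (k - 1)^2."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and the same length")
    k, n = n_categories, len(rater_a)

    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    numerator = denominator = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            numerator += weight * observed[i][j]
            # Expected count under independence of the two raters.
            denominator += weight * row_totals[i] * col_totals[j] / n
    if denominator == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1 - numerator / denominator


def within_one_agreement(scores_a, scores_b):
    """Share of items where the two scores differ by at most one point."""
    hits = sum(abs(a - b) <= 1 for a, b in zip(scores_a, scores_b))
    return hits / len(scores_a)
```

The quadratic weight penalizes a two-grade miss four times as heavily as a one-grade miss, which is why QWK is the usual choice for ordinal letter grades. The within-1 figure reported above is simply 808 / 840 ≈ 0.962.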

Safeguards

Beyond boundary re-grading, the pipeline runs MC double-read verification on zero-scored multiple-choice items, contradiction detection between feedback and scores, feedback specificity enforcement on vague comments, handwriting confidence flagging for ambiguous pages, and a review-flag gate that prevents finalization until an instructor dismisses each unresolved flag. A pre-safeguard raw-score snapshot is preserved for apples-to-apples cross-family auditability.
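Two of these safeguards are simple enough to sketch: the boundary trigger and the review-flag gate. The class and function names are hypothetical, and the ±1.5% band is read here as 1.5 percentage points around each cutoff, which is an assumption about the original wording.

```python
from dataclasses import dataclass, field

LETTER_CUTOFFS = (90, 80, 70, 60)
BOUNDARY_BAND = 1.5  # assumed to mean percentage points


class UnresolvedFlagsError(RuntimeError):
    """Raised when finalization is attempted with open review flags."""


@dataclass
class ReviewFlag:
    kind: str            # e.g. "boundary", "contradiction", "handwriting"
    detail: str
    dismissed: bool = False


@dataclass
class GradedExam:
    exam_id: str
    percent: float
    flags: list[ReviewFlag] = field(default_factory=list)
    finalized: bool = False


def needs_boundary_regrade(percent, cutoffs=LETTER_CUTOFFS, band=BOUNDARY_BAND):
    """True when a score sits close enough to a cutoff to warrant a second pass."""
    return any(abs(percent - cutoff) <= band for cutoff in cutoffs)


def finalize(exam):
    """Refuse to finalize until an instructor has dismissed every flag."""
    open_flags = [f for f in exam.flags if not f.dismissed]
    if open_flags:
        kinds = ", ".join(f.kind for f in open_flags)
        raise UnresolvedFlagsError(f"{exam.exam_id}: unresolved flags ({kinds})")
    exam.finalized = True
    return exam
```

Making the gate raise, rather than warn, means a flagged exam cannot be finalized by accident in a batch run.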

Distribution

Finished grades don't stop at a spreadsheet. Rubrica generates a per-student PDF report (score card, question breakdown, rubric-anchored feedback) and emails it from the instructor's own Gmail through a one-click Connect Gmail button, with no Google Cloud Console or developer account to set up. Mail sends locally from the instructor's account, so student PII never routes through a server; delivery is tracked per recipient, and there's a disconnect-and-revoke path.
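As a rough illustration of the send step, the Gmail API's `users.messages.send` method accepts a full RFC 2822 message, base64url-encoded, in a `raw` field. The sketch below builds that body with the standard library only; the function name and arguments are hypothetical, and the actual OAuth handshake and send call are left out.

```python
import base64
from email.message import EmailMessage


def build_gmail_body(sender, recipient, subject, text, pdf_bytes, pdf_name):
    """Build the request body for users.messages.send with one PDF attached."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(text)
    msg.add_attachment(
        pdf_bytes, maintype="application", subtype="pdf", filename=pdf_name
    )
    # Gmail expects URL-safe base64 of the complete MIME message.
    raw = base64.urlsafe_b64encode(msg.as_bytes()).decode("ascii")
    return {"raw": raw}

# With google-api-python-client and an authorized `service` object, the send
# itself would be roughly:
#   service.users().messages().send(userId="me", body=body).execute()
# (not executed here; it needs the instructor's OAuth credentials).
```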

Desktop_app

Rubrica now ships as a standalone macOS app: a Tauri shell wrapping the Flask backend, with templates and SQLite frozen into a single bundle. A non-technical instructor installs a .dmg and runs it like any other Mac app, with no terminal, virtualenv, or hand-edited config file. The Anthropic API key and the on-device OCR model are configured in the app's Settings on first launch.

Access

Because grades and rosters are sensitive, the app gates access behind a one-click Sign in with Google (OpenID Connect); only allowlisted identities can open a workspace. The first sign-in claims ownership, and the owner adds or removes GSIs by email from Settings, so a teaching team shares one instance without sharing credentials. ID tokens are verified (audience, issuer, expiry, verified email) before any session starts, and a nav Lock button plus a per-launch boot id force a fresh sign-in every time the app opens.
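The claim checks and allowlist gate could be sketched like this. It assumes the token's signature has already been verified by a library and that `claims` is the decoded payload; the function name and error choices are illustrative. The claim names (`aud`, `iss`, `exp`, `email`, `email_verified`) and the two issuer strings are the standard ones for Google ID tokens.

```python
import time

GOOGLE_ISSUERS = {"accounts.google.com", "https://accounts.google.com"}


def check_claims(claims, client_id, allowlist, now=None):
    """Validate decoded ID-token claims and return the allowlisted email.

    Assumes signature verification happened before this function is called.
    """
    now = time.time() if now is None else now
    if claims.get("aud") != client_id:
        raise ValueError("token was issued for a different client")
    if claims.get("iss") not in GOOGLE_ISSUERS:
        raise ValueError("unexpected issuer")
    if float(claims.get("exp", 0)) <= now:
        raise ValueError("token has expired")
    if claims.get("email_verified") is not True:
        raise ValueError("email address is not verified")

    email = str(claims.get("email", "")).lower()
    if email not in {entry.lower() for entry in allowlist}:
        raise PermissionError(f"{email} is not on the allowlist")
    return email
```

Checking the allowlist last means a malformed or expired token is rejected before its email is ever compared against the roster of permitted identities.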

Stack
Python / Flask, Claude Sonnet 4.6, o4-mini (audit), Ollama (local PII), SQLite, Tauri desktop, Google OAuth, Gmail API, FERPA