When we built Scripto — a GCSE language dictation practice app — the core loop sounded simple: generate a sentence, read it aloud, let the student write it on paper, take a photo, mark it. Five steps. Straightforward.

The last step turned out to be anything but.

The problem: real handwriting doesn't behave

Marking a dictation exercise by machine means comparing what a student wrote against the original sentence, word by word. The obvious approach is positional: compare word at index 0, then index 1, then index 2. If they match, it's correct. If not, it's wrong.

This breaks immediately the moment a student makes a realistic mistake.

Consider a French sentence: parce que tu es fatigué. A student, writing quickly on paper, might write parceque tu es fatigué — running the two words together. Now the written token count is 4, but the target token count is 5. Every word from that point on is off by one position. The student got three words right and merged two into one — but a positional comparison marks every word in the sentence as wrong.

Or a student skips a word entirely. Same effect: a cascade of false errors for every word that follows.

These aren't edge cases. They're the most common things students do under pressure.
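The cascade is easy to reproduce. Here is a minimal sketch of the positional approach described above (the function name is hypothetical, not Scripto's actual code):

```python
def mark_positionally(target: str, written: str) -> list[tuple[str, bool]]:
    """Naive marker: compare word i of the student's answer to word i of the target."""
    target_words = target.split()
    written_words = written.split()
    return [
        (expected, i < len(written_words) and written_words[i] == expected)
        for i, expected in enumerate(target_words)
    ]

# One merged word shifts every later position, so all five target words
# come back marked wrong even though the student wrote three of them correctly.
marks = mark_positionally("parce que tu es fatigué", "parceque tu es fatigué")
print(marks)
```

Running this marks every single word as incorrect: the merge at position 0 poisons the whole comparison.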

Why we couldn't just fix it in code

Classic sequence alignment (Levenshtein distance, Smith-Waterman, etc.) is usually applied at the character level rather than the token level — and either way, it doesn't understand language. It can tell you that parceque is close to parce and que combined, but it can't tell you why, what the error means linguistically, or how to explain it to a 15-year-old student in plain English.
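To make that concrete, a plain character-level edit distance (a sketch of the standard dynamic-programming formulation) scores parceque against parce que as a single edit, the missing space, but the number carries no linguistic information about the merge:

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance: counts insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        prev = curr
    return prev[-1]

# Distance 1: the strings are nearly identical, but the score alone can't
# say that the "why" is two French words run together.
print(levenshtein("parceque", "parce que"))  # 1
```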

We also needed the assessment to handle:

  • Handwriting that GPT-4o has to read from a photo before it can even begin marking
  • Accents — é vs e is a real error in French; capitalisation is not
  • Explanations that are linguistically meaningful: "missing accent — should be é not e", not just "wrong"
  • A per-word result structure that the UI could render with inline red highlights

Trying to orchestrate all of that with handcrafted logic would have been a maintenance nightmare and would have produced worse results than what we ended up with.

The solution: let the model handle alignment

We gave the alignment problem entirely to GPT-4o. The assessment prompt instructs the model to:

  1. Read exactly what is written in the photo — transcribe it faithfully, not charitably
  2. Align the transcription against the target sentence using sequence alignment, not positional matching
  3. Handle merged words explicitly: if a student writes parceque, flag it as two errors (one for each target word it replaced), not one
  4. Return a structured per-word array: each target word with a status (correct, incorrect, or missing) and a plain-English explanation of any error

The model handles the full alignment. It understands that parceque is a collision of parce and que, that a missing word shifts all subsequent positions, and that word count mismatches should be resolved by finding the best alignment rather than forcing a one-to-one map.

What comes back is a clean JSON structure. Our Python backend then post-processes it: capitalisation differences are suppressed in code (not left to the model's discretion), scores are calculated from the alignment counts, and the result is stored against the session.
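That post-processing step can be sketched like this — the field names and payload shape are hypothetical, since the post doesn't show the real schema:

```python
import unicodedata

def post_process(words: list[dict]) -> dict:
    """Suppress capitalisation-only errors in code, then score the alignment.

    Accent differences must survive (é vs e is a real error in French), so
    words are compared NFC-normalised and lowercased, never accent-stripped.
    """
    for w in words:
        if w["status"] == "incorrect":
            target = unicodedata.normalize("NFC", w["target"]).lower()
            written = unicodedata.normalize("NFC", w.get("written", "")).lower()
            if target == written:  # only the casing differed
                w["status"] = "correct"
                w["explanation"] = None
    correct = sum(1 for w in words if w["status"] == "correct")
    return {"words": words, "score": correct / len(words)}

result = post_process([
    {"target": "Parce", "written": "parce", "status": "incorrect",
     "explanation": "missing capital letter"},
    {"target": "que", "written": "que", "status": "correct", "explanation": None},
    {"target": "fatigué", "written": "fatigue", "status": "incorrect",
     "explanation": "missing accent: should be é not e"},
])
# The capitalisation "error" is forgiven; the accent error is not.
print(result["score"])
```

Doing the capitalisation suppression in code rather than in the prompt keeps that rule deterministic instead of subject to the model's mood.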

The vocabulary problem: 2,900 words you can't paste into a prompt

The other major engineering challenge was applying the AQA GCSE vocabulary guidelines to sentence generation without destroying prompt quality.

The official AQA Subject-Specific Vocabulary lists are enormous. French alone contains over 2,900 entries at Higher tier. Pasting that into every generation prompt would have been slow, expensive, and counterproductive — models lose focus when flooded with thousands of raw word entries.

The approach we used was to distil, not dump. We read the specifications carefully and converted them into three compact, purpose-built structures:

Tier profiles — instead of listing every allowed word, we encoded the grammar rules. French Foundation students should only see passé composé, present tense, and near future — never the subjunctive or conditional. These exclusions are now explicit in a small config object that adds a few dozen tokens to each prompt, not thousands.
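The post doesn't show the actual config object, but the idea can be sketched like this (the schema and rendering helper are illustrative assumptions, not Scripto's code):

```python
# Illustrative tier profile (hypothetical schema, not Scripto's actual config).
# Grammar rules are encoded as short lists, costing a few dozen prompt tokens
# instead of the thousands a full vocabulary dump would.
TIER_PROFILES = {
    ("french", "foundation"): {
        "allowed_tenses": ["passé composé", "present", "near future"],
        "excluded_forms": ["subjunctive", "conditional"],
    },
}

def grammar_instructions(language: str, tier: str) -> str:
    """Render a tier profile as one compact line of prompt text."""
    profile = TIER_PROFILES[(language, tier)]
    line = "Use only these tenses: " + ", ".join(profile["allowed_tenses"]) + "."
    if profile["excluded_forms"]:
        line += " Never use the " + " or ".join(profile["excluded_forms"]) + "."
    return line

print(grammar_instructions("french", "foundation"))
```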

High-frequency verb lists — drawn from the AQA core vocabulary data, used when students want to practise the most exam-critical verbs specifically.

Sampled exam word sets — the full vocabulary lists were sampled by language, tier, and topic into representative sets of ~20 words each. At runtime, 4 words are drawn at random and included in the prompt to nudge the AI toward genuine AQA vocabulary. The full list never touches the model.
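The runtime draw itself is then trivial. A sketch, where the word set is illustrative rather than the actual sampled AQA data:

```python
import random

# One pre-sampled bucket of ~20 representative words, built offline from the
# full AQA list for a (language, tier, topic) combination. Illustrative data.
SAMPLED_WORDS = {
    ("french", "higher", "travel"): [
        "voyage", "séjour", "étranger", "vol", "gare", "billet", "douane",
        "valise", "retard", "correspondance", "auberge", "paysage", "louer",
        "conduire", "traverser", "accueillir", "se détendre", "hébergement",
        "circulation", "aller-retour",
    ],
}

def draw_prompt_words(language: str, tier: str, topic: str, k: int = 4) -> list[str]:
    """Draw k words at random to nudge generation toward genuine AQA vocabulary."""
    return random.sample(SAMPLED_WORDS[(language, tier, topic)], k)

print(draw_prompt_words("french", "higher", "travel"))
```

Because `random.sample` draws without replacement, each generation prompt sees four distinct words, and successive sessions rotate through the bucket naturally.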

The guidelines shaped the system's structure. At runtime, only a focused, relevant slice of that knowledge reaches the AI.

What we shipped

Scripto generates a fresh GCSE-level sentence in French, German, or Spanish — tailored to the student's level, topic, and tense preferences. ElevenLabs reads it aloud in a natural native voice. The student writes it on paper. They photograph it. GPT-4o reads the photo, aligns the handwriting against the original, and returns a marked result with per-word highlights and plain-English error explanations. Everything is saved to history and charted over time.

Each full session — generate, speak, mark — costs under one US cent in AI fees.

The hardest part wasn't the AI. It was knowing exactly what to ask the AI to do, and what to handle ourselves.