AI Evaluation

Score open-ended answers automatically with a large language model: how AI evaluation works per question type, how to write an effective grading prompt, and how reviewers confirm or adjust the results.

Updated 2026/07/14

AI Evaluation scores open-ended responses automatically using a large language model. You describe what a good answer looks like in a grading prompt; when a candidate submits, the AI evaluates the response against that prompt and produces a success rate plus written feedback. Reviewers can confirm or adjust every AI evaluation: the AI gives you a fast, consistent first pass, not a final verdict.

Supported Question Types

Short Answer and Long Answer — the text is evaluated directly.
Audio and Video — the AI first produces a transcription of the spoken content, then evaluates the transcription. The transcription is stored alongside the evaluation, so reviewers read it next to the recording.
Code — the AI evaluates the written code; it is aware of any pre-filled initial template and judges the candidate's own contribution.

Transcription and AI feedback follow the language configured on the question (settings → Details → Language). Set it to French and a spoken French answer is transcribed and evaluated in French.

How It Works

Enable AI Evaluation on the question and write the grading prompt.
The candidate submits their answer; the evaluation runs automatically.
The AI returns a success rate (0–100%) and written feedback explaining the score. For audio and video, a transcription is produced as well.
The success rate flows through the question's multipliers and dimensions like any other score (see Scoring).
Reviewers see the AI's score, feedback, and reasoning in the evaluation screens and can confirm or override.
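The flow above can be pictured with a small data sketch. This is only an illustration: the `AIEvaluation` class, its field names, and the `final_success_rate` helper are hypothetical and do not describe the product's actual schema or API. It shows the three outputs named in the steps (success rate, feedback, optional transcription) and the rule that a reviewer's score, when present, replaces the AI's.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AIEvaluation:
    """Hypothetical shape of one AI evaluation result (not the product's schema)."""
    success_rate: float                  # 0-100, as returned by the AI
    feedback: str                        # written explanation of the score
    transcription: Optional[str] = None  # only produced for audio and video answers


def final_success_rate(ai: AIEvaluation, reviewer_score: Optional[float] = None) -> float:
    """Return the reviewer's score when one exists, otherwise the AI's score."""
    score = ai.success_rate if reviewer_score is None else reviewer_score
    if not 0 <= score <= 100:
        raise ValueError("success rate must be between 0 and 100")
    return score


ai = AIEvaluation(success_rate=72.0, feedback="Identifies the confounder; procedure is vague.")
print(final_success_rate(ai))        # reviewer confirmed, so the AI score stands
print(final_success_rate(ai, 80.0))  # reviewer overrode, so the reviewer's score is used
```

How the resulting success rate is combined with multipliers and dimensions is covered in Scoring and is deliberately left out of this sketch.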

Writing an Effective Grading Prompt

State the criteria explicitly — what must be present for full marks, what earns partial credit, what fails. Vague prompts produce vague scores.
Anchor the scale — “100% requires X and Y; 50% if X without Y; 0% if neither” beats “grade generously”.
Include a model answer or key facts when the question has objective content — the AI grades against your reference, not its own opinion.
Say what to ignore — e.g. “do not penalize spelling” for a content-focused question, or “evaluate reasoning, not conclusion”.

Example for a long-answer question: “The answer must (1) identify the confounding variable in the study design, (2) explain why randomization fixes it, and (3) propose a concrete randomization procedure. Award ~33% per element. Do not penalize grammar. A correct identification with no explanation caps at 40%.”
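To see what an anchored scale like this implies, the arithmetic can be written out. The sketch below is one possible reading of the example prompt, not how the AI computes its score: the model interprets the prompt as natural language, and the `rubric_score` function and its parameter names are assumptions made for illustration.

```python
def rubric_score(identifies: bool, explains: bool, proposes: bool) -> float:
    """Score an answer against the three-element example prompt.

    Each element is worth an equal share (~33%). An answer that identifies
    the confounding variable but does not explain the fix is capped at 40%.
    """
    elements = [identifies, explains, proposes]
    score = 100.0 * sum(elements) / len(elements)
    if identifies and not explains:
        score = min(score, 40.0)  # the cap stated in the prompt
    return round(score, 1)


print(rubric_score(True, True, True))    # all three elements present
print(rubric_score(True, False, True))   # two elements, but the cap applies
print(rubric_score(True, False, False))  # identification only, below the cap
```

Working through the cases this way is a quick check that the prompt is unambiguous: if two reviewers would compute different numbers for the same answer, the AI is likely to be inconsistent as well.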

Reviewing AI Results

AI evaluations appear in the evaluation screens marked as AI-scored, with the feedback and (for audio/video) the transcription attached. Reviewers can accept them as-is, adjust the score, or replace it with a manual evaluation — the human decision always wins. Combine AI Evaluation with a rubric to give both the AI and human reviewers the same structured criteria.

Recommended practice: For high-stakes decisions, treat the AI score as triage: let it grade everything, then have humans verify the borderline band and a sample of the rest. You get consistency at scale without giving up accountability.
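The triage approach can be sketched as a selection step run over exported AI scores. Everything specific here is an assumption: the 40-60% borderline band, the 10% sample rate, the `select_for_review` name, and the input format are illustrative choices, not product settings.

```python
import random


def select_for_review(scores, low=40.0, high=60.0, sample_rate=0.1, seed=0):
    """Pick which AI-scored answers a human should verify.

    scores: dict mapping a candidate id to an AI success rate (0-100).
    Returns (borderline_ids, sampled_ids): everything inside the borderline
    band, plus a random sample of the remaining answers.
    """
    borderline = sorted(cid for cid, s in scores.items() if low <= s <= high)
    in_band = set(borderline)
    rest = sorted(cid for cid in scores if cid not in in_band)
    # Sample at least one answer from the rest whenever any remain.
    k = min(len(rest), max(1, round(len(rest) * sample_rate))) if rest else 0
    sampled = sorted(random.Random(seed).sample(rest, k))
    return borderline, sampled


scores = {"a": 12, "b": 45, "c": 58, "d": 91, "e": 77, "f": 60}
borderline, sampled = select_for_review(scores)
print(borderline)  # every answer scored inside the 40-60 band
print(sampled)     # a random spot check drawn from the other answers
```

A fixed seed keeps the spot-check sample reproducible, which helps when the selection itself needs to be audited.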

AI Scores for Scored Dimensions

On questions with scored dimensions, the AI returns one score and written feedback per dimension alongside the overall result. A manual evaluation overrides the AI score dimension by dimension, whereas confirming or adjusting an AI evaluation applies to the evaluation as a whole.
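The per-dimension override can be illustrated with a simple merge. This is a sketch under assumptions: the dimension names, the dictionary format, and the `effective_dimension_scores` helper are hypothetical, and it assumes that a dimension without a manual score keeps its AI score.

```python
def effective_dimension_scores(ai_scores, manual_scores):
    """Combine AI and manual scores, with the manual score winning per dimension.

    Both arguments map a dimension name to a score (0-100). Dimensions with
    no manual score keep the AI score (an assumption of this sketch).
    """
    unknown = set(manual_scores) - set(ai_scores)
    if unknown:
        raise KeyError(f"manual scores for unknown dimensions: {sorted(unknown)}")
    merged = dict(ai_scores)      # start from the AI's per-dimension scores
    merged.update(manual_scores)  # manual scores replace them where present
    return merged


ai_scores = {"Accuracy": 70, "Clarity": 55}
manual_scores = {"Clarity": 65}
print(effective_dimension_scores(ai_scores, manual_scores))
```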