Agentic AI MOOC Fall 2025 - video 06 - 44:04

Benchmark noise and evaluation

A model looks better or worse because the benchmark is noisy, not because the agent improved.

Tags: evals, statistics, benchmarks
Predictable Noise in LLM Benchmarks by Sida Wang

Problem-first learning

The problem this lecture is trying to solve

A model looks better or worse because the benchmark is noisy, not because the agent improved.

Lowest-level failure mode

Small sample sizes, correlated tasks, grader variance, and retry policies distort conclusions.
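The sample-size part of this failure mode is quantifiable. As a rough sketch (a normal approximation for independent pass/fail tasks; the function name is illustrative, and correlated tasks shrink the effective n further, making this an underestimate):

```python
import math

def score_standard_error(pass_rate: float, n_tasks: int) -> float:
    """Standard error of a mean pass rate over n independent pass/fail tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)

# On 100 tasks, a 50% pass rate carries a ~5-point standard error,
# so two agents scoring ~10 points apart may not differ at all.
se_100 = score_standard_error(0.5, 100)    # ~0.05
se_1600 = score_standard_error(0.5, 1600)  # ~0.0125: quadrupling precision costs 16x the tasks
```

Note the square-root scaling: halving the noise requires four times as many tasks, which is why small benchmarks stay noisy.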

Frontier update

Frontier agent evals are moving toward dynamic, private, and mutation-based benchmarks because static sets saturate quickly.

Transcript-grounded route

How the lecture unfolds

This is built from 617 caption segments. Use the timestamp buttons to jump into the original video when a term feels fuzzy.

0:05-7:21

Pass 1: Benchmarks

The lecture segment repeatedly returns to benchmarks and evals. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Small sample sizes, correlated tasks, grader variance, and retry policies distort conclusions.

7:21-14:45

Pass 2: Questions and problems

The lecture segment repeatedly returns to questions, problems, and benchmarks. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Small sample sizes, correlated tasks, grader variance, and retry policies distort conclusions.

14:45-22:02

Pass 3: Variance

The lecture segment repeatedly returns to benchmarks, variance, and standard errors. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Small sample sizes, correlated tasks, grader variance, and retry policies distort conclusions.

22:02-29:24

Pass 4: Noise

The lecture segment repeatedly returns to questions, noise, and where the noise comes from. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Small sample sizes, correlated tasks, grader variance, and retry policies distort conclusions.

29:24-36:44

Pass 5: Data

The lecture segment repeatedly returns to benchmarks and data. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Small sample sizes, correlated tasks, grader variance, and retry policies distort conclusions.

36:44-44:01

Pass 6: Data

The lecture segment repeatedly returns to data, noise, and the advice to add confidence intervals and error bars to agent scores. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Small sample sizes, correlated tasks, grader variance, and retry policies distort conclusions.

Build the mental model

What you should understand after this lecture

1. Start from the bottleneck

A model looks better or worse because the benchmark is noisy, not because the agent improved. The lecture is useful because it does not treat this as a naming problem. It asks what breaks at the operational level and what design pattern removes that break.

2. Name the moving parts

The recurring vocabulary in the transcript centers on benchmarks, questions, variance, noise, and data. When studying, do not memorize these as separate buzzwords. Ask what state is stored, what action is chosen, what feedback is observed, and what verifier decides whether progress happened.

3. Convert the idea into an architecture

Add confidence intervals and error bars to agent scores. Separate model quality from scaffold, tool, and budget effects. Use repeated runs for non-deterministic agents. In exam or interview answers, this becomes a four-part answer: objective, loop, control boundary, evaluation.

4. Know the failure case

Small sample sizes, correlated tasks, grader variance, and retry policies distort conclusions. If you cannot say how the proposed system fails, the explanation is still shallow. Always include the failure it prevents and the new cost it introduces.

Concept weave

Ideas to remember

  1. Add confidence intervals and error bars to agent scores.
  2. Separate model quality from scaffold, tool, and budget effects.
  3. Use repeated runs for non-deterministic agents.
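Idea 1 can be made concrete with a Wilson score interval, which behaves better than the naive normal interval at small n and extreme pass rates. A minimal stdlib-only sketch (the function name is illustrative, not from the lecture):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a benchmark pass rate."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1.0 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# 60/100 tasks passed: the headline score is 60%, but the interval
# spans roughly 50-69%, wide enough to swallow many claimed "improvements".
low, high = wilson_interval(60, 100)
```

Reporting (low, high) next to the point estimate is exactly the "error bars on agent scores" the lecture asks for.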

Visual model

Agent system view

Use the graph to ask where the intelligence really lives: model, memory, tools, environment, verifier, or orchestration.

Written practice

Questions that make the idea stick

Drill 1: Design an eval report for an agent.
  1. Include success rate, confidence interval, cost, latency, and retries.
  2. Show failure categories.
  3. Report what changed between runs.
Drill 2: Why can leaderboard scores mislead?
  1. Benchmarks leak.
  2. Tasks become saturated.
  3. Scaffolds differ.
  4. Budgets differ.
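A minimal shape for the Drill 1 report, assuming each run yields a score, a total cost, and a latency (the field names are illustrative, not from the lecture):

```python
import statistics

def eval_report(runs: list[dict]) -> dict:
    """Aggregate repeated runs of a non-deterministic agent into one report.

    Each run is a dict with 'score' (pass rate), 'cost_usd', and 'latency_s'.
    """
    scores = [r["score"] for r in runs]
    return {
        "runs": len(runs),
        "mean_score": statistics.mean(scores),
        # Run-to-run spread: non-deterministic agents need this next to the mean.
        "score_stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "mean_cost_usd": statistics.mean(r["cost_usd"] for r in runs),
        "mean_latency_s": statistics.mean(r["latency_s"] for r in runs),
    }
```

Reporting score_stdev alongside mean_score is what separates "the agent improved" from "this run got lucky".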

Written answer pattern

How to write this under pressure

Claim: Benchmark noise and evaluation solves a concrete control problem, not just a prompt-writing problem.
Mechanism: State the loop: observe state, choose action/tool, get feedback, update memory or plan, stop using a verifier.
Why it works: It makes the hidden failure mode visible: small sample sizes, correlated tasks, grader variance, and retry policies distort conclusions.
Tradeoff: Extra orchestration improves reliability only if evaluation, cost, and authority boundaries are explicit.

Build skill

How to apply this in your own agent

  1. Write the concrete task and the failure mode before choosing any framework.
  2. Choose the smallest architecture that handles the failure: workflow, single agent, orchestrator-worker, or evaluator loop.
  3. Define tool schemas, memory boundaries, and a success checker.
  4. Run a small eval set with failure labels, cost, latency, and trace review.

Source route

Original course links and readings

Page generated from 617 YouTube captions. Raw transcript files are kept out of the public site; this page publishes study notes, timestamp routes, and paraphrased explanations.