Agentic AI MOOC Fall 2025 - video 08 - 1:04:37

Training agentic models

Base and chat models know language, but agentic work needs persistence, tool discipline, and recovery from failure.

post-training, curriculum, tools

Training Agentic Models by Weizhu Chen

Problem-first learning

The problem this lecture is trying to solve

Base and chat models know language, but agentic work needs persistence, tool discipline, and recovery from failure.

Lowest-level failure mode

Training must make the model choose actions under partial observability and delayed reward.

Frontier update

Agent training is moving from prompt imitation toward interaction data, verifiable outcomes, and environment mutation.

Transcript-grounded route

How the lecture unfolds

This route is built from 1,495 caption segments. Return to the matching point in the original video when a term feels fuzzy.

Pass 1: Actually

The lecture segment repeatedly returns to the words "actually", "that", "just", "data", and "what". Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Training must make the model choose actions under partial observability and delayed reward.

Pass 2: Actually

The lecture segment repeatedly returns to the words "actually", "data", "that", "just", and "synthesis". Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Training must make the model choose actions under partial observability and delayed reward.

Pass 3: Actually

The lecture segment repeatedly returns to the words "actually", "just", "grader", "that", and "data". Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Training must make the model choose actions under partial observability and delayed reward.

Pass 4: Actually

The lecture segment repeatedly returns to the words "actually", "that", "just", "able", and "different". Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Training must make the model choose actions under partial observability and delayed reward.

Pass 5: Actually

The lecture segment repeatedly returns to the words "actually", "that", "lego", "just", and "post-training". Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Training must make the model choose actions under partial observability and delayed reward.

Pass 6: That

The lecture segment repeatedly returns to the words "that", "actually", "just", "able", and "data". Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Training must make the model choose actions under partial observability and delayed reward.

Build the mental model

What you should understand after this lecture

1. Start from the bottleneck

Base and chat models know language, but agentic work needs persistence, tool discipline, and recovery from failure. The lecture is useful because it does not treat this as a naming problem. It asks what breaks at the operational level and what design pattern removes that break.

2. Name the moving parts

The recurring vocabulary in the transcript is "actually", "that", "just", "data", "very", and "able". When studying, do not memorize these as separate buzzwords. Ask what state is stored, what action is chosen, what feedback is observed, and what verifier decides whether progress happened.

3. Convert the idea into an architecture

Post-training shapes reasoning style, tool calling, and exploration. Synthetic tasks are useful only if they preserve real failure modes. Agents need curricula for planning depth and error recovery. In exam or interview answers, this becomes a four-part answer: objective, loop, control boundary, evaluation.

4. Know the failure case

Training must make the model choose actions under partial observability and delayed reward. If you cannot say how the proposed system fails, the explanation is still shallow. Always include the failure it prevents and the new cost it introduces.
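To make "delayed reward" concrete, here is a minimal sketch of how a single end-of-episode reward can be spread back over earlier steps. Discounted return-to-go is a standard reinforcement learning device and is an assumption here, not something these notes attribute to the lecture; the function name, the discount factor `gamma`, and the four-step episode are all hypothetical.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Compute return-to-go for each step: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: three tool calls earn nothing, then a verifier
# confirms success at the final step (reward 1.0).
episode_rewards = [0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(episode_rewards, gamma=0.9)
print(returns)
```

The earliest action receives the weakest signal, which is one way to see why long-horizon tasks are hard to train on: the further an action is from the verified outcome, the less credit it gets.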

Concept weave

Ideas to remember

Post-training shapes reasoning style, tool calling, and exploration.
Synthetic tasks are useful only if they preserve real failure modes.
Agents need curricula for planning depth and error recovery.

Visual model

Agent system view

Use the graph to ask where the intelligence really lives: model, memory, tools, environment, verifier, or orchestration.

Written practice

Questions that make the idea stick

Drill 1: Create an agent training curriculum.

Start with single-tool tasks.
Add noisy observations and retries.
End with long-horizon tasks and hidden state.
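The three steps above can be sketched as a staged curriculum with a simple promotion rule. This is one possible shape under stated assumptions: the `Stage` fields, the numeric settings, the 80% success threshold, and the 20-episode minimum are all hypothetical choices for illustration, not values from the lecture.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    max_steps: int            # rough proxy for planning depth
    observation_noise: float  # fraction of corrupted tool outputs
    hidden_state: bool        # whether part of the environment is unobserved


# Hypothetical stages mirroring the drill: single tool, then noise and
# retries, then long-horizon tasks with hidden state.
CURRICULUM = [
    Stage("single_tool", max_steps=3, observation_noise=0.0, hidden_state=False),
    Stage("noisy_with_retries", max_steps=8, observation_noise=0.2, hidden_state=False),
    Stage("long_horizon_hidden_state", max_steps=30, observation_noise=0.2, hidden_state=True),
]


def next_stage_index(
    current: int,
    recent_successes: List[bool],
    threshold: float = 0.8,
    min_episodes: int = 20,
) -> int:
    """Advance one stage once enough recent episodes succeed; never skip a stage."""
    if len(recent_successes) < min_episodes:
        return current  # not enough evidence yet
    rate = sum(recent_successes) / len(recent_successes)
    if rate >= threshold and current < len(CURRICULUM) - 1:
        return current + 1
    return current


stage = next_stage_index(0, [True] * 18 + [False] * 2)
print(CURRICULUM[stage].name)
```

Gating promotion on a measured success rate keeps the harder stages from dominating training before the model has basic tool discipline.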

Drill 2: Diagnose overfitting to benchmark format.

Change surface wording.
Mutate environment state.
Check if success survives changed tools.
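The three checks above can be run as a small perturbation report: score the agent on the original tasks, then on each mutated copy, and compare. The sketch below uses a deliberately brittle toy agent so the effect is visible; the task fields, mutation names, and `brittle_agent` are hypothetical stand-ins, not a real benchmark or model.

```python
from typing import Callable, Dict, List

Task = Dict[str, str]
Agent = Callable[[Task], bool]


def success_rate(agent: Agent, tasks: List[Task]) -> float:
    """Fraction of tasks the agent solves."""
    return sum(agent(t) for t in tasks) / len(tasks)


def robustness_report(
    agent: Agent,
    tasks: List[Task],
    mutations: Dict[str, Callable[[Task], Task]],
) -> Dict[str, float]:
    """Success rate on the original tasks and under each mutation."""
    report = {"original": success_rate(agent, tasks)}
    for name, mutate in mutations.items():
        report[name] = success_rate(agent, [mutate(t) for t in tasks])
    return report


def brittle_agent(task: Task) -> bool:
    # Toy agent that only "works" on the benchmark's exact prefix and tool name.
    return task["prompt"].startswith("Q:") and task["tool"] == "search"


tasks = [
    {"prompt": "Q: find the config file", "tool": "search", "state": "clean"},
    {"prompt": "Q: list open issues", "tool": "search", "state": "clean"},
]
mutations = {
    "reworded": lambda t: {**t, "prompt": t["prompt"].replace("Q:", "Please help:", 1)},
    "mutated_state": lambda t: {**t, "state": "dirty"},
    "renamed_tool": lambda t: {**t, "tool": "lookup"},
}
report = robustness_report(brittle_agent, tasks, mutations)
print(report)
```

A large gap between the original score and a mutated score points to format overfitting rather than real capability.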

Written answer pattern

How to write this under pressure

Claim: Training agentic models solves a concrete control problem, not just a prompt-writing problem.

Mechanism: State the loop: observe state, choose action/tool, get feedback, update memory or plan, stop using a verifier.

Why it works: It makes the hidden failure mode visible: Training must make the model choose actions under partial observability and delayed reward.

Tradeoff: Extra orchestration improves reliability only if evaluation, cost, and authority boundaries are explicit.
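The loop named in the mechanism line (observe, choose an action, get feedback, update memory, stop on a verifier) can be written in a few lines. This is a minimal sketch with a toy counter environment; `run_agent`, `make_counter_env`, the always-increment policy, and the string-based memory are hypothetical illustrations, not an API from the lecture or any framework.

```python
from typing import Callable, List, Tuple

Memory = List[str]


def run_agent(
    policy: Callable[[Memory], str],
    env_step: Callable[[str], str],
    verifier: Callable[[Memory], bool],
    initial_observation: str,
    max_steps: int = 10,
) -> Tuple[bool, Memory]:
    """Run observe -> act -> feedback -> update until the verifier passes or the budget ends."""
    memory: Memory = [initial_observation]
    for _ in range(max_steps):
        if verifier(memory):           # stop condition is external to the policy
            return True, memory
        action = policy(memory)        # choose an action or tool from what is remembered
        observation = env_step(action)  # feedback from the environment
        memory.append(f"{action} -> {observation}")
    return verifier(memory), memory


def make_counter_env() -> Callable[[str], str]:
    # Toy environment: the only useful action is "increment".
    state = {"count": 0}

    def step(action: str) -> str:
        if action == "increment":
            state["count"] += 1
        return f"count={state['count']}"

    return step


def always_increment(memory: Memory) -> str:
    return "increment"


def reached_three(memory: Memory) -> bool:
    return memory[-1].endswith("count=3")


done, trace = run_agent(always_increment, make_counter_env(), reached_three, "count=0")
print(done, trace)
```

Keeping the verifier separate from the policy is the design point: the agent does not get to declare its own success, and the step budget bounds cost when it never succeeds.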

Build skill

How to apply this in your own agent

Write the concrete task and the failure mode before choosing any framework.
Choose the smallest architecture that handles the failure: workflow, single agent, orchestrator-worker, or evaluator loop.
Define tool schemas, memory boundaries, and a success checker.
Run a small eval set with failure labels, cost, latency, and trace review.
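For the last step, a small eval set can be summarized with nothing more than a record type and a counter. The sketch below assumes each run has already been labeled by hand during trace review; the `EvalRecord` fields, the failure label names, and the numbers are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EvalRecord:
    task_id: str
    success: bool
    failure_label: Optional[str]  # None when the run succeeded
    cost_usd: float
    latency_s: float


def summarize(records: List[EvalRecord]) -> Dict[str, object]:
    """Aggregate success rate, mean cost, mean latency, and failure label counts."""
    if not records:
        raise ValueError("need at least one eval record")
    n = len(records)
    failures = Counter(r.failure_label for r in records if not r.success)
    return {
        "success_rate": sum(r.success for r in records) / n,
        "mean_cost_usd": sum(r.cost_usd for r in records) / n,
        "mean_latency_s": sum(r.latency_s for r in records) / n,
        "failure_counts": dict(failures),
    }


# Hypothetical results from four eval tasks.
records = [
    EvalRecord("t1", True, None, 0.02, 4.0),
    EvalRecord("t2", False, "wrong_tool", 0.04, 9.0),
    EvalRecord("t3", True, None, 0.03, 5.0),
    EvalRecord("t4", False, "gave_up_early", 0.03, 2.0),
]
summary = summarize(records)
print(summary)
```

Counting failures by label rather than only tracking a pass rate shows which failure mode to target next, which ties back to writing the failure mode down before choosing a framework.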

Source route

Original course links and readings

Course page: https://rdi.berkeley.edu/agentic-ai/f25
Course slides: https://rdi.berkeley.edu/agentic-ai/slides/weizhu.pdf
Tulu 3: https://arxiv.org/abs/2411.15124
Iterative Reasoning Preference Optimization: https://arxiv.org/abs/2404.19733

Page generated from 1,495 YouTube captions. Raw transcript files are kept out of the public site; this page publishes study notes, timestamp routes, and paraphrased explanations.