Agentic AI MOOC Fall 2025 - video 01 - 1:48:50

Agentic AI safety and security

Agents can take actions, so prompt injection, tool misuse, memory poisoning, and privilege escalation become operational risks.

securityprompt injectionguardrails

Open original video Start from the problem Practice cards

Agentic AI Safety & Security by Dawn Song

Problem-first learning

The problem this lecture is trying to solve

Agents can take actions, so prompt injection, tool misuse, memory poisoning, and privilege escalation become operational risks.

Lowest-level failure mode

Untrusted input can influence trusted tool calls unless context, privileges, and approval boundaries are separated.

Frontier update

2026 agent safety emphasizes product decisions, privilege control, memory poisoning defenses, and trustworthy deployment practices.

Transcript-grounded route

How the lecture unfolds

This is built from 1,421 caption segments. Use the timestamp buttons to jump into the original video when a term feels fuzzy.

Pass 1: Security

The lecture segment repeatedly returns to security, agentic, that, safety, overall. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Untrusted input can influence trusted tool calls unless context, privileges, and approval boundaries are separated.

Pass 2: Attacker

The lecture segment repeatedly returns to attacker, prompt, injection, prompt injection, that. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Untrusted input can influence trusted tool calls unless context, privileges, and approval boundaries are separated.

Pass 3: That

The lecture segment repeatedly returns to that, actually, malicious, user, prompt injection. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Untrusted input can influence trusted tool calls unless context, privileges, and approval boundaries are separated.

Pass 4: Attack

The lecture segment repeatedly returns to attack, attacker, different, teaming, that. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Untrusted input can influence trusted tool calls unless context, privileges, and approval boundaries are separated.

1:12:33-1:30:44

Pass 5: Security

The lecture segment repeatedly returns to security, that, evaluation, defense, guardrails. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Untrusted input can influence trusted tool calls unless context, privileges, and approval boundaries are separated.

1:30:44-1:48:50

Pass 6: Security

The lecture segment repeatedly returns to security, that, help, agentic, defense. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: Untrusted input can influence trusted tool calls unless context, privileges, and approval boundaries are separated.

Build the mental model

What you should understand after this lecture

1. Start from the bottleneck

Agents can take actions, so prompt injection, tool misuse, memory poisoning, and privilege escalation become operational risks. The lecture is useful because it does not treat this as a naming problem. It asks what breaks at the operational level and what design pattern removes that break.

2. Name the moving parts

The recurring vocabulary in the transcript is that, security, attack, agentic, different, actually. When studying, do not memorize these as separate buzzwords. Ask what state is stored, what action is chosen, what feedback is observed, and what verifier decides whether progress happened.

3. Convert the idea into an architecture

Separate user data, tool outputs, system policy, and secrets. Use least privilege for tools and scoped credentials. Add guardrails, approvals, audit logs, and red-team tests. In exam or interview answers, this becomes a four-part answer: objective, loop, control boundary, evaluation.

4. Know the failure case

Untrusted input can influence trusted tool calls unless context, privileges, and approval boundaries are separated. If you cannot say how the proposed system fails, the explanation is still shallow. Always include the failure it prevents and the new cost it introduces.

Concept weave

Ideas to remember

Separate user data, tool outputs, system policy, and secrets.
Use least privilege for tools and scoped credentials.
Add guardrails, approvals, audit logs, and red-team tests.

Visual model

Agent system view

Use the graph to ask where the intelligence really lives: model, memory, tools, environment, verifier, or orchestration.

Written practice

Questions that make the idea stick

Drill 1Threat-model a browser agent.

List assets: accounts, data, tools, money movement.
List attack inputs: pages, emails, files, memory.
Add permission boundaries and confirmation gates.

Drill 2What is indirect prompt injection?

A malicious external document instructs the agent to ignore policy or exfiltrate data.
Defense: treat tool outputs as untrusted data.

Written answer pattern

How to write this under pressure

ClaimAgentic AI safety and security solves a concrete control problem, not just a prompt-writing problem.

MechanismState the loop: observe state, choose action/tool, get feedback, update memory or plan, stop using a verifier.

Why it worksIt makes the hidden failure mode visible: Untrusted input can influence trusted tool calls unless context, privileges, and approval boundaries are separated.

TradeoffExtra orchestration improves reliability only if evaluation, cost, and authority boundaries are explicit.

Build skill

How to apply this in your own agent

Write the concrete task and the failure mode before choosing any framework.
Choose the smallest architecture that handles the failure: workflow, single agent, orchestrator-worker, or evaluator loop.
Define tool schemas, memory boundaries, and a success checker.
Run a small eval set with failure labels, cost, latency, and trace review.

Source route

Original course links and readings

Course pagehttps://rdi.berkeley.edu/agentic-ai/f25 Course pagehttps://rdi.berkeley.edu/agentic-ai/f25 Anthropic trustworthy agentshttps://www.anthropic.com/research/trustworthy-agents OpenAI Agents SDK guardrailshttps://openai.github.io/openai-agents-python/guardrails/

Page generated from 1,421 YouTube captions. Raw transcript files are kept out of the public site; this page publishes study notes, timestamp routes, and paraphrased explanations.