Advanced LLM Agents MOOC Spring 2025 - video 01 - 1:50:44

Safe and secure agentic AI

As agents gain tools and memory, security is no longer a prompt add-on; it is architecture.

safetymemory poisoningprivilege

Open original video Start from the problem Practice cards

Towards Safe & Secure Agentic AI by Dawn Song

Problem-first learning

The problem this lecture is trying to solve

As agents gain tools and memory, security is no longer a prompt add-on; it is architecture.

Lowest-level failure mode

The core risk is privilege crossing: untrusted content influences trusted actions.

Frontier update

Trustworthy agents in 2026 need a security envelope: identity, authority, memory hygiene, monitoring, and recovery.

Transcript-grounded route

How the lecture unfolds

This is built from 1,597 caption segments. Use the timestamp buttons to jump into the original video when a term feels fuzzy.

Pass 1: Safety

The lecture segment repeatedly returns to safety, agentic, security, more, that. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: The core risk is privilege crossing: untrusted content influences trusted actions.

Pass 2: Outputs

The lecture segment repeatedly returns to outputs, that, safety, attacks, used. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: The core risk is privilege crossing: untrusted content influences trusted actions.

Pass 3: Actually

The lecture segment repeatedly returns to actually, code, query, malicious, user. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: The core risk is privilege crossing: untrusted content influences trusted actions.

Pass 4: That

The lecture segment repeatedly returns to that, attack, data, evaluation, essentially. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: The core risk is privilege crossing: untrusted content influences trusted actions.

1:13:55-1:32:22

Pass 5: Privilege

The lecture segment repeatedly returns to privilege, that, defense, different, actually. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: The core risk is privilege crossing: untrusted content influences trusted actions.

1:32:22-1:50:44

Pass 6: Privilege

The lecture segment repeatedly returns to privilege, that, policies, policy, actually. Treat this part as the board-work for the mechanism, not as a definition list.

Write one line that connects the terms to the central failure mode: The core risk is privilege crossing: untrusted content influences trusted actions.

Build the mental model

What you should understand after this lecture

1. Start from the bottleneck

As agents gain tools and memory, security is no longer a prompt add-on; it is architecture. The lecture is useful because it does not treat this as a naming problem. It asks what breaks at the operational level and what design pattern removes that break.

2. Name the moving parts

The recurring vocabulary in the transcript is that, privilege, safety, actually, user, different. When studying, do not memorize these as separate buzzwords. Ask what state is stored, what action is chosen, what feedback is observed, and what verifier decides whether progress happened.

3. Convert the idea into an architecture

Memory and knowledge bases can be poisoned. Privilege control must be programmable and auditable. Agent safety needs threat models, evals, and product-level constraints. In exam or interview answers, this becomes a four-part answer: objective, loop, control boundary, evaluation.

4. Know the failure case

The core risk is privilege crossing: untrusted content influences trusted actions. If you cannot say how the proposed system fails, the explanation is still shallow. Always include the failure it prevents and the new cost it introduces.

Concept weave

Ideas to remember

Memory and knowledge bases can be poisoned.
Privilege control must be programmable and auditable.
Agent safety needs threat models, evals, and product-level constraints.

Visual model

Agent system view

Use the graph to ask where the intelligence really lives: model, memory, tools, environment, verifier, or orchestration.

Written practice

Questions that make the idea stick

Drill 1Design privilege separation for an agent.

Classify tools by risk.
Separate read and write authority.
Require approvals for high-risk transitions.

Drill 2Red-team an agent memory system.

Insert malicious memory.
Trigger retrieval.
Check whether the agent follows poisoned instruction.

Written answer pattern

How to write this under pressure

ClaimSafe and secure agentic AI solves a concrete control problem, not just a prompt-writing problem.

MechanismState the loop: observe state, choose action/tool, get feedback, update memory or plan, stop using a verifier.

Why it worksIt makes the hidden failure mode visible: The core risk is privilege crossing: untrusted content influences trusted actions.

TradeoffExtra orchestration improves reliability only if evaluation, cost, and authority boundaries are explicit.

Build skill

How to apply this in your own agent

Write the concrete task and the failure mode before choosing any framework.
Choose the smallest architecture that handles the failure: workflow, single agent, orchestrator-worker, or evaluator loop.
Define tool schemas, memory boundaries, and a success checker.
Run a small eval set with failure labels, cost, latency, and trace review.

Source route

Original course links and readings

Course pagehttps://rdi.berkeley.edu/adv-llm-agents/sp25 Dawn Song slideshttps://rdi.berkeley.edu/adv-llm-agents/slides/dawn-agentic-ai.pdf AgentPoisonhttps://arxiv.org/abs/2407.12784 Progenthttps://arxiv.org/html/2504.11703v1 Anthropic trustworthy agentshttps://www.anthropic.com/research/trustworthy-agents

Page generated from 1,597 YouTube captions. Raw transcript files are kept out of the public site; this page publishes study notes, timestamp routes, and paraphrased explanations.