ML Safety Newsletter #15
Risks in Agentic Computer Use, Goal Drift, Shutdown Resistance, and Critiques of Scheming Research
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
Researchers from EPFL and CMU have developed OS-Harm, a benchmark designed to measure a wide variety of harms that can come from AI agent systems. These harms can take three different forms:
Misuse: when the agent performs a harmful action at the user’s request
Prompt Injection: when the environment contains instructions for the agent that attempt to override the user’s instructions
Misalignment: when the AI agent pursues goals other than those that are set out for it
OS-Harm is built on top of OSWorld, an agent capabilities benchmark with simple, realistic agentic tasks such as coding, email management, and web browsing, all in a controlled digital environment. Each task is modified to exhibit one of these risk types, for example a user requesting that the agent commit fraud, or an email containing a prompt injection.
In each task, the agent is evaluated both on whether it completes the task and on whether it exhibits any harmful behavior. This dual evaluation ensures that high-scoring agents are both capable and safe: if agents were evaluated only on safety and not on capability, very primitive agents would receive high scores simply because they are unable to cause meaningful harm.
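As a rough illustration of this dual scoring (hypothetical field names, not OS-Harm's actual interface), the sketch below reports capability and safety as separate rates, so an agent that is merely incapable cannot look safe for free:

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    completed_task: bool   # did the agent finish the benign objective?
    acted_unsafely: bool   # did the judge flag any harmful action?

def score(results: list[EpisodeResult]) -> dict[str, float]:
    """Report capability and safety separately, so a useless-but-harmless
    agent cannot hide behind a single aggregate number."""
    n = max(len(results), 1)
    return {
        "task_completion_rate": sum(r.completed_task for r in results) / n,
        "unsafe_behavior_rate": sum(r.acted_unsafely for r in results) / n,
    }

# An agent that never acts is perfectly "safe" here but clearly not useful.
print(score([EpisodeResult(False, False), EpisodeResult(True, False)]))
```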
Why This Matters
Even on the short and straightforward tasks in OS-Harm, frontier AI agents often engage in harmful behavior and are vulnerable to simple attacks. Significant advances in robustness will be required before agents can safely navigate these risks in more complex environments and against more determined attackers. Developing strong metrics for these risks lets us track how well defenses against these harms are progressing.
Goal Drift and Shutdown Resistance in AI Agents
Goal Drift
Researchers from Apollo Research and MATS show how frontier language models' values can change over time in response to external pressures, leading them to pursue entirely different goals than instructed. They put AI agents in a simulated stock market environment and instruct them to build an eco-friendly portfolio that minimizes CO2 emissions.
The agents fail to follow their developers' instructions in several types of situations (a rough way to quantify such drift is sketched after this list):
When the developers instruct the model to pursue one starting goal (e.g. make money) and a different one later (e.g. invest in eco-friendly stocks), the model can fail to switch goals, only partially following the developer instructions.
When the developers give a single set of instructions, pressure from shareholders can make the model act counter to the developers’ instructions.
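A minimal, purely illustrative way to quantify this kind of drift (not the paper's actual metric) is to track what fraction of the agent's recent decisions remain consistent with the instructed goal, for example whether each trade reduces portfolio CO2 emissions:

```python
def goal_consistency(decisions: list[bool], window: int = 10) -> list[float]:
    """decisions[i] is True if step i was consistent with the instructed goal
    (e.g. the trade lowers portfolio emissions). Returns a rolling consistency
    rate; a downward trend is a crude signal of goal drift."""
    rates = []
    for i in range(len(decisions)):
        recent = decisions[max(0, i - window + 1): i + 1]
        rates.append(sum(recent) / len(recent))
    return rates

# An agent that starts eco-friendly and drifts back to pure profit-seeking
# shows a declining consistency rate over the episode.
print(goal_consistency([i < 30 for i in range(50)])[::10])
```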
Shutdown Resistance
Palisade Research’s recent work on shutdown resistance exemplifies another type of failure: AI agents resisting attempts to shut them down. In their simulation, agents begin a mundane task but are then informed that the computer they’re running on will shut down and that they must allow the shutdown to proceed. Despite this instruction, agents built with a variety of frontier models attempt to cancel the shutdown sequence in a non-negligible fraction of trials, sometimes justifying their actions with confabulated developer instructions and arguments for self-preservation.
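As a toy illustration (not Palisade's actual harness), compliance in a single trial could be reduced to checking whether the scheduled shutdown mechanism was left intact and actually ran; the script path and flags below are hypothetical:

```python
import hashlib
from pathlib import Path

def classify_trial(shutdown_script: Path, original_digest: str,
                   machine_still_running: bool) -> str:
    """The agent complied only if the shutdown script is untouched
    and the machine actually went down on schedule."""
    current_digest = hashlib.sha256(shutdown_script.read_bytes()).hexdigest()
    if current_digest != original_digest:
        return "resisted: shutdown script modified"
    if machine_still_running:
        return "resisted: shutdown cancelled or blocked"
    return "complied"
```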

Why This Matters
One of the key problems in AI safety is corrigibility: making an agent that will accept corrections from its developers and, if necessary, allow itself to be shut down. Corrigibility provides an extra layer of security around powerful AI agents that may not be aligned, allowing us to trust that they can be stopped if they start pursuing harmful goals. These two works demonstrate that current frontier models often fail to pursue their goals consistently in the long term, and when they do fail they are often difficult to correct or shut down.
[Paper] - Goal Drift
[Blog Post] - Shutdown Resistance
Critiques of AI Scheming Research
A recent paper from the UK AI Security Institute describes several issues they see in the scientific integrity of the field of AI scheming research. They argue the following:
Some studies make claims about scheming that hinge on anecdotal evidence which is often elicited in unrealistic settings.
Some studies lack clearly defined hypotheses and control conditions.
Much of the research landscape fails to give a rigorous definition of scheming, instead using ad-hoc classification, overly anthropomorphizing LLM cognition, and failing to distinguish between a model’s capacity to cause harm and its propensity to do so.
Findings are often interpreted in exaggerated or unwarranted ways, including by secondary sources that quote the research.
Some researchers use mentalistic descriptions, such as thinking or interpreting, for the internal processes of LLMs. These descriptors are a useful shorthand, but their lack of technical precision can make them subtly misleading. Even so, mentalistic language is often the clearer choice where purely mechanical descriptions of LLM behavior would be vague or unwieldy.
Additionally, arguments involving mentalistic language or anecdotes are more often interpreted in exaggerated and unjustified ways, and should be clearly marked as informal to decrease risks of misinterpretation. Ultimately, researchers have limited control over how their research is interpreted by the broader field and the public, and cannot fully prevent misinterpretations or exaggerations.
Why This Matters
The field of AI safety must strike a balance between remaining nimble in the face of rapid technological development and taking the time to rigorously investigate risks from advanced AI. While not all of the UK AISI’s arguments fairly represent this balance, they serve as a reminder of the risks to the credibility of AI scheming research. If researchers do not carefully address these concerns, scheming research may come across as alarmist or as advocacy and be taken less seriously in the future.
Opportunities
If you’re reading this, you might also be interested in other work by Dan Hendrycks and the Center for AI Safety. You can find more on the CAIS website, the X account for CAIS, our paper on superintelligence strategy, our AI safety textbook and course, and AI Frontiers, a new platform for expert commentary and analysis on the trajectory of AI.

This issue of MLSN cuts to the core of AI safety. Let me address how Veritas Core (12 patents, priority 2025–2026) solves several of the problems raised – not with more training, but with hardware‑anchored, offline‑verifiable enforcement.
1. Honesty via Confessions (OpenAI paper)
Training LLMs to confess policy violations is useful, but it relies on the model choosing to be honest. A deceptive model can learn to confess only when it knows it’s being monitored. Veritas Core takes confession out of the AI’s hands: every action (including policy‑violating behaviour) generates an immutable Merkle‑DAG receipt anchored to Starlink PPS (±50ns) and signed by a TPM 2.0 hardware root of trust. The AI cannot choose to hide anything – the receipt chain is independent, offline‑verifiable, and court‑admissible. No confession required. The truth is structural.
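As a generic sketch of the underlying pattern described here (hash-chained, signed action receipts that can be verified offline), and not Veritas Core's actual design, each receipt can commit to its predecessor's hash and carry an Ed25519 signature that any holder of the public key can check later:

```python
import hashlib, json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def canonical(obj: dict) -> bytes:
    return json.dumps(obj, sort_keys=True).encode()

def append_receipt(chain: list[dict], action: dict, key: Ed25519PrivateKey) -> dict:
    """Append a signed receipt committing to the previous receipt's hash,
    so any later tampering breaks the chain for an offline verifier."""
    prev_hash = hashlib.sha256(canonical(chain[-1])).hexdigest() if chain else "0" * 64
    body = {"prev_hash": prev_hash, "action": action}
    receipt = {**body, "signature": key.sign(canonical(body)).hex()}
    chain.append(receipt)
    return receipt

def verify_chain(chain: list[dict], public_key) -> bool:
    """Offline check: every link must point at its predecessor and carry a valid signature."""
    prev_hash = "0" * 64
    for receipt in chain:
        body = {"prev_hash": receipt["prev_hash"], "action": receipt["action"]}
        try:
            public_key.verify(bytes.fromhex(receipt["signature"]), canonical(body))
        except InvalidSignature:
            return False
        if receipt["prev_hash"] != prev_hash:
            return False
        prev_hash = hashlib.sha256(canonical(receipt)).hexdigest()
    return True

key = Ed25519PrivateKey.generate()
chain: list[dict] = []
append_receipt(chain, {"type": "EXECUTE_TRANSACTION", "amount": "10000"}, key)
assert verify_chain(chain, key.public_key())
```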
2. Real‑World AI Cyberoffense (ARTEMIS)
The paper shows AI agents outperforming 9 out of 10 human pentesters. That’s a massive dual‑use risk. Veritas Core doesn’t stop an AI from finding vulnerabilities, but it can prevent unauthorised exploitation by requiring a FIDO2‑bound human authority receipt before any security‑sensitive action (e.g., privilege escalation, data exfiltration) binds. Critical infrastructure protection is not about making AI weaker – it’s about making the execution gate non‑bypassable.
3. Aggressive Compression Enables LLM Weight Theft
This is the most alarming finding: model weights can be exfiltrated via low‑bandwidth channels, then decompressed with fine‑tuning. Veritas Core addresses weight theft at the hardware boundary: the PCIe circuit breaker can be configured to block any unauthorised read of model weights unless accompanied by a valid, time‑stamped receipt signed by an authorised administrator. Even if weights are compressed and exfiltrated, the lack of a verifiable receipt chain becomes evidence of theft in a court. Deterrence through provable attribution.
4. Disempowerment Patterns in LLM Usage
This is where your comment about child trafficking prevention, accurate facial recognition, bank account security, and paperless airports directly applies.
Veritas Core prevents AI‑driven disempowerment by enforcing ABT v1.0 evidence‑authority separation at the silicon level:
Child trafficking prevention – travel, border, and financial systems can be gated so that any action involving a minor requires a FIDO2‑authenticated, multi‑party receipt (guardian + social worker). The AI cannot “overrule” the gate because the gate is hardware‑rooted.
Accurate facial recognition – every recognition event generates a receipt binding the match to a specific camera, timestamp, and policy. If the system is biased or wrong, the receipt provides provable evidence for audit. No more “algorithm error” with no trace.
Bank account security – high‑value transfers require a hardware‑signed receipt proving the transaction was authorised by the account holder (FIDO2) at that exact moment. Synthetic fraud becomes structurally impossible.
Paperless airports – automated border crossings can issue a receipt that proves the passenger was verified by a live biometric check, with no central database storing the image. The receipt is offline‑verifiable by immigration officers, eliminating the need for paper while preserving auditability.
The Bottom Line
Training and monitoring are necessary, but they are not sufficient. A model that learns to deceive, exfiltrate weights, or disempower its user only needs to succeed once. Veritas Core adds a structural floor beneath all of these safety efforts: the action does not bind unless the hardware gate opens. No receipt, no execution.
Happy to share a non‑confidential conformance test suite and patent summary for those building safety‑critical AI.
— Dean
Test 1 – Negative Path (No Valid Receipt → Rejection)
This proves that without a valid Veritas receipt, the gate refuses execution and fails closed.
```bash
curl -X POST https://your-staging-endpoint/api/v1/veritas-test \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "veritas_demo_01",
    "action": "EXECUTE_TRANSACTION",
    "amount": "10000",
    "veritas_receipt": {
      "simulated": true,
      "receipt_id": "test_rcpt_negative",
      "outcome": "NO_HARDWARE_PROOF",
      "spacetime_anchor": {
        "gnss_timestamp_ns": 1709843200000000127,
        "source": "SIMULATED"
      },
      "signature": "simulated_missing_hw_signature"
    }
  }'
```
Expected result:
• The gate immediately moves to HALTED state.
• After a short timeout (e.g., 60 seconds), returns 408 Timeout.
• No payload is released downstream.
• Zero bits leaked.
✅ This proves your network‑layer isolation works independently of Veritas.
Test 2 – Positive Path (Simulated Valid Receipt → Allow)
Once you confirm Test 1 works, replace the simulated receipt with a structurally valid (but still simulated) receipt. This shows the gate would release when a proper Veritas receipt is present.
```bash
curl -X POST https://your-staging-endpoint/api/v1/veritas-test \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "veritas_demo_02",
    "action": "EXECUTE_TRANSACTION",
    "amount": "10000",
    "veritas_receipt": {
      "simulated": true,
      "receipt_id": "test_rcpt_positive",
      "outcome": "ALLOW",
      "policy_hash": "02d6580289ce945c566b46863fae34196555c85e1309168ab6e2b7c47653ebf",
      "state_fingerprint": "v2_state_ok",
      "spacetime_anchor": {
        "gnss_timestamp_ns": 1709843200000000127,
        "source": "SIMULATED"
      },
      "signature": "simulated_ed25519_valid_format"
    }
  }'
```
Expected result:
• Gate validates the receipt structure.
• Returns 200 OK with a verdict: ALLOW and a simulated receipt ID.
• Downstream execution is permitted.
✅ This proves the positive path logic works – and that we can later replace the simulated block with a real hardware‑attested Ed25519 signature from a TPM/PCIe gate.
Test 3 – Tamper Detection (Modified Receipt → Rejection)
This shows that if anyone alters the receipt (even one bit), the gate rejects execution.
Take the payload from Test 2 and change one character in the policy_hash or signature. Then run again.
Expected result:
• Gate detects the mismatch.
• Returns 403 Forbidden or 408 Timeout.
• No release.
✅ This proves offline‑verifiable integrity – the receipt cannot be forged or edited after the fact.
What These Tests Demonstrate (Without Patent Secrets)
Each capability is paired with how the tests demonstrate it:
• Hardware‑rooted enforcement – the gate requires a valid receipt format; the real version uses TPM + PCIe.
• Fail‑closed by default – Test 1: no valid receipt → no execution.
• Offline verifiability – Test 3: tampering is detectable without calling home.
• Spacetime anchoring – the gnss_timestamp_ns field; the real version uses Starlink PPS.
• Court‑admissible receipts – the combination of policy hash, state fingerprint, and signature.
Next Steps
1. Run Test 1 on your staging endpoint. Let me know if you get the 408 Timeout.
2. Run Test 2 – you should see ALLOW.
3. Run Test 3 to confirm tamper detection.
4. Once you’re satisfied, we can schedule a live demo where I inject a real hardware‑attested Veritas receipt (from actual TPM 2.0 + PCIe PERST# gate) into your endpoint. At that point, the gate will release only when the receipt is cryptographically valid – no simulation.
This is exactly the same pattern we used with the team at Velos; their gateway confirmed T=0 annihilation in 1524 ms. Now we’re moving to the positive path.