Archive - ML Safety Newsletter

ML Safety Newsletter #15

Risks in Agentic Computer Use, Goal Drift, Shutdown Resistance, and Critiques of Scheming Research

Aug 18

May 2025

ML Safety Newsletter #14

Resisting Prompt Injection, Evaluating Cyberattack Capabilities, and SafeBench Winners

May 7

April 2025

ML Safety Newsletter #13

Chain-of-Thought Monitoring, Distinguishing Honesty from Accuracy, and Emergent Misalignment

Apr 2 • Julius Simonelli

February 2025

ML Safety Newsletter #12: February 2025

ML Safety Newsletter Relaunch

Feb 27 • Julius Simonelli

December 2023

ML Safety Newsletter #11

Top Safety Papers of 2023

Dec 14, 2023

September 2023

ML Safety Newsletter #10

Adversarial attacks against language and vision models, improving LLM honesty, and tracing the influence of LLM training data

Sep 13, 2023

April 2023

ML Safety Newsletter #9

Verifying large training runs, security risks from LLM access to APIs, why natural selection may favor AIs over humans

Apr 11, 2023 • Thomas Woodside

February 2023

ML Safety Newsletter #8

Interpretability, using law to inform AI alignment, scaling laws for proxy gaming

Feb 20, 2023 • Thomas Woodside

January 2023

ML Safety Newsletter #7

Making model dishonesty harder, making grokking more interpretable, an example of an emergent internal optimizer

Jan 9, 2023

October 2022

ML Safety Newsletter #6

Transparency survey, provable robustness, models that predict the future

Oct 13, 2022

September 2022

ML Safety Newsletter #5

Safety competitions with more than $1 million in prizes

Sep 26, 2022

June 2022

ML Safety Newsletter #4

Many New Interpretability Papers, Virtual Logit Matching, Rationalization Helps Robustness

Jun 3, 2022
