ML Safety Newsletter

ML Safety Newsletter #12: February 2025
ML Safety Newsletter Relaunch
Feb 27 • Julius Simonelli and Dan Hendrycks
ML Safety Newsletter #13
Chain-of-Thought Monitoring, Distinguishing Honesty from Accuracy, and Emergent Misalignment
Apr 2 • Julius Simonelli and Dan Hendrycks
ML Safety Newsletter #14
Resisting Prompt Injection, Evaluating Cyberattack Capabilities, and SafeBench Winners
May 7 • Alice Blair and Dan Hendrycks
ML Safety Newsletter #11
Top Safety Papers of 2023
Dec 14, 2023 • Dan Hendrycks and Aidan O'Gara
ML Safety Newsletter #10
Adversarial attacks against language and vision models, improving LLM honesty, and tracing the influence of LLM training data
Sep 13, 2023 • Dan Hendrycks and Aidan O'Gara
ML Safety Newsletter #9
Verifying large training runs, security risks from LLM access to APIs, why natural selection may favor AIs over humans
Apr 11, 2023 • Dan Hendrycks and Thomas Woodside
ML Safety Newsletter #3
Transformer adversarial robustness, fractals, preference learning
Mar 8, 2022 • Dan Hendrycks
ML Safety Newsletter #8
Interpretability, using law to inform AI alignment, scaling laws for proxy gaming
Feb 20, 2023 • Dan Hendrycks and Thomas Woodside
ML Safety Newsletter #2
Adversarial Training, Feature Visualization, and Machine Ethics
Dec 9, 2021 • Dan Hendrycks
ML Safety Newsletter #1
ICLR Safety Paper Roundup
Oct 18, 2021 • Dan Hendrycks
ML Safety Newsletter #7
Making model dishonesty harder, making grokking more interpretable, an example of an emergent internal optimizer
Jan 9, 2023 • Dan Hendrycks
ML Safety Newsletter #6
Transparency survey, provable robustness, models that predict the future
Oct 13, 2022 • Dan Hendrycks