ML Safety Newsletter

ML Safety Newsletter #12: February 2025
ML Safety Newsletter Relaunch
Feb 27 • Julius Simonelli and Dan Hendrycks
ML Safety Newsletter #13
Chain-of-Thought Monitoring, Distinguishing Honesty from Accuracy, and Emergent Misalignment
Apr 2 • Julius Simonelli and Dan Hendrycks
ML Safety Newsletter #14
Resisting Prompt Injection, Evaluating Cyberattack Capabilities, and SafeBench Winners
May 7 • Alice Blair and Dan Hendrycks
ML Safety Newsletter #11
Top Safety Papers of 2023
Dec 14, 2023 • Dan Hendrycks and Aidan O'Gara
ML Safety Newsletter #10
Adversarial attacks against language and vision models, improving LLM honesty, and tracing the influence of LLM training data
Sep 13, 2023 • Dan Hendrycks and Aidan O'Gara
ML Safety Newsletter #9
Verifying large training runs, security risks from LLM access to APIs, why natural selection may favor AIs over humans
Apr 11, 2023 • Dan Hendrycks and Thomas Woodside
ML Safety Newsletter #3
Transformer adversarial robustness, fractals, preference learning
Mar 8, 2022 • Dan Hendrycks
ML Safety Newsletter #8
Interpretability, using law to inform AI alignment, scaling laws for proxy gaming
Feb 20, 2023 • Dan Hendrycks and Thomas Woodside
ML Safety Newsletter #2
Adversarial Training, Feature Visualization, and Machine Ethics
Dec 9, 2021 • Dan Hendrycks
ML Safety Newsletter #1
ICLR Safety Paper Roundup
Oct 18, 2021 • Dan Hendrycks
ML Safety Newsletter #7
Making model dishonesty harder, making grokking more interpretable, an example of an emergent internal optimizer
Jan 9, 2023 • Dan Hendrycks
ML Safety Newsletter #6
Transparency survey, provable robustness, models that predict the future
Oct 13, 2022 • Dan Hendrycks