ML Safety Newsletter
ML Safety Newsletter #11: Top Safety Papers of 2023
Dec 14, 2023 • Dan Hendrycks and Aidan O'Gara

ML Safety Newsletter #10: Adversarial attacks against language and vision models, improving LLM honesty, and tracing the influence of LLM training data
Sep 13, 2023 • Dan Hendrycks and Aidan O'Gara

ML Safety Newsletter #9: Verifying large training runs, security risks from LLM access to APIs, why natural selection may favor AIs over humans
Apr 11, 2023 • Dan Hendrycks and Thomas Woodside

ML Safety Newsletter #8: Interpretability, using law to inform AI alignment, scaling laws for proxy gaming
Feb 20, 2023 • Dan Hendrycks and Thomas Woodside

ML Safety Newsletter #7: Making model dishonesty harder, making grokking more interpretable, an example of an emergent internal optimizer
Jan 9, 2023 • Dan Hendrycks

ML Safety Newsletter #6: Transparency survey, provable robustness, models that predict the future
Oct 13, 2022 • Dan Hendrycks

ML Safety Newsletter #5: Safety competitions with more than $1 million in prizes
Sep 26, 2022 • Dan Hendrycks

ML Safety Newsletter #4: Many New Interpretability Papers, Virtual Logit Matching, Rationalization Helps Robustness
Jun 3, 2022 • Dan Hendrycks

ML Safety Newsletter #3: Transformer adversarial robustness, fractals, preference learning
Mar 8, 2022 • Dan Hendrycks

ML Safety Newsletter #2: Adversarial Training, Feature Visualization, and Machine Ethics
Dec 9, 2021 • Dan Hendrycks

ML Safety Newsletter #1: ICLR Safety Paper Roundup
Oct 18, 2021 • Dan Hendrycks