Many New Interpretability Papers, Virtual Logit Matching, Rationalization Helps Robustness
Transformer Adversarial Robustness, Fractals, Preference Learning
Adversarial Training, Feature Visualization, and Machine Ethics
ICLR Safety Paper Roundup