Making model dishonesty harder, making grokking more interpretable, an example of an emergent internal optimizer
ML Safety Newsletter #7