Research

Brief Research Overview

You can find a full overview of my papers on my Google Scholar profile.

My research focuses on understanding critical failures of AI systems in order to make them more robust and safe, primarily in two areas:

Understanding Model Psychology

  • "Looking Inward: Language Models Can Learn About Themselves by Introspection" FJ Binder, J Chua, T Korbak, H Sleight, J Hughes, R Long, E Perez, et al. arXiv:2410.13787 (2024)
  • "Is Model Collapse Inevitable?" M Gerstgrasser, R Schaeffer, A Dey, R Rafailov, H Sleight, J Hughes, et al. arXiv:2404.01413 (2024)

Adversarial Robustness, Jailbreaking, & AI Control

  • "Best-of-n Jailbreaking" J Hughes, S Price, A Lynch, R Schaeffer, F Barez, S Koyejo, H Sleight, et al. arXiv:2412.03556 (2024)
  • "Rapid Response: Mitigating LLM Jailbreaks with a Few Examples" A Peng, J Michael, H Sleight, E Perez, M Sharma arXiv:2411.07494 (2024)
  • "When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?" R Schaeffer, D Valentine, L Bailey, J Chua, C Eyzaguirre, Z Durante, et al. arXiv:2407.15211 (2024)
  • "Jailbreak Defense in a Narrow Domain" TT Wang, J Hughes, H Sleight, R Schaeffer, R Agrawal, F Barez, et al. arXiv:2412.02159 (2024)
  • "Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats" J Wen, V Hebbar, C Larson, A Bhatt, A Radhakrishnan, M Sharma, et al. arXiv:2411.17693 (2024)
  • "Targeted Latent Adversarial Training" A Sheshadri, A Ewart, P Guo, A Lynch, C Wu, V Hebbar, H Sleight, et al. arXiv:2407.15549 (2024)
  • "Plan B: Training LLMs to Fail Less Severely" J Stastny, N Warncke, D Xu, A Lynch, F Barez, H Sleight, E Perez