Max Nadeau
max [dot] nadeau [at] openphilanthropy [dot] org
I'm Max Nadeau, a Program Associate on the Technical AI Safety team at Open Philanthropy. My work supports research to make machine learning systems more trustworthy and transparent, especially as models become increasingly complex.
I graduated from Harvard College, where I studied computer science and researched AI robustness and interpretability. The research papers I've contributed to are listed below. I also helped lead the Harvard AI Safety Team, a student group that helps its members conduct research to reduce risks from advanced AI. Here's an article in the school paper about us.
I also enjoy philosophy, mathematics, and crossword puzzles.
Papers:
* indicates a paper's first author(s)
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (arXiv)
Stephen Casper*, Xander Davies*, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell.
On arXiv, 2023.
Benchmarks for Detecting Measurement Tampering (arXiv)
Fabien Roger*, Ryan Greenblatt*, Max Nadeau, Buck Shlegeris, Nate Thomas.
On arXiv, 2023.
Circuit Breaking: Removing Model Behaviors with Targeted Ablation (arXiv)
Maximilian Li*, Xander Davies*, Max Nadeau*.
In ICML 2023 Workshop on Deployment Challenges for Generative AI.
Discovering Variable Binding Circuitry with Desiderata (arXiv)
Xander Davies*, Max Nadeau*, Nikhil Prakash*, Tamar Rott Shaham, David Bau.
In ICML 2023 Workshop on Deployment Challenges for Generative AI.
Robust Feature-Level Adversaries are Interpretability Tools (arXiv)
Stephen Casper*, Max Nadeau*, Dylan Hadfield-Menell, Gabriel Kreiman.
In NeurIPS 2022 (Advances in Neural Information Processing Systems 35).