Research
Interests: AI Control, Human-AI Alignment, AI Safety by design.
On the side, happy to chat about AI-based psychology or philosophy.
Works that (Roughly) Represent My Interests:
- AI Safety by design/world models: Guaranteed Safe AI; Full-Stack Alignment
- Pluralistic Alignment: Beyond Preferences in AI Alignment; Patchwork Quilt of Human Coexistence; Roadmap to Pluralistic Alignment
- Morals/Ethics: Ethics of LLM Collaboration; Moral Alignment for LLM Agents
- Cognition-Inspired: Recognition and Rapid Response to Unfamiliar Events; Lessons on autonomous learning from cognitive science