Honesty is the best policy: defining and mitigating AI deception
File(s)
Deception-2.pdf (459.19 KB)
Accepted version
Author(s)
Ward, FR
Belardinelli, F
Toni, F
Everitt, T
Type
Conference Paper
Abstract
Deceptive agents are a challenge for the safety, trustworthiness, and cooperation of AI systems. We focus on the problem that agents might deceive in order to achieve their goals (for instance, in our experiments with language models, the goal of being evaluated as truthful). There are a number of existing definitions of deception in the literature on game theory and symbolic AI, but there is no overarching theory of deception for learning agents in games. We introduce a formal definition of deception in structural causal games, grounded in the philosophy literature, and applicable to real-world machine learning systems. Several examples and results illustrate that our formal definition aligns with the philosophical and commonsense meaning of deception. Our main technical result is to provide graphical criteria for deception. We show, experimentally, that these results can be used to mitigate deception in reinforcement learning agents and language models.
Date Issued
2023-12-10
Date Acceptance
2023-12-01
Citation
Advances in Neural Information Processing Systems, 2023, 36
URI
http://hdl.handle.net/10044/1/112791
ISSN
1049-5258
Publisher
Curran Associates, Inc.
Journal / Book Title
Advances in Neural Information Processing Systems
Volume
36
Copyright Statement
© 2023 The Author(s).
Source
NeurIPS 2023
Publication Status
Published
Start Date
2023-12-10
Finish Date
2023-12-16
Coverage Spatial
New Orleans, LA, USA