Full metadata record
DC Field | Value | Language
dc.contributor.author | Rein, David | -
dc.contributor.author | Bowman, Samuel | -
dc.contributor.author | et al. | -
dc.date.accessioned | 2024-09-30T01:54:26Z | -
dc.date.available | 2024-09-30T01:54:26Z | -
dc.date.issued | 2023-11 | -
dc.identifier.uri | http://hdl.handle.net/2451/74631 | -
dc.description.abstract | We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities. | en
dc.description.sponsorship | This project has benefited from financial support to SB by Eric and Wendy Schmidt (made by recommendation of the Schmidt Futures program) and Open Philanthropy, and from in-kind support by the NYU High-Performance Computing Center. This material is based upon work supported by the National Science Foundation under Grant Nos. 1922658 and 2046556. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. This work was supported by the Berkeley Existential Risk Initiative. | en
dc.language.iso | en_US | en
dc.publisher | Proceedings of COLM | en
dc.title | GPQA: A Graduate-Level Google-Proof Q&A Benchmark | en
dc.type | Dataset | en
Appears in Collections: Machine Learning for Language Lab

Files in This Item:
File | Description | Size | Format
gpqa-main.zip | - | 5.45 MB | Unknown
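
For readers who download the archive above, the sketch below shows one way to load the questions and score a four-way multiple-choice baseline against them. This is a minimal sketch, not the authors' evaluation code: the CSV file name (gpqa_main.csv) and the column names (Question, Correct Answer, Incorrect Answer 1-3) are assumptions about the contents of gpqa-main.zip and should be adjusted to match the actual files.

```python
# Minimal sketch: load GPQA from the downloaded archive and score a
# random-guess baseline on the four-way multiple-choice task.
# The CSV name and column names below are assumptions about the archive
# contents; adjust them to whatever is actually inside gpqa-main.zip.
import io
import random
import zipfile

import pandas as pd

ARCHIVE = "gpqa-main.zip"
CSV_NAME = "gpqa_main.csv"  # assumed file name inside the archive

with zipfile.ZipFile(ARCHIVE) as zf:
    # Locate the CSV regardless of its directory inside the zip;
    # raises StopIteration if no such file exists.
    member = next(n for n in zf.namelist() if n.endswith(CSV_NAME))
    df = pd.read_csv(io.BytesIO(zf.read(member)))

rng = random.Random(0)
correct = 0
for _, row in df.iterrows():
    # Each question has one correct answer and three distractors
    # (assumed column names).
    choices = [
        row["Correct Answer"],
        row["Incorrect Answer 1"],
        row["Incorrect Answer 2"],
        row["Incorrect Answer 3"],
    ]
    rng.shuffle(choices)
    guess = rng.choice(choices)  # replace with a model's chosen option
    correct += guess == row["Correct Answer"]

print(f"{correct}/{len(df)} = {correct / len(df):.1%} accuracy")
```

Replacing the random guess with a model's selected option yields the accuracy metric the abstract reports: a random guesser lands near 25% on four options, below the 34% skilled non-expert and 39% GPT-4 baseline figures.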

