Trent N. Cash

Ph.D. Student at Carnegie Mellon University

Quantifying UncertAInty: Testing the Accuracy of LLMs’ Confidence Judgments


Under Review (Revise & Resubmit; Preprint Available)


Trent N. Cash, Daniel M. Oppenheimer, Sara Christie


Cite

APA
Cash, T. N., Oppenheimer, D. M., & Christie, S. (n.d.). Quantifying UncertAInty: Testing the Accuracy of LLMs’ Confidence Judgments. https://doi.org/10.31234/osf.io/47df5


Chicago/Turabian
Cash, Trent N., Daniel M. Oppenheimer, and Sara Christie. “Quantifying UncertAInty: Testing the Accuracy of LLMs’ Confidence Judgments” (n.d.).


MLA
Cash, Trent N., et al. Quantifying UncertAInty: Testing the Accuracy of LLMs’ Confidence Judgments. doi:10.31234/osf.io/47df5.


BibTeX

@article{trent-a,
  title = {Quantifying UncertAInty: Testing the Accuracy of LLMs’ Confidence Judgments},
  doi = {10.31234/osf.io/47df5},
  author = {Cash, Trent N. and Oppenheimer, Daniel M. and Christie, Sara}
}

Abstract

The rise of Large Language Model (LLM) chatbots, such as ChatGPT and Gemini, has revolutionized how we access information. These LLMs can answer a wide array of questions, including those with inherent uncertainties, such as predictions about future events. The present studies investigate the capability of LLMs to quantify such uncertainty through confidence judgments. We evaluate the absolute and relative accuracy of confidence judgments from two LLMs (ChatGPT and Gemini) compared to those of human participants across three prediction domains: NFL game winners (Study 1a; n = 502), Oscar award winners (Study 1b; n = 109), and future Pictionary performance (Study 2; n = 164). Our findings reveal that LLMs’ confidence judgments closely align with those of humans in terms of accuracy, biases, and errors. However, unlike humans, LLMs struggle to adjust their confidence judgments based on past performance, highlighting a key area for improvement in their design.
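
The abstract's distinction between absolute accuracy (calibration: does average confidence match the hit rate?) and relative accuracy (resolution: is confidence higher for correct than for incorrect predictions?) follows standard conventions in the judgment and decision-making literature. As a rough, hypothetical illustration only (this is not the paper's analysis code; the function names and toy data are invented), the following Python sketch computes one common measure of each:

import numpy as np

# Illustrative only: hypothetical metric functions and toy data,
# not the paper's actual analyses.

def calibration_gap(confidence, correct):
    """Absolute accuracy: mean confidence minus proportion correct.
    Positive values indicate overconfidence."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return confidence.mean() - correct.mean()

def resolution_auc(confidence, correct):
    """Relative accuracy: probability that a randomly chosen correct
    prediction received higher confidence than a randomly chosen
    incorrect one (ties count as half)."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = confidence[correct], confidence[~correct]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # undefined if all answers are (in)correct
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

# Toy example: confidence (0-1) and outcomes for five predictions.
conf = [0.9, 0.8, 0.7, 0.6, 0.55]
hits = [1, 1, 0, 1, 0]
print(f"calibration gap: {calibration_gap(conf, hits):+.3f}")
print(f"resolution (AUC): {resolution_auc(conf, hits):.3f}")

On this toy data the calibration gap is +0.110 (overconfident by eleven percentage points) and the resolution AUC is 0.833, meaning confidence discriminates correct from incorrect predictions well above the chance level of 0.5.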