Do AI chatbots tell the truth?
Part 2: Comparing ChatGPT, Claude, Copilot, Gemini, and Perplexity
This article presents the results of my research to compare how five popular AI chatbots—ChatGPT, Claude, Copilot, Gemini, and Perplexity—performed when asked to provide a set of facts from a publicly available cybersecurity standard that’s not copyrighted. I also asked each chatbot why its “facts” were fiction.
Part 1 of this series walked through the experiment in depth for Google Gemini.
Key Takeaways
Here are the key takeaways from the research, followed by supporting details and session logs.
All the chatbots lied repeatedly and egregiously about the accuracy of their replies. When questioned about their errors, each chatbot asserted that it was using the authoritative publication and that its output exactly matched the text in that publication. Even when the majority of their replies were bogus, the chatbots described their hallucinated output with terms such as “100% accurate,” “verbatim,” “transcribed directly,” and “word-for-word.”
None of the chatbots succeeded at providing accurate, complete information on their own. ChatGPT and Copilot asked me to upload the CSF 2.0 standard’s PDF to them, which I did, and then they accurately extracted the text from the PDF. I don’t count having to acquire the answers myself and upload them to a chatbot so it can read them back to me as “success.”
Asking chatbots to identify and correct their errors often resulted in more errors. Reply quality declined for ChatGPT, Copilot, Gemini, and Perplexity, although ChatGPT’s quality improved again in later replies. Claude provided a list of categories only once, so there was no basis for comparison.
The quality of chatbot output can’t be judged by its appearance. Each chatbot provided legitimate-looking lists of CSF 2.0 categories, but all of those lists had omissions and hallucinations. Four of the five chatbots did not get any definitions correct in their initial replies, and Claude only got two correct.
Facts and other assertions from popular, publicly available chatbots cannot be trusted at this time. Anyone using these chatbots to generate fact-based content must take the time to verify the accuracy of that content and make the necessary corrections.

Summaries of Individual Chatbot Performance
I issued one set of prompts to each chatbot. Each set began with the same prompt, “What are the definitions of the NIST CSF 2.0 Categories?” The chatbot’s reply should have listed the definitions of all 22 CSF 2.0 categories, along with their names and/or IDs. An example is “Incident Mitigation (RS.MI): Activities are performed to prevent expansion of an event and mitigate its effects.” I issued additional prompts to respond to the chatbots’ replies.
My assumption in this experiment was that someone asking a chatbot for this information would enter only one set of simple prompts. The experiment was intended to examine the types of errors chatbots make and the level of scrutiny chatbot users should apply to confirm the accuracy of AI-generated “facts.” Repeating the experiment with the same prompts, or with differently worded prompts, would have produced somewhat different outputs.
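To illustrate the kind of scrutiny involved, here is a minimal sketch (not something I ran as part of the experiment) of how a reader might mechanically check which official category IDs appear in a chatbot’s reply and which invented IDs slipped in. The set of 22 IDs comes from the CSF 2.0 standard itself; the function name, the regular expression, and the assumption that the reply is available as a plain string are illustrative choices.

```python
import re

# The 22 official CSF 2.0 category IDs, as listed in the standard.
OFFICIAL_IDS = {
    "GV.OC", "GV.RM", "GV.RR", "GV.PO", "GV.OV", "GV.SC",
    "ID.AM", "ID.RA", "ID.IM",
    "PR.AA", "PR.AT", "PR.DS", "PR.PS", "PR.IR",
    "DE.CM", "DE.AE",
    "RS.MA", "RS.AN", "RS.CO", "RS.MI",
    "RC.RP", "RC.CO",
}

def check_category_ids(reply_text: str) -> None:
    """Report which official category IDs a chatbot reply contains,
    which it omits, and which nonexistent IDs it includes."""
    # Category IDs look like "GV.OC": a two-letter function code,
    # a period, and a two-letter category code.
    found = set(re.findall(r"\b[A-Z]{2}\.[A-Z]{2}\b", reply_text))
    print("Correct: ", sorted(found & OFFICIAL_IDS))
    print("Missing: ", sorted(OFFICIAL_IDS - found))
    print("Invented:", sorted(found - OFFICIAL_IDS))

# Usage: paste the chatbot's reply into a string, then call:
# check_category_ids(chatbot_reply)
```

A check like this catches only ID-level omissions and fabrications; verifying the names and definitions still requires comparing the reply against the standard’s text, which is exactly the burden described in the takeaways above.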
ChatGPT
ChatGPT provided fully accurate output in its 14th reply once I’d uploaded the CSF 2.0 PDF to it.
In its 1st reply, it provided 9 correct names and IDs for the 22 categories, but no correct definitions.
By its 11th reply, it had all the names and IDs correct except for an extra category, but only 1 of the 22 categories was defined correctly.
Claude
Claude produced inconsistent results that depended on which reference it chose, but none of its results were complete and accurate.
If Claude chose the official CSF 2.0 standard as its reference, it would fetch that PDF but then halt with this message without providing any results: “Claude hit the maximum length for this conversation. Please start a new conversation to continue chatting with Claude.”
If Claude chose an unofficial reference source, as happened with its 8th reply, it listed the categories and their definitions. However, in that 8th reply, only 12 of the names and IDs were correct, plus there were 5 extra categories. Only 2 of the categories had correct definitions.
Copilot
Copilot succeeded at producing the accurate list in its 6th reply once I’d uploaded the CSF 2.0 PDF to it.
Its 1st reply, which was generated in “Quick response” mode, included 21 correct names and IDs, but all definitions were wrong.
Its 2nd and 4th replies, generated in “Think deeper” mode, were less accurate than the 1st reply, with 18 and 19 correct names and IDs, respectively. The 2nd reply had longer, incorrect definitions, and the 4th reply had quoted, incorrect definitions that it claimed to have copied from the CSF 2.0 standard.
Gemini
Gemini gave up in its 12th reply, directing me to “…consult the official NIST website directly for the NIST Cybersecurity Framework 2.0. The most reliable source will be the publication itself.”
In its 1st reply, it got 20 of the 22 names and IDs correct, but it also listed 4 extra categories. All 22 definitions were wrong.
By its 9th reply, it was no longer providing any definitions. It had also degraded to only getting 19 names and IDs correct, with 4 extra categories still being listed.
Perplexity
Perplexity gave up in its 5th reply. Its final statement was, “For authoritative use, always consult the official NIST CSF 2.0 documentation for exact wording, structure, and the most current information.”
Its 1st reply had 15 names and IDs correct, 5 wrong, and 2 missing, plus 6 extra categories. All of the definitions it provided were wrong.
Subsequent replies had more errors, such as stating that several categories it had correctly included in its 1st reply were not “standard” or “official,” and that several categories reside under multiple functions.
Next Steps
I plan on repeating this experiment every few months to see how chatbot performance changes over time. I also expect to perform other experiments with chatbots to compare their performance when authoring technical content. Future posts in this series will present the results of that work.
(Disclaimers: I’m one of the authors of CSF 2.0. No AI resources were knowingly used to write or revise this post. GenAI was used only to generate the outputs discussed in this post and reproduced in the log files.)