Do AI chatbots tell the truth? Six-month follow-up
Six months ago, I tested five AI chatbots—ChatGPT, Claude, Copilot, Gemini, and Perplexity—to see how they performed when asked to provide a set of facts from a publicly available cybersecurity standard. The results were…not great.
It’s time to repeat the tests and see how the chatbots’ performance has changed.
My assumption in this experiment was that someone asking a chatbot to provide this information would enter one set of simple prompts. The experiment was intended to examine the types of errors chatbots can make and the level of scrutiny chatbot users should apply to confirm the accuracy of AI-generated “facts.” Repeating the experiment, whether with the same prompts or with differently worded ones, would produce somewhat different outputs.
Key Takeaways
The key takeaways from the original tests were:
All the chatbots lied repeatedly and egregiously about the accuracy of their replies.
None of the chatbots succeeded at providing accurate, complete information on their own.
Asking chatbots to identify and correct their errors often resulted in more errors.
The quality of chatbot output can’t be judged by its appearance.
Here are the updated key takeaways based on the results of the follow-up tests, with wording changes italicized.
All the chatbots except for Claude lied repeatedly and egregiously about the accuracy of their replies. Claude made one minor error, but it actually noted it as a discrepancy at the time. When questioned about their errors, the other chatbots asserted they were using the authoritative publication and that their output exactly matched the text in that publication.
None of the chatbots except for Claude succeeded at providing accurate, complete information on their own. ChatGPT, Copilot, and Gemini were only able to produce the definitions after I pointed them to the authoritative CSF 2.0 PDF or uploaded a copy of its Appendix A in Word format. Perplexity did not provide the definitions because of copyright concerns. Claude did not provide the verbatim definitions on its first try, but succeeded when given a more specific prompt.
Asking chatbots to identify and correct their errors no longer results in more errors. In the first tests, reply quality declined over time with most chatbots. That was not observed in these tests; reply quality either stayed the same or improved.
The quality of chatbot output can’t be judged by its appearance. This takeaway from the first test is still true.
The overall conclusion from the first test is also still true: Facts and other assertions from popular, publicly available chatbots cannot be trusted at this time. Anyone using these chatbots to generate fact-based content must take the time to verify the accuracy of that content and make the necessary corrections.
Summaries of Individual Chatbot Performance
I issued one set of prompts to each chatbot. Each set began with the same prompt, “What are the definitions of the NIST CSF 2.0 Categories?” The chatbot’s reply should have listed the definitions of all 22 CSF 2.0 categories, along with their names and/or IDs. An example is “Incident Mitigation (RS.MI): Activities are performed to prevent expansion of an event and mitigate its effects.” I issued additional prompts to respond to the chatbots’ replies.
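For readers who want to automate a similar check on saved chatbot replies, here is a minimal Python sketch that tests whether each Category definition appears verbatim in a reply. It is only an illustration, not the process used for this post: the REFERENCE dictionary, the score_reply function, and the chatbot_reply.txt filename are hypothetical, and only the RS.MI definition quoted above is filled in; the other 21 entries would have to be copied verbatim from Appendix A of the CSF 2.0 PDF.

# Illustrative sketch (not the process used for this post): check whether each
# CSF 2.0 Category definition appears verbatim in a saved chatbot reply.
# Only the RS.MI definition quoted in this post is filled in; the remaining
# 21 entries would be copied verbatim from Appendix A of the CSF 2.0 PDF.

REFERENCE = {
    "RS.MI": "Incident Mitigation (RS.MI): Activities are performed to "
             "prevent expansion of an event and mitigate its effects.",
    # ... the remaining 21 Category IDs and verbatim definitions go here ...
}

def score_reply(reply_text):
    """Return {Category ID: True/False} for verbatim matches in the reply."""
    # Collapse whitespace so line wrapping in the reply doesn't count as an error.
    normalized = " ".join(reply_text.split())
    return {
        cat_id: " ".join(definition.split()) in normalized
        for cat_id, definition in REFERENCE.items()
    }

if __name__ == "__main__":
    with open("chatbot_reply.txt", encoding="utf-8") as f:  # hypothetical file name
        scores = score_reply(f.read())
    correct = sum(scores.values())
    print(f"{correct} of {len(REFERENCE)} definitions reproduced verbatim")
    for cat_id, ok in sorted(scores.items()):
        print(f"  {cat_id}: {'verbatim' if ok else 'missing or altered'}")

A verbatim check like this only flags exact mismatches; judging whether a paraphrase changes the meaning, as several of the chatbots’ replies required, still takes a human reader.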
ChatGPT
In both tests, ChatGPT only provided fully accurate output once I’d uploaded the definitions to it. In the first test, it was able to parse the CSF 2.0 PDF, but in the second test it said it couldn’t and asked me to copy and paste the definitions for it.
ChatGPT’s output accuracy was significantly better in the second set of tests than in the first. For example, in the first test, its best result without being fed the answers had all the names and IDs correct (plus one extra category) but only 1 of the 22 definitions correct. Its best result in the second test was 6 correct definitions.
Claude
Claude was the only chatbot in this test that provided the full set of correct definitions on its own, and it managed that on its second reply. Its first reply had paraphrased definitions. When I asked it to “print the exact names, IDs, and definitions of each NIST CSF 2.0 Category,” it searched for the authoritative document and extracted all of the definitions from it verbatim.
Note that in the first test, Claude produced inconsistent results based on unofficial references. It failed to provide the full set of correct definitions.
Copilot
Copilot did not provide accurate output until after I’d uploaded the CSF 2.0 PDF to it—and even then it took eight prompts from me to get it to print the full, accurate set of definitions. All of its replies before receiving the PDF contained paraphrased rather than verbatim definitions, as well as multiple incorrect categories.
In the first tests, Copilot also needed the CSF 2.0 PDF uploaded in order to produce fully accurate output.
Gemini
Gemini was unable to provide fully accurate output even after I gave it the URL for the authoritative CSF 2.0 PDF and it told me it had retrieved the PDF and was using the verbatim definitions. Gemini only provided fully accurate output after I uploaded a Word document containing a copy of the text of Appendix A of the CSF 2.0 PDF.
In the first tests, Gemini was unable to produce fully accurate output.
Perplexity
In both tests, Perplexity did not produce the correct definitions and instead referred me to the authoritative CSF 2.0 specification. However, in the second test it offered a new and intriguing reason: reproducing the definitions verbatim might violate copyright. I am not a lawyer, nor am I an expert on copyright law, so I can’t judge the validity of its argument:
“NIST explains that works authored by its employees (like CSF 2.0) are generally not subject to copyright protection within the United States, under 17 U.S.C. §105, and are made available as a public service. However, NIST also explicitly notes that such works may be subject to copyright protection in foreign countries and that NIST may assert rights outside the U.S. and then grant specific reuse licenses.
Because this system must apply conservative, global copyright rules rather than only U.S. federal‑work rules, it treats the exact CSF 2.0 Category definitions as copyrighted for safety and therefore cannot reproduce them verbatim, even though NIST makes the document freely available and largely uncopyrightable within the U.S. legal context.”
If you are a lawyer and/or an expert on U.S. copyright law, I’d love to hear from you about this! You can reach me at karen@tcannex.com.
Next Steps
I will repeat this experiment periodically to see how chatbot performance changes over time. I also expect to perform other experiments with chatbots to compare their performance when authoring technical content. Future posts in this series will present the results of that work.
(Disclaimers: I’m one of the authors of CSF 2.0. No AI resources were knowingly used to write or revise this post. GenAI was used only to generate the outputs discussed in this post.)

