Still Hallucinating?

Can we avoid language model hallucinations with Retrieval-Augmented Generation?

The News

In a recent blog post, Teresa Kubacka describes some problematic responses to queries she made to a system designed to answer questions based on a large index of abstracts of research papers.

Kubacka’s questions were about electromagnons, a phenomenon in condensed matter physics that was the subject of her Ph.D. dissertation. She found that the system’s answers often described electromagnons using direct quotes from abstracts that were not about electromagnons.

For instance, in response to the question

“What is the relationship between electromagnons and magnons in the field of condensed matter physics?”

electromagnons are described as

“excitations of the electric dipole order in ferroelectrics.”

However, in the original abstract, the full passage is

“Here we compare the magnon with the ferron, i.e. the elementary excitation of the electric dipole order…”1

making it clear that this is a description of a ferron, not a magnon at all.

This identification of one term from the question with the description of a different term bears a striking resemblance to the inventions, or ‘hallucinations,’ that large language models like ChatGPT produce, particularly when asked questions to which they do not know the answers.

Kubacka cites documentation from Scopus, the publisher of the tool, indicating that this system uses Retrieval-Augmented Generation, introduced by Lewis et al. 2020.2

What are Hallucinations?

Generative language models can produce nonsensical or false answers. In some cases, these false answers simply reflect human beliefs and misconceptions present in their training material.3 In other cases, however, the generated text does not accurately reflect either the training material or any additional information supplied,4,5 a phenomenon known as ‘hallucination.’ OpenAI acknowledges in their GPT-4V(ision) system card6 that “GPT-4 has the tendency to ‘hallucinate,’ i.e. ‘produce content that is nonsensical or untruthful in relation to certain sources.’”

What is Retrieval-Augmented Generation?

Retrieval-Augmented Generation (RAG), introduced by Lewis et al. 2020,2 is a commonly used technique in modern question-answering systems. It works by retrieving relevant documents based on the question and supplying them to the generative model alongside the question. The basic idea is to ensure that the generative model has the information it needs to answer, rather than having to rely on whatever it happened to see during training.
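
To make the retrieve-then-generate flow concrete, here is a minimal sketch in Python. The bag-of-words retriever and the call_llm stub are illustrative placeholders of this post, not components of the Scopus tool or of Lewis et al.’s implementation; a real system would typically use a dense embedding model for retrieval and an LLM such as GPT-4 for generation.

    from collections import Counter
    from math import sqrt

    def bow(text: str) -> Counter:
        # Toy "embedding": a bag-of-words term-frequency vector.
        return Counter(text.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        # Cosine similarity between two sparse term-frequency vectors.
        dot = sum(a[t] * b[t] for t in a)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def retrieve(question: str, abstracts: list[str], k: int = 3) -> list[str]:
        # Rank the indexed abstracts by similarity to the question; keep the top k.
        q = bow(question)
        return sorted(abstracts, key=lambda a: cosine(bow(a), q), reverse=True)[:k]

    def build_prompt(question: str, context: list[str]) -> str:
        # Put the retrieved abstracts in front of the model, so it does not
        # have to rely on whatever it memorized during training.
        joined = "\n\n".join(context)
        return ("Answer the question using only the abstracts below.\n\n"
                f"Abstracts:\n{joined}\n\nQuestion: {question}\nAnswer:")

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in for a call to a generative language model.
        raise NotImplementedError

    def answer(question: str, abstracts: list[str]) -> str:
        # Retrieve, then generate: the two stages of the RAG pattern.
        return call_llm(build_prompt(question, retrieve(question, abstracts)))

With a real embedding model and LLM plugged in for the toy pieces, answer() would first pull the abstracts most similar to the question and only then ask the model to answer from them.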

Shuster et al. 2021 found that this technique reduced hallucinations by over 60%, but did not eliminate them.8

Why doesn’t RAG help in this case?

There are three issues with this response:

  1. The retrieved documents do not actually mention “electromagnons.”
  2. Nonetheless, the LLM appears to give an answer specific to electromagnons.
  3. The given answer does not even describe magnons, but a different excitation (the ferron).

Each of these issues can teach us something.

Lesson 1: The quality of the retrieval matters for RAG. The closest matches to the query may not be good matches. (In fact, in this case, Kubacka noted that a simple keyword search did retrieve abstracts containing “electromagnon” from the database; for some reason, the NLP query system did not find them.)
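
One possible guard is sketched below, reusing bow, cosine, and retrieve from the earlier sketch. The similarity threshold and the keyword fallback are illustrative assumptions, not features of the Scopus system: discard semantic matches below a minimum score, and fall back to a plain keyword search when nothing clears the bar.

    def keyword_search(question: str, abstracts: list[str]) -> list[str]:
        # Plain keyword fallback: keep abstracts that share a term with the query.
        terms = set(question.lower().split())
        return [a for a in abstracts if terms & set(a.lower().split())]

    def retrieve_with_guard(question: str, abstracts: list[str],
                            k: int = 3, min_score: float = 0.2) -> list[str]:
        # min_score is an arbitrary illustrative threshold.
        q = bow(question)
        scored = sorted(((cosine(bow(a), q), a) for a in abstracts), reverse=True)
        good = [a for score, a in scored[:k] if score >= min_score]
        # If no semantic match clears the threshold, fall back to keyword search
        # rather than passing weak matches on to the generator.
        return good if good else keyword_search(question, abstracts)[:k]

Even a crude check like this would at least prefer abstracts that actually contain the query term, which is what the simple keyword search managed in Kubacka’s test.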

Lesson 2: In the absence of relevant information, LLMs are still prone to making things up. Had the system been designed to show the broader context of each quoted passage, a user might have detected this error. Another promising approach is R-Tuning, introduced by Zhang et al. 2023,9 which explicitly trains LLMs to say “I don’t know” when the available information does not provide an answer, rather than training only on examples with correct answers.
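
Both mitigations can be sketched on top of the earlier example. These are illustrative prompt-level measures, not R-Tuning itself, which works at training time: tell the model that abstaining is acceptable, and return the retrieved passages alongside the answer so the reader can check each claim against its source.

    REFUSAL_HINT = (
        "If the abstracts below do not answer the question, "
        "reply exactly 'I don't know.' Do not guess."
    )

    def answer_with_sources(question: str, abstracts: list[str]) -> dict:
        # Reuses retrieve(), build_prompt(), and call_llm() from the first sketch.
        context = retrieve(question, abstracts)
        # Prepend the abstention instruction, and hand back the retrieved
        # passages so the answer can be checked against its sources.
        prompt = REFUSAL_HINT + "\n\n" + build_prompt(question, context)
        return {"answer": call_llm(prompt), "sources": context}

A prompt-level instruction is a much weaker guarantee than training the model to refuse, but it costs nothing to try and pairs naturally with showing the sources.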

Lesson 3: From the context, we can see that both “magnon” (though not the more specific term “electromagnon” from the query) and “ferron” appear shortly before the quoted passage. A generative language model creates a response by looking at the preceding text and producing the continuation that is statistically most likely to follow. Large language models like GPT-4 distinguish themselves from earlier, smaller language models by their ability to look well beyond the immediate context, but they still work by statistical association rather than genuine understanding, so in retrospect, the fact that the system blurred the distinction between two closely related terms appearing near each other is not all that surprising. We should remember that, despite their impressive capabilities, current language models remain quite limited and naive in their ability to understand the nuances of human language.

Building a truly robust RAG system will likely require combining multiple strategies, rather than relying on any one fix. For additional strategies, see the references below, especially Zhang et al. 2023,9 and Elaraby et al. 2023.11

References

  1. Bauer, Gerrit E. W., Ping Tang, Ryo Iguchi, and Ken-ichi Uchida. “Magnonics vs. Ferronics.” Journal of Magnetism and Magnetic Materials 541 (January 1, 2022): 168468. https://doi.org/10.1016/j.jmmm.2021.168468.
  2. Lewis, Patrick, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” In Proceedings of the 34th International Conference on Neural Information Processing Systems, 9459–74. NIPS ’20. Red Hook, NY, USA: Curran Associates Inc., 2020.
  3. Lin, Stephanie, Jacob Hilton, and Owain Evans. “TruthfulQA: Measuring How Models Mimic Human Falsehoods.” In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, 3214–52. Dublin, Ireland: Association for Computational Linguistics, 2022. https://doi.org/10.18653/v1/2022.acl-long.229.
  4. Massarelli, Luca, Fabio Petroni, Aleksandra Piktus, Myle Ott, Tim Rocktäschel, Vassilis Plachouras, Fabrizio Silvestri, and Sebastian Riedel. “How Decoding Strategies Affect the Verifiability of Generated Text.” In Findings of the Association for Computational Linguistics: EMNLP 2020, edited by Trevor Cohn, Yulan He, and Yang Liu, 223–35. Online: Association for Computational Linguistics, 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.22.
  5. Maynez, Joshua, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. “On Faithfulness and Factuality in Abstractive Summarization.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, edited by Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, 1906–19. Online: Association for Computational Linguistics, 2020. https://doi.org/10.18653/v1/2020.acl-main.173.
  6. OpenAI. “GPT-4V(ision) System Card,” 2023. https://openai.com/research/gpt-4v-system-card
  8. Shuster, Kurt, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. “Retrieval Augmentation Reduces Hallucination in Conversation.” In Findings of the Association for Computational Linguistics: EMNLP 2021, edited by Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, 3784–3803. Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021. https://doi.org/10.18653/v1/2021.findings-emnlp.320.
  9. Zhang, Hanning, Shizhe Diao, Yong Lin, Yi R. Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. “R-Tuning: Teaching Large Language Models to Refuse Unknown Questions.” arXiv, November 16, 2023. https://doi.org/10.48550/arXiv.2311.09677.
  11. Elaraby, Mohamed, Mengyin Lu, Jacob Dunn, Xueying Zhang, Yu Wang, Shizhu Liu, Pingchuan Tian, Yuping Wang, and Yuxuan Wang. “Halo: Estimation and Reduction of Hallucinations in Open-Source Weak Large Language Models.” arXiv, September 13, 2023. http://arxiv.org/abs/2308.11764.

