
Artificial Intelligence Hallucinations:
Managing Content Reliability in Automated Learning Material Generation

March 2025

Initial overview

In March 2024, we announced that we would be bringing the ability to create courses from any PDF document to our community.

In the first couple of weeks after the release, our users were highly satisfied with their courses. We received feedback that it was a relief to create courses with an Artificial Intelligence (AI) tool, as they could now make rapid use of their own resources while having clear expectations about the accuracy of the outputs.

Yet our users also did not shy away from pointing out problems when they encountered them. We received messages about odd and unhelpful content in the created courses: overly long sentences and deviations in the accuracy of the generated information. As we hold ourselves to high standards of accountability, we appreciated the feedback and took it seriously. Now we want to share a clear sense of what happened, why it matters, and the steps we have taken.

Generative AI landscape and Mini-Course Generator

For the past couple of years, big companies such as OpenAI, Meta, Google, and Anthropic have led rapid developments in Generative AI (Gen AI), making the generation of multi-modal outputs such as text, images, and audio notably accessible across domains spanning from education to entertainment. Like hundreds of other applications, we envisioned a near future in which educational content creators could efficiently produce resources by leveraging these technologies, while using their own expertise to scale the creation of learning materials.

With that admittedly optimistic road map in mind, we started with a feature powered by established large language model (LLM) APIs as the sole knowledge source for course creation. Along the way, our community's voice highlighted that creating a course from their own resources would be valuable, so we adopted what was then the prevailing framework for this: a Retrieval-Augmented Generation (RAG) system. RAG systems are specifically designed to ground an LLM's responses in a specialized external information source that supplements the model's internal knowledge, making its outputs more accurate and, ultimately, more trustworthy. Even so, we discovered that these systems are not immune to the growing challenge of 'hallucinating' incorrect or implausible information.

Figure 1. An overview of the standard retrieval-augmented generation framework. Given a user query/prompt, the typical process consists of two phases: retrieval and content generation. (1) During the retrieval phase, algorithms search for and retrieve chunks of information relevant to the user's prompt from the documents in the knowledge base. (2) In the content generation phase, the retrieved texts are passed to the language model, which combines them with its internal training data to synthesize a response to the user's prompt. Although this method was proposed as a potential solution to the hallucination problem, in practice we saw that even RAG systems are not hallucination-free, and we identified several challenges.
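To make the two phases concrete, the sketch below walks through a retrieve-then-generate loop over an in-memory list of chunks. It is illustrative only, not our production pipeline: the spaCy-based embedder and the helper names (embed, retrieve, build_prompt) are assumptions made for this example.

```python
# Minimal retrieve-then-generate sketch (illustrative; not our production code).
# Uses spaCy's en_core_web_md model purely as a stand-in embedder.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

def embed(text: str) -> np.ndarray:
    """Turn a text chunk into a dense vector."""
    return nlp(text).vector

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Phase 1: rank knowledge-base chunks by similarity to the query, keep top-k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Phase 2: assemble the retrieved context into the prompt sent to the LLM."""
    return "Context:\n" + "\n\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"
```

In a real pipeline the chunks live in a vector database rather than a Python list, but the failure mode is the same: if the ranking step surfaces the wrong chunks, the model generates from the wrong context.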
Figure 2. The detailed RAG pipeline with re-ranking was intended to enhance the standard RAG framework by introducing an additional refinement layer between the initial retrieval and generation phases. The re-ranking component was designed to employ semantic analysis to score and filter retrieved documents based on factors like query-document relevance, content quality, and cross-document coherence. This intermediate layer was meant to serve as a quality-control mechanism, ensuring that only the most relevant and reliable context would reach the language model for response generation. In practice, however, our re-ranking mechanism proved inadequate at properly weighing and prioritizing relevant content. More specifically, the triple challenge of unreliable retrieval, limited semantic understanding, and ineffective re-ranking created conditions in which the LLM had to generate responses from inadequate or misaligned information, increasing the likelihood of hallucinations in the output.
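The re-ranking layer in Figure 2 can be sketched in a similarly compact way. This post does not name the re-ranker we used, so the cross-encoder model below is an assumed example of how such a refinement layer is commonly implemented, not a description of our exact setup.

```python
# Illustrative re-ranking layer (assumed example; not necessarily the model we used).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], k: int = 3) -> list[str]:
    """Score each (query, candidate) pair jointly and keep the k best passages."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:k]]
```

A cross-encoder reads the query and the passage together, which typically ranks more accurately than comparing independently computed embeddings; even so, as the caption above notes, an extra refinement layer did not prevent information dilution in our pipeline.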

To be more specific, in our implementation we encountered several significant technical challenges that highlighted the limitations of our RAG pipeline, mostly arising in the steps between retrieval and augmentation:

  1. Vector database retrieval problems: Our initial experiments revealed significant challenges in the foundational retrieval phase, where the vector database consistently failed to retrieve contextually appropriate texts. Despite our best efforts at optimization, the semantic similarity scores were not reliably identifying the most relevant content, forcing the system to work with incomplete or inappropriate context. This fundamental retrieval issue significantly increased the risk of hallucinations in the generated responses.
  2. Semantic relationship limitations: The root cause of these retrieval problems became clear as we discovered that relying on numerical representations (i.e., vectors) for semantic understanding was insufficient for capturing nuanced conceptual relationships, leading to misinterpretations and incorrect inferences in the generated responses.
  3. Re-ranking issues: Our re-ranking mechanism proved inadequate at properly weighing and prioritizing relevant content, leading to suboptimal content selection and organization. It struggled to balance content relevance when dealing with complex prompts or diverse document collections. The result was information dilution, where marginally relevant content was prioritized over more pertinent information, ultimately degrading the quality of the generated responses.
Figure 3. The term "AI hallucinations" emerged to describe the occurrence of inaccurate, false, or out-of-context information in the outputs of Gen AI and AI-incorporating systems; despite our implementation of a RAG architecture, these issues persisted in our system. Subject matter experts as well as industry partners in the Gen AI community have suggested that hallucinations are an inherent tendency of LLMs and cannot be stopped entirely, only limited. (Image by Osarugue Igbinoba | Unsplash)

Given these challenges in applying RAG to learning material generation, we decided to adopt a more pragmatic approach: title-based associations combined with a consistency-checker step that could better constrain potential extrinsic hallucinations, and we optimized our learning material creation process around it.

Evaluating factual consistency: An approach to managing hallucinations

Having identified the limitations of the RAG system we used and established our approach based on title-based associations, we next focused on an additional method for managing hallucinations. Beginning with a review of both industry stakeholder experiences and current research on factual consistency verification, we identified several candidate solutions to the hallucination problem. Although there is no consensus on a definitive solution, we saw that comparing generated text against its source material for similarity is feasible. Perfect fact-checking would require deep semantic understanding, but quantifying the alignment between generated text and its source can serve as a proxy for factual consistency. This led us to implement a systematic text comparison approach that converts textual information into numerical representations, allowing us to produce a factual consistency score as an automated measure of content fidelity to its source.

Our solution implemented this approach through three key steps:

  1. Text cleaning, which prepares the content by removing stop-words (common words like "the," "is," "at," and "which" that carry little semantic meaning) and formatting inconsistencies (such as inconsistent capitalization, punctuation, and extra whitespace) so that the comparison focuses on content-carrying terms.
  2. Semantic encoding, which employs spaCy’s (an open-source natural language processing library) language model to transform the cleaned text into numerical representations to enable a mathematical comparison of textual content.
  3. Similarity assessment, which applies cosine similarity calculations (a mathematical measure that determines how similar two vectors are by calculating the cosine of the angle between them, resulting in a score between 0 and 1, where 1 indicates perfect similarity) to quantify the alignment between source and generated content.
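Taken together, these three steps can be sketched roughly as follows. This is a minimal illustration of the approach, assuming spaCy's en_core_web_md model; the exact model and cleaning rules in our implementation may differ (see the GitHub repository referenced in Figure 4 for the actual details).

```python
# Minimal sketch of the three-step consistency check (illustrative only).
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # medium-size model that ships with word vectors

def clean(text: str) -> str:
    """Step 1: lowercase, then drop stop-words, punctuation, and extra whitespace."""
    doc = nlp(text.lower())
    return " ".join(tok.text for tok in doc if not tok.is_stop and not tok.is_punct)

def consistency_score(source: str, generated: str) -> float:
    """Steps 2 and 3: encode both texts as vectors and compare them via cosine similarity."""
    src_vec = nlp(clean(source)).vector
    gen_vec = nlp(clean(generated)).vector
    cos = np.dot(src_vec, gen_vec) / (np.linalg.norm(src_vec) * np.linalg.norm(gen_vec) + 1e-9)
    return float(cos)  # close to 1 indicates strong alignment with the source
```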


By leveraging natural language processing techniques, we aimed to identify semantic similarities even when different vocabulary or phrasing is used. As an illustrative example, consider a source text about the Renaissance period and its AI-generated explanation (see Figure 4). The method can recognize semantic alignment despite variations in phrasing (such as "cultural rebirth" versus "cultural revival" and "classical art" versus "ancient Greek and Roman works"), while still identifying potential factual inconsistencies. This demonstrates the approach's potential for automated consistency checking in educational content generation, particularly when creating explanations that need to maintain factual accuracy while adapting to different comprehension levels.

Figure 4. On the left side, you see a source passage from a course material: “The Renaissance was a period of cultural rebirth in Europe that began in Italy during the 14th century. This movement was characterized by renewed interest in classical art and learning, leading to significant advances in art, architecture, and science.” The right side shows the AI-generated content when prompted with “Explain the key characteristics of the Renaissance period in simple terms.” The generated output is as follows: “The Renaissance started in Italy in the 1300s and was a time when European culture experienced a major revival. People became very interested in studying ancient Greek and Roman works, which sparked big developments in things like painting, building design, and scientific discovery.” The abovementioned approach generated a similarity score of 0.8996, indicating strong factual alignment with the source text while maintaining accessible language. You can visit our method’s scaled version to try it with basic inputs and get a glimpse of what’s happening behind the scenes of our course creation process. You can also examine our implementation details in our GitHub repository.
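As a usage note, running the consistency_score sketch above on the two Renaissance passages from Figure 4 looks like this; the exact number depends on the model and cleaning rules, so it will not necessarily match 0.8996 to the last digit.

```python
source = (
    "The Renaissance was a period of cultural rebirth in Europe that began in Italy "
    "during the 14th century. This movement was characterized by renewed interest in "
    "classical art and learning, leading to significant advances in art, architecture, "
    "and science."
)
generated = (
    "The Renaissance started in Italy in the 1300s and was a time when European culture "
    "experienced a major revival. People became very interested in studying ancient Greek "
    "and Roman works, which sparked big developments in things like painting, building "
    "design, and scientific discovery."
)

print(f"Consistency score: {consistency_score(source, generated):.4f}")
# Scores close to 1 are treated as strong factual alignment with the source text.
```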

Improvements and ongoing challenges 

Building on our approach of combining title-based associations with textual similarity measurement, we then designed a testing strategy for our integrated solution. This strategy involved examining both typical educational content and intentionally challenging edge cases, simulating various user interaction patterns and thought processes to assess the system’s effectiveness. While we can speculate about how users might interact with this combined approach based on their needs, existing knowledge, and states of mind, we recognize that real-world implementation will provide the true test of our solution’s value in managing hallucinations.

To provide evidence of our integrated approach’s effectiveness, we compiled a snippet of initial test results showing similarity scores across different types of source materials and their corresponding AI-generated course versions in Table 1. As shown, our combined title-based retrieval and consistency checking approach demonstrated measurable performance across varying content complexity and subject matter.

Original source (PDF version) | Course link | Consistency Score (0–1)
Libre Office | Basics of Libre Office | 0.9827
Aspirin Instructions | Aspirin's Instructions for Use | 0.9886
Stories from the Bible | Ten Stories from the Bible | 0.9361
Residency Guide | E-Residency Guide | 0.9882
World Economic Outlook | World Economic Outlook Update on Global Growth | 0.9891

Table 1. As shown in the similarity scores above, our approach demonstrates potential across diverse content types, from technical documentation to economic reports. However, we are cognizant that while this method provides a quick, automated way to compare texts for factual consistency, making it useful for fact-checking, content validation, and evaluating AI-generated text, it does not truly understand facts. It measures semantic similarity, meaning it can detect shifts in wording but may still miss deeper inaccuracies or subtle misinformation.


Though our proposed integrated solution won't completely eliminate inconsistencies, initial testing results suggest it offers a feasible and measurable step forward in assessing factual consistency between generated texts and their original sources. The combination of title-based associations and systematic consistency checking works complementarily: the former improves the retrieval of relevant content, while the latter checks factual alignment. Together they show particular promise for learning material generation and address the limitations we encountered with traditional RAG approaches.

Wrap up

As the team behind Mini-Course Generator, we have taken systematic steps to address the challenge of hallucinations in AI-generated educational content. Our journey evolved from implementing a traditional RAG system to developing a more focused approach combining title-based associations with a quantitative consistency-checking mechanism. This integrated solution enables us to measure semantic similarity across diverse content types while maintaining closer control over content retrieval and generation.

While our initial test results demonstrate measurable improvements in content accuracy, we acknowledge that this is only the beginning of our efforts to enhance the reliability of AI-generated content. Our commitment to delivering accurate, trustworthy learning materials drives us to continuously evaluate and refine our approach while maintaining transparency about its current limitations. As we continue developing our methods, we value constructive feedback that helps improve both our technical implementation and our accountability to the educational communities we learn and grow with. This dedication to reliable knowledge creation remains central to our ongoing development of AI-assisted learning content generation.

References

Anthropic. (2025, February 24). Claude 3.7 Sonnet. Retrieved February 26, 2025, from https://www.anthropic.com/ne…

Banerjee, S., Agarwal, A., & Singla, S. (2024). LLMs will always hallucinate, and we need to live with this. arXiv preprint arXiv:2409.05746.

Explosion AI. (n.d.). spaCy: Industrial-strength Natural Language Processing in Python. https://spacy.io/

Google. (2024, December 11). Gemini 2.0. Retrieved February 18, 2025, from https://blog.google/te…/

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., … & Liu, T. (2025). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2), 1–55.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., … & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.

Martineau, K. (2023, August 22). What is retrieval-augmented generation? IBM Research Blog. Retrieved February 12, 2025, from https://research.ibm.com/b..

Mendelevitch, O., Bao, F., Li, M., & Luo, R. (2024, August 5). HHEM 2.1: A better hallucination detection model and a new leaderboard. Vectara. Retrieved February 13, 2025, from https://www.vectara.com/bl..

Meta. (2024, April 18). Llama 3. Retrieved February 18, 2025, from https://ai.meta.com/blog/meta-llama-3/

Jones, N. (2025, January 21). AI hallucinations can’t be stopped — but these techniques can limit them. Nature. Retrieved February 23, 2025, from https://www.nature.com/arti..

OpenAI. (2024, May 13). GPT-4o. Retrieved February 18, 2025, from https://openai.com/index/hello-gpt-4o/ 

Statista Research Department. (2024, November 12). AI software total product count in 2024. Statista. Retrieved February 20, 2025, from https://www.statista.com/stati..
