Artificial Intelligence Hallucinations:
Managing Content Reliability in Automated Learning Material Generation
March 2025
Initial overview
In March 2024, we announced that we would be bringing to our community the ability to create courses from any PDF document.
In the first couple of weeks after the release, satisfaction was high: users told us that it was a relief to create their courses with an Artificial Intelligence (AI) tool, as they could now put their own resources to use rapidly while holding clear expectations about the accuracy of the outputs.
Yet our users also did not shy away from pointing out problems as they encountered them. We received messages about odd and unhelpful content in the created courses: overly long sentences, along with deviations in the accuracy of the generated information. As we hold ourselves to high standards of accountability, we appreciated the feedback and took it seriously. Now, we want to share a clear sense of what happened, why it matters, and the steps we have taken.
Generative AI landscape and Mini Course Generator
Over the past couple of years, big companies such as OpenAI, Meta, Google, and Anthropic have led rapid developments in Generative AI (Gen AI), making the generation of multi-modal outputs (text, images, and audio) notably accessible across domains spanning education to entertainment. Like hundreds of other applications, we envisioned a near future in which educational content creators could efficiently create resources by leveraging these technologies, while also using their own expertise to scale the creation of learning materials. With that admittedly naive road map in mind, we started with a feature powered by established large language model (LLM) APIs as the sole knowledge source for course creation. Along the way, our community's voice highlighted that creating a course from their own resources would be valuable, so we adopted what was then the highest-standard framework: a Retrieval-Augmented Generation (RAG) system. RAG systems are specifically designed to ground an LLM's responses in a specialized external information source that supplements the model's internal knowledge, making outputs more accurate and, ultimately, more trustworthy. Even so, we discovered that these systems are not immune to the growing challenge of 'hallucinating' incorrect or implausible information.


To be more specific, in our implementation we encountered several significant technical challenges that highlighted the limitations of our RAG pipeline, largely stemming from the steps between retrieval and augmentation (a simplified sketch of the retrieval step follows this list):
- Vector database retrieval problems: Our initial experiments revealed significant challenges in the foundational retrieval phase, where the vector database consistently failed to retrieve contextually appropriate texts. Despite our best optimization efforts, semantic similarity scores did not reliably identify the most relevant content, forcing the system to work with incomplete or inappropriate context. This fundamental retrieval issue significantly increased the risk of hallucinations in the generated responses.
- Semantic relationship limitations: The root cause of these retrieval problems became clear as we discovered that relying on numerical representations (i.e., vectors) for semantic understanding was insufficient for capturing nuanced conceptual relationships, leading to misinterpretations and incorrect inferences in the generated responses.
- Re-ranking issues: Our re-ranking mechanisms proved inadequate at weighing and prioritizing relevant content, especially for complex prompts or diverse document collections. The result was information dilution, where marginally relevant content was prioritized over more pertinent information, ultimately degrading the quality of the generated responses.
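To make the failure mode concrete, the sketch below shows the kind of naive top-k vector retrieval we are describing. It is a simplified illustration rather than our production pipeline: nothing in it checks whether the best-scoring chunk is actually relevant, so weak matches are passed to the generator as if they were authoritative context.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: (a . b) / (|a| |b|)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def retrieve_top_k(query_vec, chunk_vecs, chunks, k=3):
    """Rank document chunks by cosine similarity to the query embedding.

    Without a relevance threshold, the top k chunks are returned even when
    none of them is a good semantic match for the query, which is exactly
    the situation that increased hallucination risk in our experiments.
    """
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]
```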

Given these challenges in implementing RAG for learning material generation, we decided to adopt a more pragmatic approach: title-based associations combined with a consistency-checking step that could better constrain potential extrinsic hallucinations. We then optimized our learning material creation process around this design.
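We do not detail the internals of title-based associations in this post, but to give a rough sense of the idea, the hypothetical sketch below pairs each generated course section with the source section that shares its heading, so that any downstream consistency check always compares a generated text against the passage it is meant to reflect. All names and structures here are ours, for illustration only.

```python
def normalize_title(title: str) -> str:
    """Normalize headings so trivially different titles still match."""
    return " ".join(title.lower().split())

def associate_by_title(source_sections: dict[str, str],
                       generated_sections: dict[str, str]) -> list[tuple[str, str, str]]:
    """Pair each generated section with the source section sharing its title.

    Hypothetical illustration: returns (title, source_text, generated_text)
    triples that a consistency checker can then score pairwise.
    """
    source_index = {normalize_title(t): body for t, body in source_sections.items()}
    pairs = []
    for title, generated in generated_sections.items():
        source = source_index.get(normalize_title(title))
        if source is not None:
            pairs.append((title, source, generated))
    return pairs
```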
Evaluating factual consistency: An approach to managing hallucinations
Having identified the limitations of the RAG system we had utilized and established our approach using title-based associations, we next focused on an additional method for managing hallucinations. Beginning with a review of both industry stakeholder experiences and current research on factual consistency verification, we identified several candidate solutions to the hallucination problem. Though there is not yet consensus on a definitive solution, we observed that comparing generated text against its source material for similarity is feasible in practice. Although perfect fact-checking would require deep semantic understanding, we determined that quantifying the alignment between generated text and its source could serve as a proxy for factual consistency for our users. This led us to implement a systematic text comparison approach that converts textual information into numerical representations, enabling us to produce a factual consistency score as an automated measure of content fidelity to its source.
Our solution implemented this approach through three key steps (a minimal code sketch follows the list):
- Text cleaning, which prepares the content by removing stop-words (common words like “the,” “is,” “at,” and “which” that carry little semantic meaning) and formatting inconsistencies (such as inconsistent capitalization, punctuation, and extra whitespace) so that the comparison focuses on content-carrying terms.
- Semantic encoding, which employs spaCy’s (an open-source natural language processing library) language model to transform the cleaned text into numerical representations to enable a mathematical comparison of textual content.
- Similarity assessment, which applies cosine similarity calculations (a mathematical measure that determines how similar two vectors are by calculating the cosine of the angle between them, yielding a score between −1 and 1, and typically between 0 and 1 for text embeddings, where 1 indicates perfect similarity) to quantify the alignment between source and generated content.
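Putting the three steps together, a minimal sketch of the scoring routine might look like the following. This is an illustration rather than our production code: we assume the en_core_web_md model here simply because it ships with word vectors, and the cleaning logic is deliberately simplified.

```python
import numpy as np
import spacy

# A spaCy model with word vectors is required for meaningful similarity;
# "en_core_web_md" is one such model (the exact model is an assumption,
# not necessarily the one used in production).
nlp = spacy.load("en_core_web_md")

def clean(text: str) -> str:
    """Step 1: strip stop-words, punctuation, and whitespace noise."""
    doc = nlp(text.lower())
    return " ".join(tok.text for tok in doc
                    if not tok.is_stop and not tok.is_punct and tok.text.strip())

def encode(text: str) -> np.ndarray:
    """Step 2: encode text as spaCy's document vector (mean of token vectors)."""
    return nlp(text).vector

def consistency_score(source: str, generated: str) -> float:
    """Step 3: cosine similarity, cos(theta) = (a . b) / (|a| |b|)."""
    a, b = encode(clean(source)), encode(clean(generated))
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```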
By leveraging natural language processing techniques, we aimed to identify semantic similarities even when different vocabulary or phrasing is used. As an illustrative example, consider a source text about the Renaissance period and its AI-generated explanation (see Figure 4). The method can recognize semantic alignment despite variations in phrasing (such as “cultural rebirth” versus “cultural revival” and “classical art” versus “ancient Greek and Roman works”), while still identifying potential factual inconsistencies. This demonstrates the approach’s potential for automated consistency checking in educational content generation, particularly when creating explanations that need to maintain factual accuracy while adapting to different comprehension levels.
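Continuing from the sketch above, the Renaissance example can be checked like this. The two snippets are abridged stand-ins for the Figure 4 texts, and the resulting score depends on the model used, so treat the output as illustrative.

```python
source = ("The Renaissance was a period of cultural rebirth in Europe, "
          "marked by a renewed interest in classical art.")
generated = ("The Renaissance was a European cultural revival, driven by "
             "renewed interest in ancient Greek and Roman works.")

# A high score reflects semantic alignment despite different phrasing
# ("cultural rebirth" vs. "cultural revival"); a low score flags a
# candidate inconsistency for human review.
print(f"Consistency score: {consistency_score(source, generated):.4f}")
```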

Improvements and ongoing challenges
Building on our approach of combining title-based associations with textual similarity measurement, we then designed a testing strategy for our integrated solution. This strategy involved examining both typical educational content and intentionally challenging edge cases, simulating various user interaction patterns and thought processes to assess the system’s effectiveness. While we can speculate about how users might interact with this combined approach based on their needs, existing knowledge, and states of mind, we recognize that real-world implementation will provide the true test of our solution’s value in managing hallucinations.
To provide evidence of our integrated approach’s effectiveness, we compiled a sample of initial test results in Table 1, showing similarity scores across different types of source materials and their corresponding AI-generated courses. As shown, our combined title-based retrieval and consistency checking approach demonstrated consistent performance across varying content complexity and subject matter.
| Original source (PDF version) | Course link | Consistency score (0–1) |
| --- | --- | --- |
| Libre Office | Basics of Libre Office | 0.9827 |
| Aspirin Instructions | Aspirin’s Instructions for Use | 0.9886 |
| Stories from the Bible | Ten Stories from the Bible | 0.9361 |
| Residency Guide | E-Residency Guide | 0.9882 |
| World Economic Outlook | World Economic Outlook Update on Global Growth | 0.9891 |
Table 1. As the similarity scores above show, our approach demonstrates potential across diverse content types, from technical documentation to economic reports. We are cognizant, however, that while this method provides a quick, automated way to compare texts for factual consistency (making it useful for rapid checks, content validation, and evaluating AI-generated text), it does not truly understand facts. It measures semantic similarity, meaning it can detect shifts in wording but may still miss deeper inaccuracies or subtle misinformation.
Though our proposed integrated solution will not completely eliminate inconsistencies, initial testing suggests it offers a feasible and measurable step forward in assessing factual consistency between generated texts and their original sources. The combination of title-based associations and systematic consistency checking works complementarily: the former improves the retrieval of relevant content, while the latter checks factual alignment. The approach shows particular promise for learning material generation and addresses the limitations we encountered with traditional RAG approaches.
Wrap up
As Mini Course Generator, we have taken systematic steps to address the challenge of hallucinations in AI-generated educational content. Our journey evolved from implementing a traditional RAG system to developing a more focused approach combining title-based associations with a quantitative consistency checking mechanism. This integrated solution enables us to measure semantic similarity across diverse content types while maintaining closer control over content retrieval and generation.
While our initial test results demonstrate measurable improvements in content accuracy, we acknowledge that this represents only the beginning of our efforts to enhance the reliability of AI-generated content. Our commitment to delivering accurate, trustworthy learning materials drives us to continuously evaluate and refine our approach while maintaining transparency about its current limitations. As we continue developing our methods, we value constructive feedback that helps improve both our technical implementation and our accountability to the educational communities we learn and grow with. This dedication to reliable knowledge creation remains central to our ongoing development of AI-assisted learning content generation.
References
Anthropic. (2025, February 24). Claude 3.7 Sonnet. Retrieved February 26, 2025, from https://www.anthropic.com/ne…
Banerjee, S., Agarwal, A., & Singla, S. (2024). LLMs will always hallucinate, and we need to live with this. arXiv preprint arXiv:2409.05746.
Explosion AI. (n.d.). spaCy: Industrial-strength Natural Language Processing in Python. https://spacy.io/
Google. (2024, December 11). Gemini 2.0. Retrieved February 18, 2025, from https://blog.google/te…/
Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., … & Liu, T. (2025). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2), 1-55.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., … & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459-9474.
Martineau, K. (2023, August 22). What is retrieval-augmented generation? IBM Research Blog. Retrieved February 12, 2025, from https://research.ibm.com/b..
Mendelevitch, O., Bao, F., Li, M., & Luo, R. (2024, August 5). HHEM 2.1: A better hallucination detection model and a new leaderboard. Vectara. Retrieved February 13, 2025, from https://www.vectara.com/bl..
Meta. (2024, April 18). Llama 3. Retrieved February 18, 2025, from https://ai.meta.com/blog/meta-llama-3/
Jones, N. (2025, January 21). AI hallucinations can’t be stopped — but these techniques can limit them. Nature. Retrieved February 23, 2025, from https://www.nature.com/arti..
OpenAI. (2024, May 13). GPT-4o. Retrieved February 18, 2025, from https://openai.com/index/hello-gpt-4o/
Statista Research Department. (2024, November 12). AI software total product count in 2024. Statista. Retrieved February 20, 2025, from https://www.statista.com/stati..