Lessons Learned: Efficient Retrieval-Augmented Generation (RAG) edit01

Lessons Learned: Efficient Retrieval-Augmented Generation (RAG)

As developers working with AI, we often balance innovation and practicality. Retrieval-Augmented Generation (RAG) combines large language models (LLMs) with document retrieval to create accurate and context-aware responses. Here are some of the lessons learned while implementing a RAG system. This blog post is mainly based on code I have written to experiment with RAG concepts and find out what works and what does not. You can find the code here.

General program flow

The document embedding flow and preprocessing of documents/metadata can be described as follows;

The process to answer questions based on data in the vector DB can be described using the following diagram.

Lessons Learned: Efficient Retrieval-Augmented Generation (RAG) rag.drawio 1

Initial setup

The code uses HTML documents from a crawled_content folder. This data is scraped from a website with the following script here. If you want to completely run the sample code on your own machine, you need to download ollama here and execute ollama pull mistral:instruct and ollama pull mixtral:8x22b-instruct-v0.1-q3_K_S (mind this requires 64Gb of RAM, choose a different model is your machine has lower specs such as this model of which quantized versions can run on 32Gb). When first running the script, the embedding model mixedbread-ai/mxbai-embed-large-v1 is pulled from Hugging Face resulting in a completely local set-up. Before running the script install required dependencies (pip install openai llama-index chromadb beautifulsoup4). Python 3.11 or similar is expected.


Role of the different models used (Hugging Face references):

  • mistralai/Mixtral-8x22B-Instruct-v0.1
    This is a general purpose LLM for evaluating relevance of documents, generating and answering subquestions and the main question based on answers of the subquestions.
  • mixedbread-ai/mxbai-embed-large-v1
    Model used for generating vectors from documents and retrieval questions. Chroma is used to perform semantic queries on the generated vectors based on a query vector.
  • mistralai/Mistral-7B-Instruct-v0.2
    Small model used for generating titles which are saved as metadata in the vector DB. This model is small and light (7B) thus quite fast. This helps, especially when there are a lot of documents for which this needs to happen.

Model hosting considerations

Embeddings convert text into numerical representations, ensuring high-quality vector representations that are essential for accurate document retrieval. Using specialized models for embeddings significantly enhances the relevance of retrieved documents, directly impacting the effectiveness of the Retrieval-Augmented Generation (RAG) system.

When loading a model, you can use Ollama to host it and communicate via the OpenAI-compliant API. This works well for large language models (LLMs) but is less effective for models specialized in generating embeddings (converting documents to vectors for semantic searching). Currently, Ollama does not support an OpenAI-compliant embeddings API, meaning you cannot use OpenAI libraries for this purpose. Additionally, the OpenAI library offers a fixed list of available embedding models.

I encountered issues with retrieving documents semantically related to my queries, likely due to limitations on the Ollama side (I tried this with both LlamaIndex and LangChain libraries). The requests to the embedding model were relatively small, but there was significant overhead in hosting the model and using the API from the code. Eventually, I decided to use the Hugging Face libraries to host the embedding model directly within my code. This not only yielded better results but also significantly improved performance.

Finding an embedding model that works for you is not as straightforward as it may seem. You often face the following challenges:

  1. The current Hugging Face library (and by extension, the LlamaIndex Hugging Face library) does not support GGUF embedding models. GGUF models are quantized, significantly smaller, and require less hardware to run.
  2. The best models often require substantial resources and have long inference times. For example, Salesforce/SFR-Embedding-2_R requires considerable RAM and is relatively heavy. In contrast, mixedbread-ai/mxbai-embed-large-v1 performs slightly worse but requires less than a tenth of the resources. This distinction is crucial in cloud environments where you pay for resource usage. When you need to vectorize many documents, mixedbread-ai/mxbai-embed-large-v1 is also much less CPU-intensive.
  3. Most models have specific usage instructions to take into account. For example vector length, length of document chunks they can convert at a time, specific instruction templates for queries, just to mention a few. Thus switching from one model to the other likely will require some specific configuration to make it work.

Processing documents


The script processes documents by reading and preprocessing them, including parsing HTML files to extract clean text. Techniques include cleaning HTML content, removing duplicate lines, and filtering out irrelevant information to ensure only the most pertinent text is used for embedding. The preprocessing step also enhances the documents with metadata, such as generated titles using an LLM and HTML titles, which improves retrieval performance.

Semantic splitting

Splitting documents into smaller, semantically coherent chunks ensures precise embeddings and efficient retrieval. Known as semantic splitting, this approach breaks down documents into manageable pieces that retain their contextual richness. Splitting documents half-way a sentence or paragraph can cause loss of meaning and by semantic splitting you can avoid this.


To avoid the inefficiency of repeatedly loading and unloading models or keeping multiple models active in memory (which is not easily possible with ollama), batching is implemented. Processing documents and queries in batches ensures that models are loaded once per batch, processed in bulk, and then unloaded. This approach optimizes resource usage and improves processing efficiency, especially when using multiple models for specific processing tasks.

Performing semantic search and validating results

The processed documents, along with their embeddings and metadata, are indexed into a vector database. When a query is made, the system transforms the query to improve retrieval relevance and generates embeddings to find matching document chunks in the database. Retrieved documents are validated for relevance using the LLM (Mixtral 8x22B). Semantic matches in a vector database do not necessarily mean the information is relevant for answering a query. Using LLMs to validate the relevance of retrieved documents ensures only relevant information is considered. This ensures efficient use of LLM context.

Divide and conquer

The main query is broken down into subquestions, with the script retrieving and validating documents for each subquestion. The answers are combined, and the LLM generates a comprehensive final response. This diverge/converge approach ensures the RAG system delivers accurate, contextually rich responses efficiently.


To succeed with Retrieval-Augmented Generation (RAG), focus on query transformation, embedding optimization, data preprocessing, multi-stage processing, and data quality. This blog offers practical tips to enhance your RAG systems, aiming to help you deliver precise, contextually rich responses and ensure optimal performance in AI applications.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.