RAG Approach to Domain-Specific Question Answering


Mar 28 · 41 Min Read
Azure · RAG · AI Architecture · Azure AI Search · Generative AI · Vector Search · Enterprise AI

Retrieval-Augmented Generation System Architecture with Azure AI Search

Retrieval-Augmented Generation (RAG) is an architecture that combines a Large Language Model (LLM) with an external information retrieval system to ground the model’s answers on specific data (RAG and generative AI - Azure AI Search | Microsoft Learn). In our context, the goal is to answer questions about a company’s internal policies (e.g. cloud security measures) by augmenting an LLM (such as GPT-4) with the company’s private documents (employee handbooks, policy documents, etc.). By using RAG, the LLM’s responses are based on actual internal content, improving accuracy and reducing hallucinations (RAG and generative AI - Azure AI Search | Microsoft Learn).

This solution will leverage Microsoft Azure’s services for a robust, enterprise-ready implementation. We use Azure AI Search (formerly Azure Cognitive Search) as the core of our retrieval system for indexing and querying policy documents. Azure AI Search is well-suited for RAG because it supports semantic search and vector similarity search, along with Azure’s security and scale (Azure AI search for RAG use cases | by Anurag Chatterjee | Medium). In addition, Azure OpenAI Service provides the LLM (GPT-4) and embedding models needed. Azure Blob Storage will store the source documents, and Azure Functions will be used to orchestrate processes like data ingestion or to implement custom logic. The end result is an architecture where a user’s question is answered by GPT-4 using relevant excerpts from internal policies, retrieved on-the-fly from an Azure AI Search index.

System Architecture Overview

Key Components and Azure Services:

  • Azure AI Search – the retrieval engine; it indexes the policy document chunks and serves keyword, vector, and hybrid queries with semantic ranking.
  • Azure OpenAI Service – hosts the GPT-4 chat model that generates answers and the embedding model (e.g. text-embedding-ada-002) used to vectorize content and queries.
  • Azure Blob Storage – stores the source documents (employee handbooks, policy documents, etc.) that feed the ingestion pipeline.
  • Azure Functions – provides orchestration and glue logic, such as triggering ingestion when new files arrive and coordinating the query-time flow between Search and OpenAI.

All these components work together to form the RAG system. The architecture ensures that when a user asks a question, the system retrieves the most relevant policy content and feeds it into GPT-4, which then crafts a domain-specific answer. The following sections describe this process in detail: from how documents are ingested and indexed, to how queries are processed and answered, and how Azure’s search capabilities compare to open-source alternatives.

Document Ingestion and Indexing Process

Ingestion is the process of taking raw internal documents and preparing them for efficient retrieval. This involves parsing files, splitting them into retrievable chunks, generating vector embeddings, and storing everything in the Azure AI Search index. The ingestion pipeline can be implemented using Azure AI Search indexers with a skillset or via custom code (e.g. Azure Functions). Key steps in the ingestion process include:

  1. Document Acquisition: The pipeline is triggered to ingest documents, either by pulling from a data source or pushing documents into the index. In our case, the data source is Azure Blob Storage containing policy documents. We can set up an Azure AI Search indexer connected to the blob container so that it automatically finds new or updated files. Alternatively, an Azure Function can watch the blob storage for new files (using Event Grid) and initiate processing. This flexibility allows content to be loaded or refreshed at the required frequency (RAG and generative AI - Azure AI Search | Microsoft Learn). Azure Search supports both push APIs and indexers; an indexer simplifies ingestion by automatically retrieving blobs and parsing their content (Azure AI search for RAG use cases | by Anurag Chatterjee | Medium).

  2. Document Parsing: Each document is parsed to extract its text content. Azure AI Search indexers have built-in document cracking for common formats (like PDF, DOCX, HTML, etc.), so they can extract text from these files without custom code (Choosing the right Azure Vector Database | by Michael John Peña | Medium). If documents are scanned images or need complex processing, Azure Cognitive Services (like Form Recognizer) or a custom Azure Function can be used to perform OCR and text extraction. The output of this step is the raw text of each document (often with basic metadata like filename or path).

  3. Text Chunking: The document text is split into smaller chunks or passages. Chunking is critical for RAG because LLMs have input length limits and because retrieving at a fine granularity improves accuracy. We aim for each chunk to be a semantically coherent unit (e.g. a paragraph, section, or answer to a single question) and to fit within embedding and prompt size limits. A common strategy is to use a fixed-size sliding window: for example, chunks of ~200-500 words with a 10-15% overlap so that context isn’t lost between chunks (Chunk documents in vector search - Azure AI Search | Microsoft Learn). Another strategy is logical chunking by document structure – for instance, using headings or sentence boundaries to split content where appropriate (Chunk documents in vector search - Azure AI Search | Microsoft Learn). Azure AI Search provides a Text Split skill that can split text by paragraphs, sentences, or by character length, including overlap configuration (Chunk documents in vector search - Azure AI Search | Microsoft Learn). It also offers a Document Layout skill which can chunk by semantic sections (using headings in PDFs/Word, etc.) (Chunk documents in vector search - Azure AI Search | Microsoft Learn). By chunking the content, we ensure each piece focuses on a single topic or point, which improves the relevance of search results and stays under the token limit of embedding models (Chunk documents in vector search - Azure AI Search | Microsoft Learn). (For reference, the OpenAI Ada embedding model allows up to ~8191 tokens of input (Chunk documents in vector search - Azure AI Search | Microsoft Learn), but smaller chunks tend to yield better retrieval performance than one huge embedding (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub).) A minimal code sketch covering this step and the next appears after this list.

  4. Embedding Generation: For each text chunk, we generate a vector embedding – a numeric representation of the chunk’s semantic content. We typically use Azure OpenAI’s embedding model (such as text-embedding-ada-002) for this. The embedding is a 1536-dimensional vector (for Ada-002) capturing the meaning of the text. If using Azure AI Search skillsets, we can add an AzureOpenAIEmbedding skill in the pipeline, which will call the OpenAI service to embed each chunk automatically (Integrated vectorization - Azure AI Search | Microsoft Learn). Alternatively, an Azure Function or external script can call the OpenAI API to get embeddings and then attach them to the chunk data. This step vectorizes the content: the text “Cloud data is encrypted at rest using AES-256” becomes a vector in high-dimensional space. Azure AI Search does not create embeddings on its own – we supply them (via skillset or API) before indexing (Choosing the right Azure Vector Database | by Michael John Peña | Medium). The output of this step is an embedding vector for each chunk, which will be stored in the search index.

  5. Enrich Metadata: (Optional but recommended) We attach relevant metadata to each chunk. Metadata can include the source document ID or title, section headings, or tags like document category (e.g., “Security Policy”). We might also generate a summary of the chunk or extract keywords as metadata fields (Azure Search’s cognitive skills can do key phrase extraction, if needed). This metadata is stored alongside the chunk and can be used for filtering and for providing context in the answer. For example, a chunk may have fields: content (the text), content_vector (the embedding), doc_title (“Cloud Security Policy”), section (“Encryption at Rest”), etc. The enrichment step ensures each chunk carries the necessary context to trace it back to source and to help rank it properly (Design and Develop a RAG Solution - Azure Architecture Center | Microsoft Learn).

  6. Indexing and Storage: Finally, the processed chunk (text + embedding + metadata) is persisted in an Azure AI Search index. Each chunk becomes a document in the search index. The index schema defines fields for the content, the vector, and metadata. For example, we define a field content of type Edm.String (searchable text), and a field content_vector of type Collection(Edm.Single) with dimension 1536 (for the embedding) and an attached vector search profile (Azure AI search for RAG use cases | by Anurag Chatterjee | Medium). This vector field is marked as searchable and retrievable, enabling similarity queries. The index is configured to allow vector search on content_vector (using Azure’s approximate k-NN algorithm, HNSW by default) and full-text search on content (using the BM25 algorithm by default). Azure AI Search supports defining a vector index with either HNSW (for efficient ANN search) or an exhaustive KNN search for smaller datasets (Choosing the right Azure Vector Database | by Michael John Peña | Medium). The index can also enable semantic ranking on the content (more on that shortly). If using an indexer with skillset, this indexing step happens automatically after enrichment: the indexer writes each enriched chunk to the search index (Integrated vectorization - Azure AI Search | Microsoft Learn). If using a custom pipeline, the Azure Function would use the Azure Search SDK/REST API to upload documents in batches (Azure AI search for RAG use cases | by Anurag Chatterjee | Medium). Once indexing is complete, the content is ready for retrieval.
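
As a concrete illustration of steps 3 and 4, the sketch below chunks raw text with a fixed-size sliding window and calls Azure OpenAI to embed each chunk. It is a minimal sketch, not a production pipeline: it assumes the openai Python package (v1.x), and the endpoint, key, API version, and deployment name (text-embedding-ada-002) are placeholders to replace with your own values.

```python
# Minimal sketch of chunking (step 3) and embedding (step 4).
# Assumes the openai v1.x SDK; endpoint, key, API version, and deployment name are placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)

def chunk_text(text: str, max_words: int = 400, overlap_words: int = 50) -> list[str]:
    """Fixed-size sliding window over words, with overlap so context carries across chunks."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap_words  # step back to create the overlap
    return chunks

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of chunks with the Azure OpenAI embedding deployment (1536 dims for ada-002)."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [item.embedding for item in response.data]

# Example: prepare one policy document for indexing.
document_text = open("cloud-security-policy.txt", encoding="utf-8").read()
chunks = chunk_text(document_text)
vectors = embed(chunks)
```

The same logic could live in an Azure Function, or be replaced entirely by the Text Split and AzureOpenAIEmbedding skills if the indexer-plus-skillset route is used.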

At the end of ingestion, Azure AI Search holds a secure, searchable index of the internal policy data. Each original document has been broken into many smaller indexed chunks, each with an associated embedding vector. This structure allows the system to retrieve just the relevant pieces of documents when a question is asked, rather than an entire file. The ingestion process can be scheduled or continuous. For instance, an indexer can run periodically or be triggered on new file uploads, so the index stays up-to-date with the latest policies. Azure AI Search is built for scale, capable of indexing millions of documents and handling frequent updates, which is important if the company has a large or evolving document set (RAG and generative AI - Azure AI Search | Microsoft Learn).
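
For reference, the chunk index described in step 6 could be declared with the azure-search-documents Python SDK roughly as follows. This is a sketch, assuming SDK version 11.4 or later; the service endpoint, keys, and the index, field, and profile names are illustrative placeholders.

```python
# Sketch of an index schema with a text field, a 1536-dim vector field, and metadata.
# Class names follow azure-search-documents 11.4+; endpoint, key, and names are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex, SimpleField, SearchableField, SearchField, SearchFieldDataType,
    VectorSearch, VectorSearchProfile, HnswAlgorithmConfiguration,
)

index_client = SearchIndexClient("https://<search-service>.search.windows.net",
                                 AzureKeyCredential("<admin-key>"))

fields = [
    SimpleField(name="chunk_id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="content", type=SearchFieldDataType.String),            # BM25 full-text search
    SearchableField(name="doc_title", type=SearchFieldDataType.String, filterable=True),
    SearchableField(name="section", type=SearchFieldDataType.String, filterable=True),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1536,                # matches text-embedding-ada-002
        vector_search_profile_name="vector-profile",
    ),
]

vector_search = VectorSearch(
    algorithms=[HnswAlgorithmConfiguration(name="hnsw-config")],  # approximate k-NN
    profiles=[VectorSearchProfile(name="vector-profile",
                                  algorithm_configuration_name="hnsw-config")],
)

index_client.create_or_update_index(
    SearchIndex(name="policy-chunks", fields=fields, vector_search=vector_search)
)
```

With the index in place, a custom pipeline pushes chunks in batches via SearchClient.upload_documents, while the indexer-plus-skillset route writes them automatically after enrichment.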

One of the powerful features we enable in Azure AI Search is semantic search, which in Azure’s terminology refers to an advanced AI-driven ranking of search results (also known as the semantic ranker). Semantic search improves result quality by understanding the intent of the query and the meaning of the documents, rather than relying only on keyword matching. Here’s how we configure and use it:

  • Enable semantic ranking on the search service (it is a premium capability configured at the service level).
  • Add a semantic configuration to the index that tells the ranker which fields hold the title, the main content, and any keywords – in our schema, doc_title and content.
  • At query time, set the query type to semantic and pass the configuration name. Azure AI Search first retrieves candidate results (via keyword, vector, or hybrid search) and then applies a deep-learning reranker to the top results, reordering them by how well they actually answer the query. The service can also return extractive captions and answers that highlight the most relevant sentences in each result.

In summary, semantic search in Azure AI Search is configured to understand the user’s query and the document meaning, giving us highly relevant passages. It works hand-in-hand with vector search in our architecture to retrieve the best possible context for the LLM.
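
To make that concrete, the sketch below shows roughly what the semantic configuration and a semantic query look like with the same Python SDK. It assumes the policy-chunks index sketched earlier; the configuration name and keys are illustrative.

```python
# Sketch: a semantic configuration for the index, and a query that uses the semantic ranker.
# Assumes azure-search-documents 11.4+ and the "policy-chunks" index sketched above.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes.models import (
    SemanticConfiguration, SemanticField, SemanticPrioritizedFields, SemanticSearch,
)

semantic_search = SemanticSearch(configurations=[
    SemanticConfiguration(
        name="policy-semantic-config",
        prioritized_fields=SemanticPrioritizedFields(
            title_field=SemanticField(field_name="doc_title"),
            content_fields=[SemanticField(field_name="content")],
        ),
    )
])
# ...assign semantic_search to the SearchIndex and call create_or_update_index() as before.

search_client = SearchClient("https://<search-service>.search.windows.net",
                             index_name="policy-chunks",
                             credential=AzureKeyCredential("<query-key>"))
results = search_client.search(
    search_text="How is customer data encrypted in the cloud?",
    query_type="semantic",
    semantic_configuration_name="policy-semantic-config",
    top=5,
)
for result in results:
    print(result["doc_title"], "->", result["content"][:80])
```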

Retrieval and Response Pipeline

Once the documents are indexed and the search service is ready, the runtime query flow (the retrieval and answer generation pipeline) works as follows. This is the core of how a user’s question gets turned into an answer with supporting data:

  1. User Question Submission: A user (e.g. an employee) asks a question through a client application or chatbot interface. For instance, the user might ask: “What are our policies for encrypting customer data in the cloud?”. This question is sent to the backend of our system for processing.

  2. Orchestrator Receives Query: The question is received by an orchestrator component. This could be a dedicated backend service, or an Azure Function set up for this purpose, or even logic in the client – but typically a server-side component handles it. The orchestrator’s job is to coordinate between Azure AI Search and the Azure OpenAI GPT-4 model. Using an orchestrator (which could be implemented with frameworks like Semantic Kernel, LangChain, or custom code) gives us control and the ability to add business logic. In Azure’s reference architecture, the orchestrator decides which type of search to perform (keyword, vector, or hybrid) and how to construct the prompt (Design and Develop a RAG Solution - Azure Architecture Center | Microsoft Learn).

  3. Formulate Search Query: The orchestrator formulates a query to Azure AI Search to retrieve relevant document chunks. Given our configuration, it will likely use a hybrid query with vector and keyword. Concretely, there are a couple of ways to do this (see the orchestrator sketch after this list):

    • The orchestrator first calls the embedding model to vectorize the question, then sends a single hybrid request to Azure AI Search that contains both the raw search text (for keyword matching) and the query vector (for similarity search against content_vector).
    • Alternatively, if the index is set up with an integrated vectorizer, the orchestrator sends only the question text and Azure AI Search generates the query embedding on the service side before running the same hybrid search.

  4. Retrieve Relevant Chunks: Azure AI Search executes the query and returns the top N results (chunks). Each result includes the chunk text (content) and any metadata (like doc_title and maybe a search score). Thanks to our indexing and search configuration:

    • the vector component surfaces chunks that are semantically related to the question even when the wording differs (a passage about “AES-256 encryption at rest” matches a question about “encrypting customer data”);
    • the keyword component catches exact terms, acronyms, and product names that appear verbatim in the question;
    • and the semantic ranker reorders the combined results so that the passages most likely to answer the question appear at the top.

    We typically take the top 3-5 chunks from the search results as our context. (The number can be tuned based on observed answer quality and the length of chunks; we want enough coverage to answer the question but not so much text that we risk overloading the prompt or diluting the relevance.)

  5. Compose LLM Prompt with Retrieved Context: The orchestrator now prepares a prompt for GPT-4 that includes the user’s question and the retrieved context passages. One effective prompt structure is:

    • System message: A directive to the assistant, e.g. “You are an AI assistant that provides answers based on company policy documents. Answer the question using the provided information, and if the answer is not in the documents, say you don’t know.” This sets the stage and ensures the model sticks to the sources.
    • User message: This contains the user’s actual question and the context. For example, it may say: “User Question: What are our policies for encrypting customer data in the cloud? Relevant Excerpts: 1. [Excerpt from Cloud Security Policy]: ‘All customer data stored in cloud systems must be encrypted at rest using at least AES-256 encryption…’ 2. [Excerpt from Data Handling Guidelines]: ‘Customer data in transit should be encrypted using TLS 1.2 or higher…’”. We list each retrieved chunk, possibly prefaced by a title or source identifier for clarity. The prompt is constructed carefully to fit within the model’s token limit (8k or 32k tokens of context for GPT-4, depending on the model version). Since our chunks were sized with token limits in mind and we only include the top few, this should be well within the limit.

    The model is thereby given all the information it needs: the question and the supporting content from internal docs. In some implementations, we also add a request like “Please cite the source of your answer based on the provided excerpts” if we want the model to indicate which document the information came from. (The model can do this by referring to the excerpt numbers or metadata we provided, though formatting citations is something we might handle post-response as well.)

  6. LLM (GPT-4) Generates Answer: We send the prompt to the Azure OpenAI GPT-4 model (via the Chat Completion API) and get back a response. Because the model has access to specific policy text, it will ground its answer on that. For example, GPT-4 might respond: “According to the Cloud Security Policy, all customer data in the cloud is encrypted at rest with at least AES-256. Additionally, the Data Handling Guidelines state that data in transit must use TLS 1.2 or above for encryption. This ensures that customer data is protected both when stored and when transmitted.” The answer will essentially be a summary or direct explanation of the retrieved excerpts, phrased in a helpful way for the user. The model’s role is to synthesize and present the information in the documents, using its natural language capabilities, but not to introduce new factual claims. By constraining it to the provided context, we maintain accuracy (the model is controlled by grounding data from the enterprise content (RAG and generative AI - Azure AI Search | Microsoft Learn)).

  7. Response Delivery: The orchestrator (or Azure OpenAI service, if using the built-in integration) returns the answer to the user. The user sees the final answer, and potentially we can also show the sources. For example, we might display a couple of snippets or a reference like “(Source: Cloud Security Policy)” so the user trusts the answer. This is a design choice – the system can just give the answer text, or include citations for transparency. Since the question is about internal policies, often the user might want to know which document the answer came from. Our metadata (like doc_title or a document ID per chunk) enables us to trace back and present a citation. Many RAG implementations provide these source links to build user trust.
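
Putting steps 3 through 6 together, a minimal orchestrator could look like the sketch below: it vectorizes the question, runs a hybrid query with semantic reranking, builds the grounded prompt from the top chunks, and calls the GPT-4 chat deployment. It reuses the client, search_client, and embed() helpers from the earlier sketches, and the deployment and configuration names remain placeholders.

```python
# Minimal orchestrator sketch: hybrid retrieval with semantic reranking + grounded GPT-4 answer.
# Reuses the client, search_client, and embed() helpers from the earlier sketches; names are placeholders.
from azure.search.documents.models import VectorizedQuery

def answer_question(question: str) -> str:
    # Step 3: vectorize the question and issue a hybrid query (keyword + vector),
    # letting the semantic ranker reorder the fused results.
    question_vector = embed([question])[0]
    results = search_client.search(
        search_text=question,
        vector_queries=[VectorizedQuery(vector=question_vector,
                                        k_nearest_neighbors=50,
                                        fields="content_vector")],
        query_type="semantic",
        semantic_configuration_name="policy-semantic-config",
        top=5,
    )

    # Steps 4-5: keep the top chunks and format them as numbered, attributed excerpts.
    excerpts = [f"{i + 1}. [{doc['doc_title']}]: {doc['content']}"
                for i, doc in enumerate(results)]
    if not excerpts:
        return "I'm sorry, I couldn't find that information in the policy documents."

    system_message = ("You are an AI assistant that answers questions using the company "
                      "policy excerpts provided. If the answer is not in the excerpts, "
                      "say you don't know.")
    user_message = f"User Question: {question}\n\nRelevant Excerpts:\n" + "\n".join(excerpts)

    # Step 6: ask the GPT-4 chat deployment to synthesize a grounded answer.
    response = client.chat.completions.create(
        model="gpt-4",  # name of the Azure OpenAI chat deployment (placeholder)
        messages=[{"role": "system", "content": system_message},
                  {"role": "user", "content": user_message}],
        temperature=0,
    )
    return response.choices[0].message.content
```

A production orchestrator would add error handling, token budgeting, citation formatting, and conversation history on top of this skeleton, but the retrieve-then-generate shape stays the same.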

Throughout this pipeline, the system ensures only authorized content is retrieved (Azure Search can enforce ACLs if needed so that, e.g., HR policies aren’t shown to engineers without permission – though general company policies may be open to all employees). The combination of Azure AI Search for retrieval and GPT-4 for generation results in a powerful Q&A system: the search brings the relevant policy text (something GPT-4 would not have seen in training because it’s internal), and GPT-4 eloquently answers using that text. If a query cannot be answered from the documents (say the information isn’t in the policies), the system can detect that (e.g., if search returns no good hits or the model isn’t given any context) and respond with an appropriate message like “I’m sorry, I couldn’t find that information.”

This retrieval-response flow happens in seconds. Azure AI Search is optimized for fast queries over indexed data, and Azure OpenAI can generate answers in a few seconds for a few hundred tokens of output. The design also cleanly separates responsibilities: if tomorrow the company adds a new policy or updates an existing one, re-indexing that content will automatically make it available for the next questions, without needing to retrain any model. This is a key advantage of the RAG approach.

Search Methods in Azure AI Search: Vector, Semantic, and Hybrid

Azure AI Search provides multiple search techniques that we leverage in this RAG architecture. It’s important to understand these methods and when to use each:

  • Keyword (full-text) search – classic BM25-ranked lexical search over the content field; precise for exact terms, product names, and acronyms, but blind to synonyms and paraphrases.
  • Vector search – approximate nearest-neighbor (HNSW) similarity over the content_vector embeddings; finds semantically related chunks even when the wording differs, but can miss exact-match signals.
  • Semantic ranking – a deep-learning reranker applied on top of the initial results to reorder them by how well they actually answer the query, optionally returning extractive captions and answers.
  • Hybrid search – a single query that runs keyword and vector retrieval together and fuses the result sets (using Reciprocal Rank Fusion), optionally followed by the semantic reranker.

To summarize the trade-offs: Vector search gives broad semantic recall, Keyword search gives precise lexical matching, Semantic search gives intelligent relevance ranking, and Hybrid search gives a balance of recall and precision by leveraging both vector and keyword (and can be further improved with semantic rerank). By using hybrid retrieval with semantic ranking in Azure AI Search, our system captures the strengths of each method – ensuring the LLM gets the most relevant and comprehensive context to answer user questions accurately. This approach is in line with current best practices for RAG systems on Azure: experiments confirm that chunked content + hybrid retrieval + semantic reranking yields significantly higher quality in grounded answers (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub).
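
As a quick illustration of how these methods differ in practice, the calls below show the same question issued as keyword-only, vector-only, hybrid, and hybrid-plus-semantic-reranking queries, reusing the search_client and embed() helpers from the earlier sketches (names remain placeholders).

```python
# The same SearchClient call, configured for each retrieval method.
# Reuses the search_client and embed() helpers from the earlier sketches.
from azure.search.documents.models import VectorizedQuery

question = "What are our policies for encrypting customer data in the cloud?"
vector_query = VectorizedQuery(vector=embed([question])[0],
                               k_nearest_neighbors=50,
                               fields="content_vector")

keyword_only = search_client.search(search_text=question, top=5)                     # BM25 lexical match
vector_only = search_client.search(search_text=None, vector_queries=[vector_query], top=5)
hybrid = search_client.search(search_text=question, vector_queries=[vector_query], top=5)
hybrid_reranked = search_client.search(                                               # hybrid + semantic ranker
    search_text=question,
    vector_queries=[vector_query],
    query_type="semantic",
    semantic_configuration_name="policy-semantic-config",
    top=5,
)
```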

Azure Solution vs. Open-Source Alternatives

It’s valuable to compare this Azure-based RAG architecture with alternative solutions using open-source or other technologies for document indexing, vector storage, and semantic retrieval. Here we highlight how Azure’s managed services stack up against some popular options:

  • Elasticsearch / OpenSearch – the closest like-for-like alternative, offering full-text (BM25) plus dense-vector search; it can replicate the hybrid retrieval pattern, but you manage the cluster, the ingestion and enrichment pipeline, and any reranking layer yourself.
  • Dedicated vector stores (Pinecone, Milvus, Qdrant, Weaviate, Vespa) – purpose-built for fast similarity search and often very flexible (some also offer keyword or hybrid modes), but document cracking, chunking, enrichment, and reranking have to be assembled around them with additional components and glue code.
  • In all of these cases, hosting, scaling, patching, and security hardening become your responsibility, whereas the managed Azure service handles them for you.

In conclusion, Azure’s RAG architecture provides an end-to-end, integrated solution: Azure AI Search acts as both a vector store and search index, with built-in AI enrichment, and Azure OpenAI provides world-class language models. This tightly integrated approach is especially practical for enterprise scenarios where security, reliability, and support are important. Open-source alternatives like Elasticsearch, Qdrant, Weaviate, or Vespa can absolutely be used to build a similar system, and in some cases might offer more flexibility or avoid vendor lock-in. However, they would require more engineering work to achieve the same level of functionality. As one analysis put it: “Azure Cognitive Search is the one-stop shop if you want to go all-in on vectors on Azure… Elastic is a like-to-like contender if you manage it yourself, and for specialized vector operations you could host Pinecone, Milvus, Qdrant, etc.” (Choosing the right Azure Vector Database | by Michael John Peña | Medium). The choice often comes down to where your data and apps already reside, and how much you value a managed service. For a company already using Azure (and given the need to keep internal data secure within a trusted cloud), leveraging Azure AI Search with Azure OpenAI is an excellent fit.

Conclusion

The described Retrieval-Augmented Generation system architecture provides a powerful way to harness internal company knowledge with the capabilities of generative AI. By using Azure AI Search to index and retrieve document chunks, and Azure OpenAI’s GPT-4 to generate answers, the system can deliver accurate, context-aware responses to users’ questions about internal policies. We covered how documents flow through an ingestion pipeline (from Blob storage, through parsing, chunking, and embedding, into the search index) and how at query time the relevant pieces of information are fetched (using vector, semantic, and hybrid search techniques) and fed into the LLM to produce a grounded answer. Key Azure services – Search, OpenAI, Blob, Functions – work in concert to achieve this, each handling what it’s best at (search/indexing, language understanding, storage, and glue logic respectively).

This RAG approach ensures answers are not just fluent, but also correct and specific to the company’s data. Compared to a standalone LLM, it dramatically reduces hallucinations and allows the system to answer questions that the base model wouldn’t know (since it injects proprietary knowledge). We also discussed how Azure’s solution compares to open-source solutions, noting that while one could assemble the pieces oneself, Azure provides an integrated and enterprise-ready path. In practice, the architecture can be expanded or adapted – for example, adding Azure AD authentication so that the search results respect user permissions, or using Azure Monitor to log and analyze which questions are asked and which documents are used (helpful for improving the system). But at its core, the architecture remains: Index your knowledge, retrieve relevant context, and let the LLM answer using that context.

By following this architecture, the company can deploy a virtual assistant that expertly answers cloud security policy questions (or any internal policy queries), with the confidence that the answers are grounded in the latest official documentation. This not only saves time for employees seeking information but also ensures consistent and accurate messaging aligned with the company’s policies. It’s a prime example of how AI can be safely and effectively leveraged in a domain-specific way by combining strengths of information retrieval and generation.

Sources: The design and recommendations above are based on Microsoft’s documentation and best practices for Azure Cognitive Search and RAG, as well as industry comparisons. Azure AI Search is recommended for RAG due to its integrated vector and hybrid search capabilities and seamless Azure integration (Azure AI search for RAG use cases | by Anurag Chatterjee | Medium). The ingestion process of chunking and embedding follows guidelines for handling large documents and keeping within model token limits (Chunk documents in vector search - Azure AI Search | Microsoft Learn). Using hybrid retrieval with semantic reranking is backed by Azure’s research showing it improves result relevance for generative AI applications (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub). Open-source comparisons are drawn from community discussions and Azure’s own guidance on vector database options (Choosing the right Azure Vector Database | by Michael John Peña | Medium), illustrating that while alternatives exist, Azure’s managed approach offers a convenient one-stop solution for enterprise needs. By adhering to these principles and configurations, the architecture achieves a balance of accuracy, clarity, and practicality in delivering LLM-generated answers grounded in domain-specific content.
