RAG Approach to Domain-Specific Question Answering
Retrieval-Augmented Generation System Architecture with Azure AI Search
Retrieval-Augmented Generation (RAG) is an architecture that combines a Large Language Model (LLM) with an external information retrieval system to ground the model’s answers in specific data (RAG and generative AI - Azure AI Search | Microsoft Learn). In our context, the goal is to answer questions about a company’s internal policies (e.g. cloud security measures) by augmenting an LLM (such as GPT-4) with the company’s private documents (employee handbooks, policy documents, etc.). By using RAG, the LLM’s responses are based on actual internal content, which improves accuracy and greatly reduces hallucinations (RAG and generative AI - Azure AI Search | Microsoft Learn).
This solution will leverage Microsoft Azure’s services for a robust, enterprise-ready implementation. We use Azure AI Search (formerly Azure Cognitive Search) as the core of our retrieval system for indexing and querying policy documents. Azure AI Search is well-suited for RAG because it supports semantic search and vector similarity search, along with Azure’s security and scale (Azure AI search for RAG use cases | by Anurag Chatterjee | Medium). In addition, Azure OpenAI Service provides the LLM (GPT-4) and embedding models needed. Azure Blob Storage will store the source documents, and Azure Functions will be used to orchestrate processes like data ingestion or to implement custom logic. The end result is an architecture where a user’s question is answered by GPT-4 using relevant excerpts from internal policies, retrieved on-the-fly from an Azure AI Search index.
System Architecture Overview
Key Components and Azure Services:
- Azure Blob Storage: Central repository for the company’s internal documents (PDF files, Word documents, etc.). All source policy documents are securely stored in blobs, which Azure AI Search or functions can ingest.
- Azure AI Search (Cognitive Search): The search service that indexes the policy documents and enables fast retrieval. Documents are indexed in chunks with both full-text indexing and vector embeddings for semantic search. Azure AI Search provides versatile query capabilities, including keyword search, vector similarity search, and hybrid combinations, along with rich filters and security controls (Azure AI search for RAG use cases | by Anurag Chatterjee | Medium).
- Azure OpenAI Service: Provides AI models for both embeddings and answer generation. An embedding model (such as text-embedding-ada-002) converts text chunks into high-dimensional vectors used by Azure AI Search for similarity matching. The generative model (GPT-4) produces the final answer, given the user query and retrieved context.
- Azure Functions: Used where custom processing or orchestration is needed. For example, an Azure Function can handle the ingestion pipeline: triggered when a new document is uploaded to Blob Storage, it parses, chunks, embeds, and indexes the content (a minimal trigger sketch follows this list). Another function (or application backend) can serve as the orchestrator at query time, mediating between the user’s question, the search service, and the OpenAI model.
- Azure Cognitive Search Skillset (Optional): Azure AI Search can use built-in indexers and skillsets to automate parts of ingestion. Cognitive skills (like OCR, text extraction, text splitting, and embedding generation) may be configured so that Azure AI Search itself performs parsing, chunking, and embedding during indexing (Integrated vectorization - Azure AI Search | Microsoft Learn). This reduces the need for manual coding, using Azure’s AI enrichment pipeline.
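As an illustration of the Azure Functions piece, a minimal blob-trigger sketch (Python v2 programming model) might look like the following. The "policies" container name and the function name are assumptions for this example, not part of the design above.

```python
import logging
import azure.functions as func

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob",
                  path="policies/{name}",          # assumed container for policy documents
                  connection="AzureWebJobsStorage")
def ingest_policy_document(blob: func.InputStream):
    # Fires when a new file lands in Blob Storage; downstream parsing,
    # chunking, embedding, and indexing are sketched later in this document.
    logging.info("New policy document uploaded: %s", blob.name)
    raw_bytes = blob.read()
```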
All these components work together to form the RAG system. The architecture ensures that when a user asks a question, the system retrieves the most relevant policy content and feeds it into GPT-4, which then crafts a domain-specific answer. The following sections describe this process in detail: from how documents are ingested and indexed, to how queries are processed and answered, and how Azure’s search capabilities compare to open-source alternatives.
Document Ingestion and Indexing Process
Ingestion is the process of taking raw internal documents and preparing them for efficient retrieval. This involves parsing files, splitting them into retrievable chunks, generating vector embeddings, and storing everything in the Azure AI Search index. The ingestion pipeline can be implemented using Azure AI Search indexers with a skillset or via custom code (e.g. Azure Functions). Key steps in the ingestion process include:
- Document Acquisition: The pipeline is triggered to ingest documents, either by pulling from a data source or pushing documents into the index. In our case, the data source is Azure Blob Storage containing policy documents. We can set up an Azure AI Search indexer connected to the blob container so that it automatically finds new or updated files. Alternatively, an Azure Function can watch the blob storage for new files (using Event Grid) and initiate processing. This flexibility allows content to be loaded or refreshed at the required frequency (RAG and generative AI - Azure AI Search | Microsoft Learn). Azure Search supports both push APIs and indexers; an indexer simplifies ingestion by automatically retrieving blobs and parsing their content (Azure AI search for RAG use cases | by Anurag Chatterjee | Medium).
- Document Parsing: Each document is parsed to extract its text content. Azure AI Search indexers have built-in document cracking for common formats (like PDF, DOCX, HTML, etc.), so they can extract text from these files without custom code (Choosing the right Azure Vector Database | by Michael John Peña | Medium). If documents are scanned images or need complex processing, Azure Cognitive Services (like Form Recognizer) or a custom Azure Function can be used to perform OCR and text extraction. The output of this step is the raw text of each document (often with basic metadata like filename or path).
- Text Chunking: The document text is split into smaller chunks or passages. Chunking is critical for RAG because LLMs have input length limits and because retrieving at a fine granularity improves accuracy. We aim for each chunk to be a semantically coherent unit (e.g. a paragraph, section, or answer to a single question) that fits within embedding and prompt size limits. A common strategy is a fixed-size sliding window: for example, chunks of ~200-500 words with a 10-15% overlap so that context isn’t lost between chunks (Chunk documents in vector search - Azure AI Search | Microsoft Learn). Another strategy is logical chunking by document structure – for instance, using headings or sentence boundaries to split content where appropriate (Chunk documents in vector search - Azure AI Search | Microsoft Learn). Azure AI Search provides a Text Split skill that can split text by paragraphs, sentences, or by character length, including overlap configuration (Chunk documents in vector search - Azure AI Search | Microsoft Learn). It also offers a Document Layout skill which can chunk by semantic sections (using headings in PDFs/Word, etc.) (Chunk documents in vector search - Azure AI Search | Microsoft Learn). By chunking the content, we ensure each piece focuses on a single topic or point, which improves the relevance of search results and stays under the token limit of embedding models (Chunk documents in vector search - Azure AI Search | Microsoft Learn). (For reference, the OpenAI Ada embedding model accepts up to ~8191 input tokens (Chunk documents in vector search - Azure AI Search | Microsoft Learn), but smaller chunks tend to yield better retrieval performance than one huge embedding (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub).)
- Embedding Generation: For each text chunk, we generate a vector embedding – a numeric representation of the chunk’s semantic content. We typically use Azure OpenAI’s embedding model (such as text-embedding-ada-002) for this. The embedding is a 1536-dimensional vector (for Ada-002) capturing the meaning of the text. If using Azure AI Search skillsets, we can add an AzureOpenAIEmbedding skill to the pipeline, which calls the OpenAI service to embed each chunk automatically (Integrated vectorization - Azure AI Search | Microsoft Learn). Alternatively, an Azure Function or external script can call the OpenAI API to get embeddings and then attach them to the chunk data (a chunk-and-embed code sketch appears later in this section). This step vectorizes the content: the text “Cloud data is encrypted at rest using AES-256” becomes a vector in high-dimensional space. Azure AI Search does not create embeddings on its own – we supply them (via skillset or API) before indexing (Choosing the right Azure Vector Database | by Michael John Peña | Medium). The output of this step is an embedding vector for each chunk, which will be stored in the search index.
- Enrich Metadata: (Optional but recommended) We attach relevant metadata to each chunk. Metadata can include the source document ID or title, section headings, or tags like document category (e.g., “Security Policy”). We might also generate a summary of the chunk or extract keywords as metadata fields (Azure Search’s cognitive skills can do key phrase extraction, if needed). This metadata is stored alongside the chunk and can be used for filtering and for providing context in the answer. For example, a chunk may have fields: content (the text), content_vector (the embedding), doc_title (“Cloud Security Policy”), section (“Encryption at Rest”), etc. The enrichment step ensures each chunk carries the necessary context to trace it back to its source and to help rank it properly (Design and Develop a RAG Solution - Azure Architecture Center | Microsoft Learn).
- Indexing and Storage: Finally, the processed chunk (text + embedding + metadata) is persisted in an Azure AI Search index. Each chunk becomes a document in the search index. The index schema defines fields for the content, the vector, and metadata. For example, we define a field content of type Edm.String (searchable text), and a field content_vector of type Collection(Edm.Single) with dimension 1536 (for the embedding) and an attached vector search profile (Azure AI search for RAG use cases | by Anurag Chatterjee | Medium). This vector field is marked as searchable and retrievable, enabling similarity queries. The index is configured to allow vector search on content_vector (using Azure’s approximate k-NN algorithm, HNSW by default) and full-text search on content (using the BM25 algorithm by default); a schema sketch follows this list. Azure AI Search supports defining a vector index with either HNSW (for efficient ANN search) or an exhaustive KNN search for smaller datasets (Choosing the right Azure Vector Database | by Michael John Peña | Medium). The index can also enable semantic ranking on the content (more on that shortly). If using an indexer with a skillset, this indexing step happens automatically after enrichment: the indexer writes each enriched chunk to the search index (Integrated vectorization - Azure AI Search | Microsoft Learn). If using a custom pipeline, the Azure Function would use the Azure Search SDK/REST API to upload documents in batches (Azure AI search for RAG use cases | by Anurag Chatterjee | Medium). Once indexing is complete, the content is ready for retrieval.
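To make the schema concrete, the following is a minimal sketch of how such an index could be defined with the azure-search-documents Python SDK (class names reflect version 11.4+). The endpoint, keys, index name, and the "hnsw-profile" / "policy-semantic-config" names are illustrative assumptions, not fixed values from this design.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex, SimpleField, SearchableField, SearchField, SearchFieldDataType,
    VectorSearch, HnswAlgorithmConfiguration, VectorSearchProfile,
    SemanticConfiguration, SemanticPrioritizedFields, SemanticField, SemanticSearch,
)

index_client = SearchIndexClient("https://<search-service>.search.windows.net",
                                 AzureKeyCredential("<admin-key>"))

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SearchableField(name="doc_title", type=SearchFieldDataType.String, filterable=True),
    SimpleField(name="section", type=SearchFieldDataType.String, filterable=True),
    # 1536-dimension vector field matching text-embedding-ada-002 output
    SearchField(name="content_vector",
                type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True,
                vector_search_dimensions=1536,
                vector_search_profile_name="hnsw-profile"),
]

vector_search = VectorSearch(
    algorithms=[HnswAlgorithmConfiguration(name="hnsw-config")],  # approximate k-NN
    profiles=[VectorSearchProfile(name="hnsw-profile",
                                  algorithm_configuration_name="hnsw-config")],
)

semantic_search = SemanticSearch(configurations=[
    SemanticConfiguration(
        name="policy-semantic-config",
        prioritized_fields=SemanticPrioritizedFields(
            title_field=SemanticField(field_name="doc_title"),
            content_fields=[SemanticField(field_name="content")]),
    )
])

index = SearchIndex(name="policy-chunks-index", fields=fields,
                    vector_search=vector_search, semantic_search=semantic_search)
index_client.create_or_update_index(index)
```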
At the end of ingestion, Azure AI Search holds a secure, searchable index of the internal policy data. Each original document has been broken into many smaller indexed chunks, each with an associated embedding vector. This structure allows the system to retrieve just the relevant pieces of documents when a question is asked, rather than an entire file. The ingestion process can be scheduled or continuous. For instance, an indexer can run periodically or be triggered on new file uploads, so the index stays up-to-date with the latest policies. Azure AI Search is built for scale, capable of indexing millions of documents and handling frequent updates, which is important if the company has a large or evolving document set (RAG and generative AI - Azure AI Search | Microsoft Learn).
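For the custom-code (push) variant of the pipeline, the sketch below shows the chunk, embed, and upload loop in Python. It assumes the index defined above, an Azure OpenAI embedding deployment named text-embedding-ada-002, and illustrative chunk-size and overlap values; a production pipeline would add batching limits, retries, and error handling.

```python
from openai import AzureOpenAI
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

openai_client = AzureOpenAI(azure_endpoint="https://<aoai-resource>.openai.azure.com",
                            api_key="<aoai-key>", api_version="2024-02-01")
search_client = SearchClient("https://<search-service>.search.windows.net",
                             "policy-chunks-index", AzureKeyCredential("<admin-key>"))

def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Fixed-size sliding window over words, with overlap so context isn't lost."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[start:start + size]) for start in range(0, len(words), step)]

def embed(text: str) -> list[float]:
    """Call the Azure OpenAI embedding deployment (assumed: text-embedding-ada-002)."""
    response = openai_client.embeddings.create(model="text-embedding-ada-002", input=text)
    return response.data[0].embedding

def ingest(doc_id: str, doc_title: str, text: str) -> None:
    # One search document per chunk: text, embedding, and traceability metadata.
    batch = [{
        "id": f"{doc_id}-{i}",
        "content": chunk,
        "content_vector": embed(chunk),
        "doc_title": doc_title,
    } for i, chunk in enumerate(chunk_text(text))]
    search_client.upload_documents(documents=batch)
```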
Semantic Search Configuration in Azure AI Search
One of the powerful features we enable in Azure AI Search is semantic search, which in Azure’s terminology refers to an advanced AI-driven ranking of search results (previously called the semantic ranker). Semantic search improves result quality by understanding the intent of the query and the meaning of the documents, rather than relying only on keyword matching. Here’s how we configure and use it:
- Enabling Semantic Search: Azure AI Search offers semantic ranking on its billable tiers (historically Standard S1 and above; check the current tier requirements). We ensure our search service tier supports it, and then we can turn it on per query by specifying queryType="semantic" and a semantic configuration. This tells Azure Search to use its built-in transformer-based models to re-rank the top results. Under the hood, Azure takes the initial search results (e.g., the top 50 documents from a keyword or hybrid query) and applies a language model to score how well each document’s content semantically answers the query (Choosing Between Keyword Search, Vector Search, Keyword-Vector Hybrid Search, Semantic Search, and Keyword, Vector, Semantic Hybrid Search when doing RAG with Python SDK for Azure OpenAI - Microsoft Q&A). The output is a re-ordered list where truly relevant passages bubble to the top, even if they weren’t the top by raw keyword score. Semantic mode can also return a caption for each result – a sentence from the document that seems to directly address the query – which is useful for displaying a quick answer snippet to the user.
- Semantic Search for Context Retrieval: In the RAG pipeline, our primary goal is to retrieve the most relevant chunks of text to feed to the LLM. By using semantic ranking in Azure AI Search, we improve the chances that the chunks we retrieve indeed answer the user’s question. For example, if the user asks “How do we encrypt data in the cloud?”, a pure keyword search might match chunks containing words like “encrypt” or “data” but not necessarily the full context. Semantic search looks at the meaning of the question and the meaning of the text: a chunk that explains encryption of data at rest will be ranked highly even if it uses different phrasing (like “data is secured with encryption at rest using AES-256”) (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub). We define a semantic configuration in Azure Search that suits Q&A, typically one that weights answer-like content. This involves choosing which fields to include in semantic analysis (we’d include the main content field), the maximum number of results to re-rank (e.g. top 50), and enabling captions/highlighting.
- Hybrid Semantic Retrieval: We often use semantic ranking in combination with vector search (hybrid retrieval). Azure AI Search allows a query to use both the vector index and the keyword index together (more on hybrid in the next section), and then apply semantic ranking on the merged result set. This query type is sometimes referred to as "vectorSemanticHybrid" in Azure OpenAI integrations – it means do a hybrid search, then semantically rerank. By doing this, we get the broad recall of vector similarity plus the precision of semantic understanding. Microsoft’s guidance and experiments indicate that hybrid retrieval plus semantic reranking yields the best results for generative AI scenarios (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub). We will adopt that approach: our Azure Search queries will use hybrid mode with the semantic ranker enabled. In practice, that can be achieved by issuing a hybrid query with a semantic configuration via the Search REST API or SDK (a query sketch follows this list), or, if using the Azure OpenAI “data sources” integration, by specifying the vectorSemanticHybrid query type.
- Tuning and Testing: We would test semantic search on sample queries to ensure it returns relevant snippets. Azure AI Search’s semantic capabilities can sometimes be tuned via the semantic configuration or by adjusting the content (for instance, adding a summary field might help). It’s also important to note that semantic ranking adds some latency (it’s an extra neural processing step on top of retrieval) and may incur additional cost, but for our use-case of internal Q&A, the improved relevance is usually worth it. We also ensure that any security trimming (if we had document-level security) is applied before semantic ranking, so the model only sees content the user is allowed to access.
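Putting the above together, a hybrid query with semantic reranking might look like the following sketch (azure-search-documents 11.4+, reusing the search client and embed() helper from the ingestion sketch; the semantic configuration name is an assumption):

```python
from azure.search.documents.models import VectorizedQuery

question = "How do we encrypt data in the cloud?"
question_vector = embed(question)  # same embedding model used at indexing time

results = search_client.search(
    search_text=question,                       # keyword (BM25) part of the hybrid query
    vector_queries=[VectorizedQuery(vector=question_vector,
                                    k_nearest_neighbors=50,
                                    fields="content_vector")],
    query_type="semantic",                      # apply the semantic reranker
    semantic_configuration_name="policy-semantic-config",
    query_caption="extractive",                 # ask for answer-like captions
    top=5,
)
for result in results:
    print(result["doc_title"], "->", result["content"][:80])
```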
In summary, semantic search in Azure AI Search is configured to understand the user’s query and the document meaning, giving us highly relevant passages. It works hand-in-hand with vector search in our architecture to retrieve the best possible context for the LLM.
Retrieval and Response Pipeline
Once the documents are indexed and the search service is ready, the runtime query flow (the retrieval and answer generation pipeline) works as follows. This is the core of how a user’s question gets turned into an answer with supporting data:
- User Question Submission: A user (e.g. an employee) asks a question through a client application or chatbot interface. For instance, the user might ask: “What are our policies for encrypting customer data in the cloud?”. This question is sent to the backend of our system for processing.
- Orchestrator Receives Query: The question is received by an orchestrator component. This could be a dedicated backend service, an Azure Function set up for this purpose, or even logic in the client – but typically a server-side component handles it. The orchestrator’s job is to coordinate between Azure AI Search and the Azure OpenAI GPT-4 model. Using an orchestrator (which could be implemented with frameworks like Semantic Kernel, LangChain, or custom code) gives us control and the ability to add business logic. In Azure’s reference architecture, the orchestrator decides which type of search to perform (keyword, vector, or hybrid) and how to construct the prompt (Design and Develop a RAG Solution - Azure Architecture Center | Microsoft Learn).
- Formulate Search Query: The orchestrator formulates a query to Azure AI Search to retrieve relevant document chunks. Given our configuration, it will likely use a hybrid query with vector and keyword components (an end-to-end sketch follows this list). Concretely, there are a couple of ways to do this:
- Using the Azure Search API directly: The orchestrator first generates an embedding for the user’s question (using the same embedding model as the index). For example, it calls Azure OpenAI to get a 1536-dim embedding for “What are our policies for encrypting customer data in the cloud?”. Then it sends a search request to Azure AI Search with the vector (embedding) and usually the raw text as well. Azure Search supports queries that include a vector for similarity search on the content_vector field, combined with a text query on the content or metadata fields. We might use Azure’s Reciprocal Rank Fusion (RRF) based hybrid mode by issuing a single request that contains both the vector and textual parts. Azure’s documentation indicates that a hybrid query will take the top results from vector search and keyword search and fuse them into one result set (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub). We would include queryType="simple" (keyword) plus a vector clause, or use the newer combined query syntax. Additionally, we set queryLanguage="en" and semanticConfiguration="<our semantic config name>" if we want semantic ranking in the same call. The search query can also include filters if the user’s question is limited to a certain category (e.g., only search within “Security” policies if needed).
- Using Azure OpenAI with a Cognitive Search data source: Azure OpenAI Service offers an integration where you can provide Azure Cognitive Search as a data source for the model. In that setup, you send the user’s question to the Azure OpenAI ChatCompletion API with a special parameter indicating the Azure Search index details. The service itself then performs the retrieval (based on specified parameters) and injects the results into the prompt for the model. For example, by specifying dataSources:[{ type: AzureCognitiveSearch, parameters: { indexName: ..., etc. } }] and a queryType like "vectorSemanticHybrid", the OpenAI service will do a vector+semantic hybrid search and supply the top results to the model automatically (Choosing Between Keyword Search, Vector Search, Keyword-Vector Hybrid Search, Semantic Search, and Keyword, Vector, Semantic Hybrid Search when doing RAG with Python SDK for Azure OpenAI - Microsoft Q&A). This is a convenient option, though it is somewhat less transparent – for clarity and control, our architecture description assumes we perform the search ourselves and then call the LLM, but it’s worth noting this managed approach exists.
- Retrieve Relevant Chunks: Azure AI Search executes the query and returns the top N results (chunks). Each result includes the chunk text (content) and any metadata (like doc_title, plus a search score). Thanks to our indexing and search configuration:
- If the query uses vector search, the results returned are those chunks with embeddings closest to the question’s embedding – i.e., semantically most similar.
- If using keyword search, results would be those with matching terms (e.g., chunks containing “encrypt” or “customer data”).
- In our hybrid approach, Azure Search will retrieve results from both and merge them. For example, it may take the 50 nearest neighbors from the vector index and 50 top BM25 text results, then apply an algorithm (RRF) to intermix them and produce a final top 50 ranking (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub). This helps ensure that if an exact term match is extremely relevant it isn’t missed, while also catching semantically relevant passages that didn’t share exact wording (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub).
- Semantic reranking is applied to the merged results if configured. Azure’s semantic ranker will analyze the question and each candidate chunk to score how well the chunk answers the question, then reorder accordingly. This often brings the best answer-containing chunk to rank 1. According to Microsoft, using hybrid retrieval with semantic reranking significantly improves the quality of the top results for feeding into an LLM (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub). In practice, the search response might even include a generated “caption” – a sentence from the chunk that directly addresses the query – which we could use as a quick reference. We primarily care about the chunk text itself as we will feed that to GPT-4.
We typically take the top 3-5 chunks from the search results as our context. (The number can be tuned based on observed answer quality and the length of chunks; we want enough coverage to answer the question but not so much text that we risk overloading the prompt or diluting the relevance.)
- Compose LLM Prompt with Retrieved Context: The orchestrator now prepares a prompt for GPT-4 that includes the user’s question and the retrieved context passages. One effective prompt structure is:
- System message: A directive to the assistant, e.g. “You are an AI assistant that provides answers based on company policy documents. Answer the question using the provided information, and if the answer is not in the documents, say you don’t know.” This sets the stage and ensures the model sticks to the sources.
- User message: This contains the user’s actual question and the context. For example, it may say: “User Question: What are our policies for encrypting customer data in the cloud? Relevant Excerpts: 1. [Excerpt from Cloud Security Policy]: ‘All customer data stored in cloud systems must be encrypted at rest using at least AES-256 encryption…’ 2. [Excerpt from Data Handling Guidelines]: ‘Customer data in transit should be encrypted using TLS 1.2 or higher…’”. We list each retrieved chunk, possibly prefaced by a title or source identifier for clarity. The prompt is constructed carefully to fit within the model’s token limit (8k or 32k context for GPT-4, depending on the model version). Since our chunks were sized with token limits in mind and we only include the top few, this should stay within the limit.
The model is thereby given all the information it needs: the question and the supporting content from internal docs. In some implementations, we also add a request like “Please cite the source of your answer based on the provided excerpts.” if we want the model to indicate which document the info came from. (The model can do this by referring to the excerpt numbers or metadata we provided, though formatting citations is something we might handle post-response as well.)
- LLM (GPT-4) Generates Answer: We send the prompt to the Azure OpenAI GPT-4 model (via the Chat Completion API) and get back a response. Because the model has access to specific policy text, it will ground its answer on that. For example, GPT-4 might respond: “According to the Cloud Security Policy, all customer data in the cloud is encrypted at rest with at least AES-256. Additionally, the Data Handling Guidelines state that data in transit must use TLS 1.2 or above for encryption. This ensures that customer data is protected both when stored and when transmitted.” The answer will essentially be a summary or direct explanation of the retrieved excerpts, phrased in a helpful way for the user. The model’s role is to synthesize and present the information in the documents, using its natural language capabilities, but not to introduce new factual claims. By constraining it to the provided context, we maintain accuracy (the model is controlled by grounding data from the enterprise content (RAG and generative AI - Azure AI Search | Microsoft Learn)).
- Response Delivery: The orchestrator (or the Azure OpenAI service, if using the built-in integration) returns the answer to the user. The user sees the final answer, and potentially we can also show the sources. For example, we might display a couple of snippets or a reference like “(Source: Cloud Security Policy)” so the user trusts the answer. This is a design choice – the system can just give the answer text, or include citations for transparency. Since the question is about internal policies, the user will often want to know which document the answer came from. Our metadata (like doc_title or a document ID per chunk) enables us to trace back and present a citation. Many RAG implementations provide these source links to build user trust.
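A minimal end-to-end sketch of this query-time flow (retrieve, compose the grounded prompt, call GPT-4) is shown below. It reuses the clients and the embed() helper from the earlier sketches; the "gpt-4" deployment name and the prompt wording are illustrative assumptions.

```python
from azure.search.documents.models import VectorizedQuery

def answer_question(question: str) -> str:
    # 1. Hybrid retrieval with semantic reranking, as configured above.
    results = search_client.search(
        search_text=question,
        vector_queries=[VectorizedQuery(vector=embed(question),
                                        k_nearest_neighbors=50,
                                        fields="content_vector")],
        query_type="semantic",
        semantic_configuration_name="policy-semantic-config",
        top=3,
    )
    # 2. Compose the prompt from the top chunks, keeping source titles for citation.
    excerpts = "\n".join(
        f"{i + 1}. [{r['doc_title']}]: {r['content']}" for i, r in enumerate(results))
    system_prompt = ("You are an AI assistant that answers questions using the provided "
                     "company policy excerpts. If the answer is not in the excerpts, "
                     "say you don't know.")
    user_prompt = f"User Question: {question}\n\nRelevant Excerpts:\n{excerpts}"

    # 3. Generate the grounded answer with the GPT-4 deployment.
    completion = openai_client.chat.completions.create(
        model="gpt-4",  # Azure OpenAI deployment name (assumed)
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content

print(answer_question("What are our policies for encrypting customer data in the cloud?"))
```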
Throughout this pipeline, the system ensures only authorized content is retrieved (Azure Search can enforce ACLs if needed so that, e.g., HR policies aren’t shown to engineers if not permitted – though for general company policies it may be all open to employees). The combination of Azure AI Search for retrieval and GPT-4 for generation results in a powerful Q&A system: the search brings the relevant policy text (something GPT-4 would not have seen in training because it’s internal), and GPT-4 eloquently answers using that text. If a query cannot be answered from the documents (say the information isn’t in the policies), the system can detect that (e.g., if search returns no good hits or the model isn’t given any context) and respond with an appropriate message like “I’m sorry, I couldn’t find that information.”
This retrieval-response flow happens in seconds. Azure AI Search is optimized for fast queries over indexed data, and Azure OpenAI can generate answers in a few seconds for a few hundred tokens of output. The design also cleanly separates responsibilities: if tomorrow the company adds a new policy or updates an existing one, re-indexing that content will automatically make it available for the next questions, without needing to retrain any model. This is a key advantage of the RAG approach.
Search Methods in Azure AI Search: Vector, Semantic, and Hybrid
Azure AI Search provides multiple search techniques that we leverage in this RAG architecture. It’s important to understand these methods and when to use each:
- Keyword Search (Lexical Search): This is the traditional search based on lexical matching and BM25 ranking. When Azure Search receives a query in simple mode, it will look for documents containing the query terms (or their variations/stems) and score them based on term frequency, etc. In our scenario, keyword search alone would find policy chunks that literally contain words from the question. Its strength is precision for exact matches – if the user mentions a specific term or phrase that appears in a document, keyword search will retrieve that reliably, and it can filter by fields (e.g., only look in titles or a certain category using filters). However, a pure keyword search may miss relevant information if the user’s phrasing doesn’t match the document’s wording. For example, a query for “data encryption policy” with keyword search might not return a chunk that talks about “protecting data at rest”, because the wording differs. Keyword search is fast and requires no ML model, but it lacks understanding of context or synonyms.
- Vector Search (Semantic Vector Similarity): Vector search finds results by semantic similarity between the query and documents. It uses the embeddings we generated. When a query is converted to an embedding (either by the client or via Azure Search’s integrated vectorizer), Azure Search performs an ANN (Approximate Nearest Neighbors) search in the vector space to find the closest document vectors. This method can retrieve relevant content even if there are no literal keyword overlaps (Azure AI search for RAG use cases | by Anurag Chatterjee | Medium) (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub). In our example, a query about “encrypting customer data” will match a chunk that says “customer data must be protected with encryption”, because semantically they are similar, even if one says “protected” and the other “encrypting”. Vector search is robust to synonyms, paraphrasing, and even translations (cross-lingual embeddings would allow matching in different languages) (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub). The downside is that vector search might return a passage that is topically related but not actually answering the question, if the embedding space isn’t fine-grained enough. It also doesn’t inherently support boolean operators or exact phrase requirements – it’s purely based on learned semantic similarity. In Azure, vector search is enabled by having a vector field in the index and using a similarity algorithm like HNSW (which we configured). One must also ensure the query is turned into the same embedding vector format (for consistency with indexed vectors) (Integrated vectorization - Azure AI Search | Microsoft Learn). When to use: Vector search is excellent for natural language queries and when you want to capture meaning. It’s particularly useful if your users ask questions in various ways – the vector will help find the right content even if wording differs. We rely on vector search heavily in RAG because it provides the semantic grounding needed.
- Semantic Search (Semantic Ranking): Azure’s semantic search (or semantic ranker) isn’t a separate retrieval mechanism by itself, but rather an enhancement on top of a basic retrieval. It uses large language models to interpret the query and the candidate documents, and then re-ranks or filters those candidates to improve relevance (Choosing Between Keyword Search, Vector Search, Keyword-Vector Hybrid Search, Semantic Search, and Keyword, Vector, Semantic Hybrid Search when doing RAG with Python SDK for Azure OpenAI - Microsoft Q&A). In practice, you use semantic search in combination with keyword or hybrid retrieval. Semantic ranking excels at understanding the intent behind the query. For instance, if a policy document literally contains the text “Encryption: All customer data at rest must be encrypted”, and another contains a passing mention “…to ensure compliance, encryption is used…”, a semantic ranker can determine which is a more direct answer to “How do we encrypt customer data?” (likely the first one) even if both have the keyword “encrypt”. It can also extract a relevant sentence as an answer (caption) from the document. When to use: Semantic search is ideal when your queries are in natural language and you want the search to act more like a smart Q&A, highlighting the most answer-like content. We use it in our system to improve the quality of results sent to GPT-4. Pure semantic search (without vectors) would involve Azure Search doing a keyword search then semantic rerank; this is useful if you cannot generate embeddings or have very short documents. But it might miss content if wording differs significantly. Thus, we combine it with vector search for the best of both. Semantic search adds computational overhead, so you’d enable it when you need that extra relevance boost.
- Hybrid Search (Keyword + Vector): Hybrid search is a combination of keyword and vector searches. Azure AI Search’s hybrid retrieval will perform both approaches and merge the results, often yielding better coverage and accuracy (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub). The reasoning is that keyword and vector search have complementary strengths: “Vector retrieval semantically matches queries to passages… less sensitive to synonyms or phrasing, while keyword search prioritizes specific important words that might be lost in embedding” (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub). By doing both, hybrid search ensures that documents with strong keyword matches aren’t overlooked, and documents that are relevant by meaning (even without keyword overlap) are included. Azure’s implementation uses Reciprocal Rank Fusion (RRF) to merge, which means it takes into account the ranks from each method to produce a unified ranking (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub). Our architecture uses hybrid search heavily. When to use: In most RAG use-cases (especially with diverse query styles), hybrid tends to outperform either method alone. Microsoft’s experiments showed that hybrid retrieval improves metrics like recall and NDCG for a variety of query types (conceptual queries, factual queries, etc.), and when you add semantic ranking on top, it achieves the best grounding results for generative AI (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub). We would use pure vector search only if we are confident that lexical matches aren’t needed at all (for example, if the data is such that only semantic similarity matters). Conversely, we might use pure keyword search in a scenario where user queries are expected to contain exact terms (like searching a code or an ID). But for open-ended questions in natural language, hybrid + semantic is recommended (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub). In Azure AI Search, you might explicitly specify vector and search parameters together, or use the predefined query types (vectorSimpleHybrid for vector+BM25, and vectorSemanticHybrid for vector+BM25+semantic). In our implementation, we go with vectorSemanticHybrid to maximize relevant context retrieval.
To summarize the trade-offs: Vector search gives broad semantic recall, Keyword search gives precise lexical matching, Semantic search gives intelligent relevance ranking, and Hybrid search gives a balance of recall and precision by leveraging both vector and keyword (and can be further improved with semantic rerank). By using hybrid retrieval with semantic ranking in Azure AI Search, our system captures the strengths of each method – ensuring the LLM gets the most relevant and comprehensive context to answer user questions accurately. This approach is in line with current best practices for RAG systems on Azure: experiments confirm that chunked content + hybrid retrieval + semantic reranking yields significantly higher quality in grounded answers (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub).
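To make the contrast concrete, the sketch below issues the same question in the four modes discussed above, reusing the search client and embed() helper from earlier sketches (parameter names follow the azure-search-documents Python SDK, 11.4+; the semantic configuration name is an assumption):

```python
from azure.search.documents.models import VectorizedQuery

q = "data encryption policy"
vq = VectorizedQuery(vector=embed(q), k_nearest_neighbors=50, fields="content_vector")

keyword_only = search_client.search(search_text=q, top=5)                        # BM25 only
vector_only = search_client.search(search_text=None, vector_queries=[vq], top=5)  # ANN only
hybrid = search_client.search(search_text=q, vector_queries=[vq], top=5)          # RRF fusion
hybrid_semantic = search_client.search(                                           # fusion + reranker
    search_text=q, vector_queries=[vq],
    query_type="semantic", semantic_configuration_name="policy-semantic-config", top=5)
```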
Azure Solution vs. Open-Source Alternatives
It’s valuable to compare this Azure-based RAG architecture with alternative solutions using open-source or other technologies for document indexing, vector storage, and semantic retrieval. Here we highlight how Azure’s managed services stack up against some popular options:
- Azure AI Search vs. Elasticsearch/OpenSearch: Elasticsearch is a well-known open-source search engine that can be configured to handle vector search in addition to text search (Open Distro and OpenSearch, or Elastic 8.x, have k-NN vector capabilities). In an open-source RAG implementation, one might use Elasticsearch to index documents and store embeddings (as dense vector fields). Elasticsearch allows combining vector similarity scores with BM25 scores in a hybrid manner (for example, using script scoring or a feature fusion) and can integrate with external ML models for re-ranking. Elastic has even introduced their own learned sparse encoder model to enhance semantic search on text (Choosing the right Azure Vector Database | by Michael John Peña | Medium). The key differences: Azure AI Search is a fully managed service – you don’t have to manage server instances or scaling, and it’s integrated with Azure security (AAD, Private Link, etc.). It also has built-in skills for OCR, text extraction, and an easy way to call OpenAI for embeddings in the indexing pipeline. With Elasticsearch, you have more manual work: you’d need to set up an ETL pipeline (or use tools like Haystack or custom code) to parse documents, chunk them, generate embeddings (perhaps using Python libraries or an API), index into ES, and then handle queries by manually combining results. Elasticsearch might give more flexibility in fine-tuning scoring algorithms or deploying on-premises. However, Azure Search’s one-stop capabilities (text + vector + semantic in one service) simplify the architecture a lot (Choosing the right Azure Vector Database | by Michael John Peña | Medium). If an organization is already invested in the Elastic stack, they could implement a similar RAG solution but would need to maintain that infrastructure. Azure’s solution is likely faster to implement and less maintenance, at the cost of being tied to Azure’s platform and pricing.
- Vector Databases (Qdrant, Weaviate, Milvus, Vespa, etc.): In recent years, dedicated vector databases have emerged. These include Qdrant, Weaviate, Milvus (all open-source), and others like Pinecone (hosted). They specialize in storing embeddings and performing similarity search very efficiently. For example, Qdrant provides a simple API to upsert vectors with payloads (metadata) and query by vector similarity (Choosing the right Azure Vector Database | by Michael John Peña | Medium); a brief Qdrant sketch follows this list. Weaviate offers vector search with a GraphQL interface and even has modules for text classification or using Transformers for on-the-fly embedding. Vespa (by Yahoo/Oath) is another engine that can do combined vector and keyword search with advanced ranking features, comparable to Azure Search in capability. Using a standalone vector DB in a RAG solution means you often pair it with a separate text search engine or use it primarily for semantic search and rely on metadata filters for any keyword constraints. For instance, you might store all policy chunks in Qdrant (each with an embedding and some text metadata) and query it by vector; if needed, you could also store the text in the same DB and do a brute-force text filter, but that’s not as powerful as a real search engine’s text query. Some vector DBs (like Weaviate) do offer basic BM25 text search as well, but their core strength is vectors. Comparison: Azure AI Search combines both vector and text search, so it eliminates the need to maintain two systems. On the other hand, if your use-case required extremely custom vector operations or if you wanted to experiment with different ANN algorithms beyond HNSW, a specialized vector DB might offer that flexibility. Open-source vector DBs can be deployed on your own infrastructure or even on Azure VMs/Kubernetes (Azure has samples for deploying Qdrant, Milvus, etc., in its marketplace (Choosing the right Azure Vector Database | by Michael John Peña | Medium)). Those might be cost-effective at very large scale, since you could avoid certain per-document costs of Azure Search by managing infrastructure yourself. However, you lose the tight integration with Azure OpenAI and the ease of the skillset pipeline. A hybrid approach some teams take is using a vector DB for retrieval and then using an LLM to rerank or directly answer, but replicating Azure’s semantic ranker would require your own model (like using a MiniLM or BERT cross-encoder to score relevance).
- Azure’s Integrated Approach: Azure’s RAG solution (Search + OpenAI) is highly integrated. For example, Microsoft provides sample architectures and even end-to-end solutions (in Python, .NET, etc.) to set this up (RAG and generative AI - Azure AI Search | Microsoft Learn). The integration can go as far as having Azure OpenAI handle the retrieval step internally when you provide a data source. Open-source alternatives require stitching components together: you’d use something like LangChain or LlamaIndex (GPT Index) as the orchestration layer to connect a vector store, a text store, and an LLM. Those libraries do make it easier (for instance, LlamaIndex can treat an Elastic index or a Weaviate instance as a knowledge source), but it’s code you maintain. Azure’s platform gives you a more managed experience, at the expense of being less customizable under the hood. That said, Azure Cognitive Search is quite tunable (scoring profiles, analyzers for text, etc.) and its new capabilities (like built-in chunking/embedding) reduce the amount of glue code needed.
- Semantic Retrieval and Ranking: When it comes to semantic ranking (the second-stage re-rank), Azure’s semantic search uses a proprietary model (likely a version of MS MARCO MEB or Turing NLR) that’s been optimized for search relevance. An open-source equivalent would be using a model like ms-marco-MiniLM or cross-encoder/ms-marco-electra from HuggingFace to rerank top results. You’d have to run that model inference yourself on the top results from your search engine. Projects like Haystack provide components to do this. This adds complexity and compute cost (you’d need a GPU server or a fast inference endpoint). Azure does this behind the scenes for you with semantic search. On the flip side, doing it yourself means you could choose which model to use or even fine-tune it on your data. Azure’s model is general-purpose and not tunable by the user (aside from slight config). If extremely high recall or custom ranking is needed, an open approach might be better. But for most cases, Azure’s semantic ranker is a strong out-of-the-box solution.
- Scalability and Performance: Both Azure Search and engines like Elastic or Milvus are designed to scale to large datasets, but their scaling models differ. Azure Search lets you scale by partitions/replicas (with limits per service tier) (Elasticsearch VS Azure Search: Overall Comparison and Performance Study | by Webuters Technologies | Medium) – you pay for the units and Microsoft manages distribution. With Elastic/Weaviate you’d horizontally scale your cluster nodes. In terms of performance tuning, open-source might allow more low-level tweaks (shard counts, ANN index parameters). Azure Search does allow some control (you can define your HNSW vectorSearchProfile with parameters, choose the vector distance metric, etc.). For most users, Azure’s defaults will be fine. If an organization already has a big data infrastructure, they might integrate RAG with that (for example, vector search using OpenSearch and storing indices on-prem). Azure’s solution particularly shines when you are already in the Azure ecosystem and want a quick, reliable implementation.
- Cost Considerations: Azure Cognitive Search is a paid service (with cost based on the number of search units, which scale with data size and query load). Azure OpenAI has its own cost per 1K tokens for embeddings and Chat completion. In contrast, using open-source on your own servers shifts the cost to infrastructure (VMs, maintenance, possibly cheaper if you have spare capacity or cheaper cloud options). If the volume of data or queries is very high, teams might evaluate cost trade-offs. Sometimes a hybrid approach is used: e.g., keep an offline vector store for huge data and use Azure Search for a smaller curated index. But given our scenario (internal documents, likely a manageable size), the convenience of Azure likely outweighs cost differences. And Azure’s built-in skillsets for OCR, etc., also save implementation effort.
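For comparison, the sketch below shows roughly what the vector-store side of such a system might look like with Qdrant and the qdrant-client package (the collection name, payload fields, and local endpoint are assumptions, and the embed() helper from the earlier sketches is reused). As discussed above, keyword search and semantic reranking would still require separate components alongside it.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

qdrant = QdrantClient(url="http://localhost:6333")  # self-hosted Qdrant instance (assumed)
qdrant.create_collection(
    collection_name="policy_chunks",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE))

# Upsert one chunk with its embedding and traceability metadata as payload.
qdrant.upsert(collection_name="policy_chunks", points=[
    PointStruct(id=1,
                vector=embed("All customer data at rest is encrypted with AES-256."),
                payload={"doc_title": "Cloud Security Policy",
                         "section": "Encryption at Rest"}),
])

# Pure vector retrieval by semantic similarity to the question.
hits = qdrant.search(collection_name="policy_chunks",
                     query_vector=embed("How do we encrypt customer data?"),
                     limit=3)
```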
In conclusion, Azure’s RAG architecture provides an end-to-end, integrated solution: Azure AI Search acts as both a vector store and search index, with built-in AI enrichment, and Azure OpenAI provides world-class language models. This tightly integrated approach is especially practical for enterprise scenarios where security, reliability, and support are important. Open-source alternatives like Elasticsearch, Qdrant, Weaviate, or Vespa can absolutely be used to build a similar system, and in some cases might offer more flexibility or avoid vendor lock-in. However, they would require more engineering work to achieve the same level of functionality. As one analysis put it: “Azure Cognitive Search is the one-stop shop if you want to go all-in on vectors on Azure… Elastic is a like-to-like contender if you manage it yourself, and for specialized vector operations you could host Pinecone, Milvus, Qdrant, etc.” (Choosing the right Azure Vector Database | by Michael John Peña | Medium). The choice often comes down to where your data and apps already reside, and how much you value a managed service. For a company already using Azure (and given the need to keep internal data secure within a trusted cloud), leveraging Azure AI Search with Azure OpenAI is an excellent fit.
Conclusion
The described Retrieval-Augmented Generation system architecture provides a powerful way to harness internal company knowledge with the capabilities of generative AI. By using Azure AI Search to index and retrieve document chunks, and Azure OpenAI’s GPT-4 to generate answers, the system can deliver accurate, context-aware responses to users’ questions about internal policies. We covered how documents flow through an ingestion pipeline (from Blob storage, through parsing, chunking, and embedding, into the search index) and how at query time the relevant pieces of information are fetched (using vector, semantic, and hybrid search techniques) and fed into the LLM to produce a grounded answer. Key Azure services – Search, OpenAI, Blob, Functions – work in concert to achieve this, each handling what it’s best at (search/indexing, language understanding, storage, and glue logic respectively).
This RAG approach ensures answers are not just fluent, but also correct and specific to the company’s data. Compared to a standalone LLM, it dramatically reduces hallucinations and allows the system to answer questions that the base model wouldn’t know (since it injects proprietary knowledge). We also discussed how Azure’s solution compares to open-source solutions, noting that while one could assemble the pieces oneself, Azure provides an integrated and enterprise-ready path. In practice, the architecture can be expanded or adapted – for example, adding Azure AD authentication so that the search results respect user permissions, or using Azure Monitor to log and analyze which questions are asked and which documents are used (helpful for improving the system). But at its core, the architecture remains: Index your knowledge, retrieve relevant context, and let the LLM answer using that context.
By following this architecture, the company can deploy a virtual assistant that expertly answers cloud security policy questions (or any internal policy queries), with the confidence that the answers are grounded in the latest official documentation. This not only saves time for employees seeking information but also ensures consistent and accurate messaging aligned with the company’s policies. It’s a prime example of how AI can be safely and effectively leveraged in a domain-specific way by combining strengths of information retrieval and generation.
Sources: The design and recommendations above are based on Microsoft’s documentation and best practices for Azure Cognitive Search and RAG, as well as industry comparisons. Azure AI Search is recommended for RAG due to its integrated vector and hybrid search capabilities and seamless Azure integration (Azure AI search for RAG use cases | by Anurag Chatterjee | Medium). The ingestion process of chunking and embedding follows guidelines for handling large documents and keeping within model token limits (Chunk documents in vector search - Azure AI Search | Microsoft Learn). Using hybrid retrieval with semantic reranking is backed by Azure’s research showing it improves result relevance for generative AI applications (Azure AI Search: Outperforming vector search with hybrid retrieval and reranking | Microsoft Community Hub). Open-source comparisons are drawn from community discussions and Azure’s own guidance on vector database options (Choosing the right Azure Vector Database | by Michael John Peña | Medium), illustrating that while alternatives exist, Azure’s managed approach offers a convenient one-stop solution for enterprise needs. By adhering to these principles and configurations, the architecture achieves a balance of accuracy, clarity, and practicality in delivering LLM-generated answers grounded in domain-specific content.