RAG Beyond the Basics: GraphRAG, Hybrid Search, and What Actually Works
By Gennoor Tech · January 19, 2026
Advanced RAG techniques that actually improve production quality: GraphRAG for relationship-aware retrieval, hybrid search combining vector and keyword approaches, re-ranking for precision, and agentic retrieval for multi-step queries.
Why Basic RAG Hits the Ceiling
Retrieval-Augmented Generation (RAG) has become the standard pattern for grounding large language models in enterprise knowledge. The basic pattern—chunk documents, embed them, retrieve top-k similar chunks, pass to LLM—works remarkably well for simple question-answering over documents. But as organizations scale RAG systems to production, they encounter fundamental limitations that basic vector similarity search cannot overcome.
The first limitation is semantic drift. Vector embeddings capture local semantic meaning, but they struggle with queries that require understanding relationships across multiple documents, temporal sequences, or hierarchical structures. A question like "How has our pricing strategy evolved over the past three years and how does it correlate with customer churn?" requires connecting information across quarterly reports, pricing announcements, and customer data—something pure vector similarity handles poorly.
The second limitation is keyword blindness. Semantic embeddings excel at conceptual similarity but fail on exact matches. If a user searches for "ISO 27001 certification" or a specific product SKU like "SKU-2847-B", semantic search may return conceptually related but factually incorrect results. In regulated industries—healthcare, finance, legal—this is unacceptable.
The third limitation is context collapse. When you chunk a 100-page technical specification into 500-token segments, you lose document structure, cross-references, and the relationships between sections. The retrieved chunks lack the surrounding context that humans use to interpret technical documentation.
Advanced RAG patterns solve these problems by moving beyond naive vector similarity to incorporate graph structures, hybrid retrieval, intelligent re-ranking, and context-aware chunking strategies. Organizations implementing these patterns commonly report 40-60% improvements in retrieval accuracy and dramatic reductions in hallucination rates.
GraphRAG: Adding Structure to Semantic Search
GraphRAG, pioneered by Microsoft Research, addresses the relationship problem by extracting an explicit knowledge graph from your document corpus. Instead of treating documents as isolated chunks, GraphRAG identifies entities (people, products, concepts), relationships (reports to, depends on, contradicts), and creates a queryable graph structure alongside your vector embeddings.
The architecture has three components. First, an entity extraction pipeline uses NER (Named Entity Recognition) and relation extraction models to identify structured information in your documents. For enterprise content, you often need custom entity types—not just person/organization/location, but product names, regulatory codes, internal project names, and domain-specific concepts.
Second, a graph database (Neo4j, Amazon Neptune, Azure Cosmos DB with Gremlin API) stores entities and relationships. Each node contains not just structured attributes but also vector embeddings of the entity description, enabling hybrid graph-vector traversal.
Third, a query planning layer translates natural language questions into graph traversal + vector search operations. A question like "Which products are affected by the Q3 supply chain issues?" becomes: (1) semantic search for "supply chain issues Q3", (2) extract mentioned entities, (3) graph traversal to find connected product nodes, (4) retrieve detailed chunks about those specific products.
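The graph-traversal step of that plan can be sketched with a toy in-memory graph. This is a minimal illustration only: a plain adjacency list and BFS stand in for a real graph database, and every entity name below is invented for the example.

```python
from collections import deque

# Toy knowledge graph: adjacency list of (neighbor, relation) pairs,
# plus an entity type per node. All names are illustrative.
EDGES = {
    "supply_chain_issue_q3": [("vendor_acme", "caused_by")],
    "vendor_acme": [("product_widget_a", "supplies"), ("product_widget_b", "supplies")],
    "product_widget_a": [],
    "product_widget_b": [],
}
NODE_TYPE = {
    "supply_chain_issue_q3": "event",
    "vendor_acme": "vendor",
    "product_widget_a": "product",
    "product_widget_b": "product",
}

def affected_products(start, max_hops=2):
    """BFS out from an event node; collect product entities within max_hops."""
    seen, frontier = {start}, deque([(start, 0)])
    products = []
    while frontier:
        node, depth = frontier.popleft()
        if NODE_TYPE.get(node) == "product":
            products.append(node)
        if depth < max_hops:
            for neighbor, _relation in EDGES.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, depth + 1))
    return sorted(products)

print(affected_products("supply_chain_issue_q3"))
# -> ['product_widget_a', 'product_widget_b']
```

Both products reach the Q3 event through the shared vendor node in two hops, which is exactly the kind of connection pure vector similarity misses. In production, the same traversal would be a Cypher or Gremlin query against Neo4j or Neptune.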
GraphRAG excels for multi-hop reasoning. Microsoft's implementation on a 1M-document corporate knowledge base showed that questions requiring 2+ reasoning steps had 73% accuracy with GraphRAG versus 41% with standard RAG. The performance gap widens as question complexity increases.
Implementation Considerations
Building GraphRAG requires upfront investment. Entity extraction quality determines graph quality—garbage in, garbage out. For specialized domains, you need to fine-tune extraction models or use LLM-based extraction with careful prompt engineering. Expect 2-4 weeks of iteration to get entity schemas and extraction pipelines production-ready.
Graph maintenance is non-trivial. As documents update, you must propagate changes through the graph, merging entities and updating relationships. Implementing proper graph versioning and change tracking is essential for production systems. Budget for ongoing curation—especially in the first 3-6 months as you refine entity types and relationship definitions.
At Gennoor Tech, we help organizations implement GraphRAG through our Azure AI training programs, covering entity extraction, graph design, and production deployment patterns for enterprise knowledge systems.
Standard RAG: vector similarity only; 41% accuracy on multi-hop questions; struggles with exact matches and relationships.
GraphRAG: entity-aware retrieval; 73% accuracy on multi-hop questions; excels at relationship-heavy knowledge.
Hybrid search + re-ranking: combined keyword and semantic search with cross-encoder re-ranking; 15-30% precision improvement.
Hybrid Search: Combining Keyword and Semantic Retrieval
Hybrid search solves the keyword blindness problem by combining traditional keyword search (BM25, Elasticsearch) with semantic vector search, then intelligently fusing the results. The pattern is simple but powerful: run both searches in parallel, normalize scores, and merge using a weighted combination or learned ranking model.
Azure AI Search (formerly Cognitive Search) provides built-in hybrid search with configurable semantic-keyword weighting. A typical configuration might be 60% semantic, 40% keyword for general knowledge retrieval, but 80% keyword, 20% semantic for compliance documents where exact term matches are critical.
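Azure AI Search performs this fusion internally, but the general pattern is easy to sketch. The following is an illustrative implementation assuming min-max normalization before blending; the document IDs and scores are invented, and real BM25 and cosine scores would come from your retrievers.

```python
def min_max(scores):
    """Rescale a {doc_id: raw_score} map to [0, 1]; a degenerate
    single-value distribution maps everything to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def weighted_fusion(semantic, keyword, w_semantic=0.6):
    """Normalize each retriever's scores, then blend with a configurable
    weight; documents missing from one retriever score 0 there."""
    sem, kw = min_max(semantic), min_max(keyword)
    docs = set(sem) | set(kw)
    fused = {d: w_semantic * sem.get(d, 0.0) + (1 - w_semantic) * kw.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative cosine scores vs BM25 scores (incomparable raw ranges).
semantic_scores = {"doc_a": 0.82, "doc_b": 0.79, "doc_c": 0.40}
keyword_scores = {"doc_b": 14.2, "doc_d": 11.0, "doc_a": 2.1}
print(weighted_fusion(semantic_scores, keyword_scores))
```

Here doc_b wins because it ranks highly in both retrievers, even though neither retriever put it first; that mutual-reinforcement effect is the point of hybrid fusion.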
The reciprocal rank fusion (RRF) algorithm provides a simple, effective way to merge result sets without requiring score calibration. RRF assigns each result a score of 1/(k + rank) in every list where it appears, where k is a smoothing constant (60 in the original formulation), then sums those scores across retrievers to produce final rankings. This avoids the problem of comparing raw scores from different retrieval systems with incompatible score distributions.
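RRF is short enough to implement in a few lines. A minimal pure-Python sketch, with invented document IDs:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists: each doc earns 1/(k + rank) per list it
    appears in; k=60 is the constant used in the original RRF paper."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_3", "doc_1", "doc_7"]
semantic_hits = ["doc_1", "doc_9", "doc_3"]
print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
# -> ['doc_1', 'doc_3', 'doc_9', 'doc_7']
```

Documents appearing in both lists (doc_1, doc_3) outrank documents appearing in only one, regardless of the raw scores each retriever produced.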
For production systems, implement query classification to dynamically adjust hybrid weights. Use a small classification model (fine-tuned BERT, few-shot GPT-4o-mini) to categorize incoming queries as factual/conceptual/navigational, then apply appropriate keyword-semantic weights. Factual queries ("What is our GDPR data retention policy?") lean keyword-heavy; conceptual queries ("How do we approach customer retention?") lean semantic-heavy.
Metadata Filtering and Faceted Search
Hybrid search becomes dramatically more effective when combined with metadata filtering. Instead of searching across your entire corpus, pre-filter by document type, date range, department, security classification, or custom taxonomies before executing semantic/keyword retrieval.
Structure your index with rich metadata: doc_type, created_date, last_modified, department, security_level, document_stage (draft/approved/archived), tags, author, and domain-specific fields. Then expose filters through your query interface: "Search technical documentation from 2024 and 2025 where security_level is internal or public."
Metadata filtering typically improves retrieval precision by 25-40% while reducing irrelevant results and speeding up queries by limiting the search space. For large corpora (1M+ documents), proper filtering is the difference between acceptable and unusable performance.
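In index services like Azure AI Search, this pre-filtering is expressed as a filter clause on the query; the logic itself is just a predicate over the metadata fields listed above. A self-contained sketch with invented chunk records:

```python
from datetime import date

# Illustrative chunk index carrying the metadata fields described above.
CHUNKS = [
    {"id": "c1", "doc_type": "technical_doc", "security_level": "internal",
     "created_date": date(2024, 6, 1), "text": "Deployment steps ..."},
    {"id": "c2", "doc_type": "technical_doc", "security_level": "confidential",
     "created_date": date(2025, 2, 10), "text": "Key rotation ..."},
    {"id": "c3", "doc_type": "marketing", "security_level": "public",
     "created_date": date(2024, 9, 3), "text": "Launch announcement ..."},
]

def prefilter(chunks, doc_type, allowed_levels, after):
    """Narrow the candidate set before any vector or keyword search runs."""
    return [c for c in chunks
            if c["doc_type"] == doc_type
            and c["security_level"] in allowed_levels
            and c["created_date"] >= after]

candidates = prefilter(CHUNKS, "technical_doc", {"internal", "public"},
                       date(2024, 1, 1))
print([c["id"] for c in candidates])  # -> ['c1']
```

Only the surviving candidates are embedded-searched, which is where the precision and latency gains come from on large corpora.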
Re-Ranking: Two-Stage Retrieval for Quality
Re-ranking implements a two-stage retrieval pattern: fast first-stage retrieval returns 50-100 candidates, then a slower, more accurate second-stage model re-ranks the top candidates. This architecture balances speed and quality—fast retrieval handles the bulk of filtering, expensive re-ranking focuses compute on the most promising results.
First-stage retrieval uses efficient vector search (FAISS, Azure AI Search) or hybrid search to quickly narrow from millions of chunks to 50-100 candidates. This stage prioritizes recall—cast a wide net to ensure relevant documents are in the candidate set.
Second-stage re-ranking uses cross-encoder models (like ms-marco-MiniLM or bge-reranker-large) that process query-document pairs jointly, capturing fine-grained relevance signals that bi-encoder embeddings miss. Cross-encoders are 10-50x slower than vector similarity but 15-30% more accurate, making them perfect for re-ranking small candidate sets.
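The two-stage flow can be sketched independently of any particular model. In the illustration below the cross-encoder is abstracted behind a scoring callable; the toy word-overlap scorer is a stand-in so the example runs without a model download. In practice you would pass something like sentence-transformers' `CrossEncoder.predict` in its place.

```python
def two_stage_retrieve(query, first_stage, rerank_score, n_candidates=50, top_k=3):
    """Stage 1: a cheap retriever returns many candidates (recall-oriented).
    Stage 2: an expensive scorer re-orders only those candidates (precision)."""
    candidates = first_stage(query, n_candidates)
    scored = [(rerank_score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _score, doc in scored[:top_k]]

# Toy stand-ins: first stage returns the whole (tiny) corpus;
# the "reranker" scores by query-document word overlap.
CORPUS = ["reset your password in account settings",
          "password policy requires 12 characters",
          "quarterly revenue grew by eight percent"]

def toy_first_stage(query, n):
    return CORPUS[:n]

def toy_overlap_score(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

print(two_stage_retrieve("how do I reset my password",
                         toy_first_stage, toy_overlap_score, top_k=2))
```

The structure is the important part: the expensive scorer only ever sees the 50-100 first-stage candidates, never the full corpus.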
Cohere offers a re-ranking API that's production-ready out of the box. Azure AI Search includes semantic re-ranking using Microsoft's proprietary models. For full control, deploy open-source re-rankers (BGE, ColBERT) on Azure Container Apps or AKS.
Advanced Re-Ranking Strategies
Beyond simple cross-encoder re-ranking, consider LLM-as-a-judge re-ranking for complex domains. Use GPT-4o or Claude to evaluate whether each retrieved chunk actually answers the user's question, providing a relevance score and explanation. This is expensive (5-10 cents per query for 20 candidates) but achieves state-of-the-art quality for high-value use cases—regulatory compliance, medical question-answering, legal research.
Implement diversity re-ranking to avoid redundant results. Standard retrieval often returns multiple chunks from the same document or near-duplicate content. Diversity algorithms such as maximal marginal relevance (MMR) penalize similarity between retrieved chunks, ensuring you return distinct information. This is especially important for summarization tasks where you want comprehensive coverage.
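A minimal MMR sketch, assuming relevance and pairwise-similarity scores have already been computed (the scores below are invented: d1 and d2 are near-duplicates, d3 is distinct):

```python
def mmr(query_sim, doc_sims, candidates, top_k=3, lam=0.7):
    """Maximal marginal relevance: greedily pick documents that balance
    relevance to the query against similarity to already-selected docs.
    query_sim[d] is doc d's relevance; doc_sims[(a, b)] is pairwise similarity."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < top_k:
        def score(d):
            redundancy = max((doc_sims.get((d, s), doc_sims.get((s, d), 0.0))
                              for s in selected), default=0.0)
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

query_sim = {"d1": 0.90, "d2": 0.88, "d3": 0.75}
doc_sims = {("d1", "d2"): 0.95, ("d1", "d3"): 0.10, ("d2", "d3"): 0.12}
print(mmr(query_sim, doc_sims, ["d1", "d2", "d3"], top_k=2))  # -> ['d1', 'd3']
```

Pure relevance ranking would return d1 then d2; MMR skips the near-duplicate d2 in favor of the distinct d3. The lambda parameter trades off relevance (1.0) against diversity (0.0).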
Multi-Index Architectures and Specialized Retrievers
Enterprise knowledge exists in multiple formats—documents, structured databases, code repositories, wikis, tickets, emails. A single-index RAG system treats everything as unstructured text, losing the specialized retrieval patterns each format requires.
Multi-index architectures maintain separate indexes for different content types, each optimized for its format. A typical enterprise setup includes: (1) document index with hybrid search, (2) structured data index (SQL, knowledge graph), (3) code index with syntax-aware search, (4) conversation index for support tickets and emails with thread-aware retrieval.
The query routing layer classifies incoming questions and dispatches to appropriate indexes. "What is our refund policy?" → document index. "Show me all customers who churned after price increase" → structured data index with SQL generation. "How does the authentication module work?" → code index with semantic code search.
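The routing layer is a classifier behind a dispatch table. The sketch below uses hypothetical regex rules so it runs standalone; a production router would use a fine-tuned classifier or few-shot LLM call, as described earlier for hybrid-weight selection.

```python
import re

# Hypothetical routing rules: pattern -> index name. Illustrative only;
# a production router would use a classifier model rather than regexes.
ROUTES = [
    (re.compile(r"\b(policy|guide|documentation|how do we)\b", re.I),
     "document_index"),
    (re.compile(r"\b(show me|list all|how many|count)\b", re.I),
     "structured_data_index"),
    (re.compile(r"\b(module|function|class|code|implement)\b", re.I),
     "code_index"),
]

def route(query, default="document_index"):
    """Dispatch a query to the first matching specialized index."""
    for pattern, index_name in ROUTES:
        if pattern.search(query):
            return index_name
    return default

print(route("What is our refund policy?"))                      # document_index
print(route("Show me all customers who churned last quarter"))  # structured_data_index
print(route("How does the authentication module work?"))        # code_index
```

The default route matters: ambiguous queries should fall through to the broadest index rather than fail.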
Implement specialized retrievers for each content type. For code, use tree-sitter parsing to index at function/class granularity with symbol-aware embeddings. For structured data, use text-to-SQL or text-to-Cypher models to generate precise database queries instead of keyword search. For emails/tickets, maintain conversation threads and retrieve entire threads instead of isolated messages.
Chunk Optimization and Parent-Child Retrieval
Chunking strategy has enormous impact on RAG quality. The default approach—split documents into 500-token overlapping chunks—is simple but suboptimal. It fragments logical sections, splits tables and code blocks, and loses document structure.
Semantic chunking uses NLP to identify natural document boundaries—section headings, topic shifts, paragraph breaks—rather than fixed token counts. Libraries like LangChain and LlamaIndex provide semantic splitters that preserve logical coherence. Expect 20-30% better retrieval quality compared to naive fixed-size chunking.
Parent-child retrieval indexes small chunks for retrieval precision but returns larger parent contexts for generation. Store 200-token child chunks with embeddings, but maintain links to 1,000-token parent sections. When a child chunk matches, retrieve its parent for LLM context. This gives you narrow retrieval (matching specific facts) with broad context (surrounding explanation).
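The lookup side of parent-child retrieval is a simple mapping, sketched below with invented section text. The key detail is deduplication: several matched children often share one parent, and the parent section should appear in the LLM context only once.

```python
# Illustrative parent sections and their child chunks. Child chunks are
# what gets embedded and matched; parents are what the LLM actually sees.
PARENTS = {
    "p1": "Section 4: Backup and Restore. Full nightly backups run at 02:00. "
          "Restores require an admin ticket and take up to four hours.",
}
CHILDREN = [
    {"id": "c1", "parent_id": "p1", "text": "Full nightly backups run at 02:00."},
    {"id": "c2", "parent_id": "p1", "text": "Restores require an admin ticket."},
]

def retrieve_parent_context(matched_child_ids):
    """Map matched child chunks to their parent sections, deduplicating
    so no parent section enters the LLM context twice."""
    seen, contexts = set(), []
    for child in CHILDREN:
        if child["id"] in matched_child_ids and child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            contexts.append(PARENTS[child["parent_id"]])
    return contexts

# Two child matches resolve to one parent, returned once.
print(len(retrieve_parent_context({"c1", "c2"})))  # -> 1
```

LangChain and LlamaIndex both ship this pattern as a built-in retriever, but the underlying mechanics are no more than this child-to-parent join.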
For technical documentation, implement structured chunking that respects document hierarchy. Each chunk includes metadata about its position: doc_title, section, subsection, chunk_index. When generating responses, the LLM sees not just chunk content but structural context: "From section 3.2 'Security Configuration' of the Deployment Guide..."
HyDE: Hypothetical Document Embeddings
HyDE (Hypothetical Document Embeddings) inverts the retrieval problem. Instead of embedding the user's question and searching for similar documents, HyDE asks the LLM to generate a hypothetical answer to the question, embeds that answer, and searches for documents similar to the hypothetical answer.
Why does this work? User questions are often short, vague, and use different vocabulary than documentation. Hypothetical answers are longer, detailed, and written in documentation style—making them more similar to actual documents. A question like "How do I configure SSO?" generates a hypothetical answer: "To configure single sign-on, navigate to Settings → Authentication → SSO Configuration. Enter your identity provider metadata URL and configure SAML assertions..." This hypothetical answer embeds much closer to the actual documentation than the original question.
HyDE adds latency (one extra LLM call) and cost but typically improves retrieval accuracy by 15-25%, especially for vague or poorly-worded questions. Implement HyDE selectively—use query classification to identify ambiguous questions that benefit from hypothetical answer generation.
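The HyDE flow itself is only a few lines. In the sketch below, a stub stands in for the LLM call and a bag-of-words counter stands in for a real embedding model, purely so the example is self-contained; the corpus text and stub answer are invented.

```python
from collections import Counter
import math

def toy_embed(text):
    """Bag-of-words stand-in for a real embedding model (illustrative only)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyde_search(question, generate_hypothetical, corpus):
    """HyDE: embed an LLM-generated hypothetical answer, not the question,
    then retrieve the document closest to that hypothetical."""
    hypothetical = generate_hypothetical(question)
    query_vec = toy_embed(hypothetical)
    return max(corpus, key=lambda doc: cosine(query_vec, toy_embed(doc)))

# Stub in place of a real LLM call.
def stub_llm(question):
    return ("To configure single sign-on, open Settings, choose Authentication, "
            "then SSO Configuration, and enter the identity provider metadata URL.")

corpus = [
    "SSO Configuration: under Settings > Authentication, enter the identity "
    "provider metadata URL and configure SAML assertions.",
    "Billing: invoices are generated on the first of each month.",
]
print(hyde_search("How do I configure SSO?", stub_llm, corpus))
```

The short question alone shares few terms with the documentation; the longer hypothetical answer shares many, which is why it retrieves the right document here.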
Query Transformation and Decomposition
Complex questions often need transformation before retrieval. A question like "Compare our Q4 performance to Q3 and explain the causes of any revenue decline" needs decomposition into sub-queries: (1) retrieve Q4 performance data, (2) retrieve Q3 performance data, (3) compare metrics, (4) retrieve context about business changes between Q3 and Q4.
Query decomposition uses an LLM to break complex questions into simpler sub-queries, execute retrievals in parallel or sequence, then synthesize results. This pattern handles multi-hop reasoning, comparisons, and analytical questions that single-shot retrieval cannot address.
Query expansion generates alternative phrasings or related terms. "machine learning deployment" expands to include "ML model serving", "inference infrastructure", "production ML systems". Execute retrievals for all variations and merge results. This compensates for vocabulary mismatch between user queries and document language.
Step-back prompting generates higher-level questions before retrieval. For "How do I debug OAuth token expiration in production?", first retrieve on the step-back question "How does OAuth token lifecycle work?" to get conceptual grounding, then retrieve on the specific debugging question. This two-phase retrieval provides both conceptual foundation and specific answers.
Evaluating Advanced RAG Systems
Advanced RAG requires rigorous evaluation. Start with a golden dataset of question-answer pairs curated by domain experts. For each question, annotate which documents/chunks should be retrieved (retrieval ground truth) and the correct answer (generation ground truth).
Measure retrieval metrics: precision@k (what fraction of top-k results are relevant), recall@k (what fraction of relevant documents appear in top-k), MRR (mean reciprocal rank—position of first relevant result), and NDCG (normalized discounted cumulative gain—rewards relevant results at top ranks).
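The first three of those metrics are straightforward to compute over your golden dataset. A minimal sketch with invented document IDs:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top-k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d4", "d1", "d9", "d2"]   # system output, best first
relevant = {"d1", "d2", "d3"}          # golden-dataset ground truth

print(round(precision_at_k(retrieved, relevant, 3), 3))  # 1 of top-3 relevant
print(round(recall_at_k(retrieved, relevant, 3), 3))     # 1 of 3 relevant found
print(mrr(retrieved, relevant))                          # first hit at rank 2 -> 0.5
```

In a full evaluation harness these per-query values are averaged across the golden dataset, and MRR generalizes to NDCG when you have graded rather than binary relevance labels.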
Measure generation metrics: factual accuracy (does the answer contain correct information from retrieved context), groundedness (is the answer supported by retrieved documents or hallucinated), completeness (does it address all parts of the question), and conciseness (no unnecessary information).
Use LLM-as-a-judge evaluation for subjective quality. GPT-4o evaluates whether generated answers are accurate, relevant, and well-grounded by comparing them to retrieved context and reference answers. This correlates well with human judgment (0.85+ agreement) and scales to thousands of evaluations.
Azure AI Studio provides built-in RAG evaluation with metrics for groundedness, relevance, and coherence. LangSmith offers evaluation datasets, metrics tracking, and A/B testing for comparing RAG configurations. Implement continuous evaluation in your CI/CD pipeline—run your golden dataset on every RAG system change.
Performance Comparisons and ROI
Organizations implementing advanced RAG patterns report significant improvements over basic vector search. Microsoft's internal study showed GraphRAG achieving 73% accuracy on multi-hop questions versus 41% for standard RAG—a 78% relative improvement. Hybrid search typically adds 15-25% precision improvement over pure semantic search, with larger gains in specialized domains.
Re-ranking improves top-3 relevance by 20-30% while adding 100-300ms latency per query. For most enterprise use cases, this trade-off is favorable—users strongly prefer slower, accurate results over fast, irrelevant results. Implement caching and async processing to mitigate latency impact.
The business impact is substantial. Customer support organizations report 40-50% reduction in escalations when agents use advanced RAG systems versus basic keyword search. Legal and compliance teams reduce research time by 60-70% with GraphRAG-enhanced case law and regulation retrieval. Engineering teams resolve bugs 30% faster with hybrid code search combining semantic and syntax-aware retrieval.
Cost increases are manageable. GraphRAG adds upfront entity extraction cost (approximately $50-200 per 1M tokens) and graph database hosting ($100-500/month for mid-size deployments). Hybrid search requires maintaining both vector and keyword indexes (approximately 2x storage). Re-ranking adds 2-5 cents per query for Cohere API or hosting costs for self-deployed models. For high-value use cases, ROI is overwhelmingly positive.
Implementation Roadmap for Production RAG
Start with baseline RAG using Azure AI Search hybrid search with semantic ranking enabled. Implement proper chunking (semantic boundaries, not fixed tokens), rich metadata filtering, and basic evaluation. Get this into production quickly—basic RAG solves 70% of use cases and provides infrastructure for advanced patterns.
Add re-ranking next. Implement two-stage retrieval with cross-encoder re-ranking (Cohere API or self-hosted BGE). Measure precision improvements on your golden dataset. This is high-impact and relatively low-complexity—typically 2-3 weeks to production.
Introduce query transformation for complex questions. Implement decomposition for multi-part questions and HyDE for ambiguous queries. Use query classification to apply transformations selectively. This addresses the long tail of complex questions that basic retrieval handles poorly.
For organizations with relationship-heavy knowledge (legal precedent, technical dependencies, organizational structures), implement GraphRAG. This is a 2-3 month effort requiring entity schema design, extraction pipeline development, graph database deployment, and query planning logic. Prioritize use cases where multi-hop reasoning delivers clear business value.
Implement continuous evaluation and monitoring throughout. Track retrieval quality metrics, user satisfaction signals (thumbs up/down, query refinement patterns), and business outcomes (resolution time, escalation rates). Use this data to prioritize improvements and validate that advanced patterns deliver ROI.
At Gennoor Tech, our enterprise AI training programs guide organizations through this roadmap, from initial RAG deployment to advanced patterns like GraphRAG and multi-agent orchestration. We provide hands-on implementation support, architecture reviews, and team upskilling to ensure your RAG systems deliver production-grade quality and measurable business impact. Explore more advanced AI patterns on our blog, where we cover the latest in enterprise AI implementation.
Jalal Ahmed Khan
Microsoft Certified Trainer (MCT) · Founder, Gennoor Tech
14+ years in enterprise AI and cloud technologies. Delivered AI transformation programs for Fortune 500 companies across 6 countries including Boeing, Aramco, HDFC Bank, and Siemens. Holds 16 active Microsoft certifications including Azure AI Engineer and Power BI Analyst.