Building a 6x faster GraphRAG: typed graphs, FAISS, and two-phase retrieval
I wanted a system that could intelligently find the perfect group of people for a given project, pulling from GitHub activity, Slack conversations, and other signals to assemble a team that actually fits. Microsoft’s GraphRAG was the obvious place to start, but once I tried indexing real repos, the problems became clear: indexing took minutes, the pipeline assumed unstructured text, and nearest-neighbor search alone couldn’t follow the relationships between PRs, users, and issues that actually determine who’s the right fit.
So I built a custom GraphRAG that indexes the same data about 6x faster, uses a typed graph model with pluggable ingestors and FAISS, and does two-phase retrieval (semantic search first, then graph traversal along typed edges) to match people to projects based on real contribution history.
The problem with MS GraphRAG for structured data
MS GraphRAG does a lot of things well. It builds community-based summaries from documents and supports global, local, and drift search modes. But it’s designed for unstructured text: research papers, articles, documents. In practice, a lot of the data teams actually work with is already structured. GitHub PRs have authors, issues have assignees, Jira tickets have reporters, Slack threads have participants. APIs return these relationships explicitly, so there’s no need to spend LLM calls extracting what’s already there.
Three things pushed me to start over:
Indexing was slow.
MS GraphRAG chunks documents, extracts entities with an LLM, runs community detection, then summarizes communities with another LLM pass. On a single repo with ~1 day of activity, this took ~180 seconds. Most of that time was spent on LLM calls that I didn’t need — when the data source already tells you that user:alice opened pull_request:#15, there’s nothing for an LLM to extract.
Adding data sources was painful.
Everything has to be flattened into text files in an input directory. But structured APIs already return typed objects — users, PRs, channels — and destroying that structure just to reconstruct it with an LLM is wasteful. I wanted to plug in GitHub, Slack, or any future source through a shared typed interface, not reformat everything into documents.
KNN can’t answer relationship questions.
“Find engineers with overlapping contributions to both the auth and payments modules” requires traversing from topics to PRs to authors across two different subgraphs and intersecting the results. KNN can find documents about auth or payments, but it can’t tell you which people bridge both — that answer lives in the graph structure, not in any single document’s embedding.
Architecture
The system has two paths (query and ingest) that converge on separate index stores. Both mechanisms coexist behind a single API, and a Mechanism enum controls routing:
(Diagram: the query path and the ingest path, each ending in its own index store.)
Both mechanisms share the same pluggable ingestors and DataStore. The difference is what happens after. MS GraphRAG converts to text and runs multiple LLM passes, while Custom GraphRAG builds a typed graph directly with no LLM extraction.
```python
class Mechanism(str, Enum):
    MS_GRAPHRAG = "ms_graphrag"
    CUSTOM_GRAPHRAG = "custom_graphrag"
```

This makes it easy to run both on the same data and compare results, or switch over once you’re comfortable with the custom pipeline.
Typed graph data model
The main thing that makes the custom pipeline faster: I don’t use an LLM to extract entities. A GitHub PR is already an entity. A user is already an entity. The “opened_pr” relationship between them is right there in the API response.
The type system is a pair of enums:
```python
class EntityType(Enum):
    USER = "user"
    DOCUMENT = "document"
    SLACK_CHANNEL = "slack_channel"
    GITHUB_REPO = "github_repo"
    PULL_REQUEST = "pull_request"
    ISSUE = "issue"

class RelationshipType(Enum):
    CONTRIBUTES_TO = "contributes_to"
    MESSAGES_IN = "messages_in"
    OPENED_PR = "opened_pr"
    MERGED_PR = "merged_pr"
    OPENED_ISSUE = "opened_issue"
    BELONGS_TO = "belongs_to"
```

Entities and relationships are dataclasses with typed fields and a property bag for anything source-specific:
```python
@dataclass
class Entity:
    type: EntityType
    name: str
    properties: Dict[str, Any]

@dataclass
class Relationship:
    source_type: EntityType
    source_name: str
    target_type: EntityType
    target_name: str
    relationship_type: RelationshipType
    properties: Dict[str, Any]
```

Want to add Jira tickets? Add a JIRA_TICKET enum variant and write an ingestor. The graph, FAISS index, and query layer don’t change.
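To make that extension path concrete, here is a minimal sketch of what adding Jira might look like. The JIRA_TICKET variant, the OPENED_TICKET edge, and the ticket values are hypothetical, not from the repo:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict

class EntityType(Enum):
    USER = "user"
    JIRA_TICKET = "jira_ticket"  # Hypothetical new variant

class RelationshipType(Enum):
    OPENED_TICKET = "opened_ticket"  # Hypothetical new edge type

@dataclass
class Entity:
    type: EntityType
    name: str
    properties: Dict[str, Any] = field(default_factory=dict)

# A new ingestor only has to emit typed objects; nothing downstream changes.
ticket = Entity(EntityType.JIRA_TICKET, "PROJ-123",
                {"summary": "Fix OAuth refresh"})
```

The enum variant is the entire schema change; everything after ingestion treats the new type like any other node.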
Pluggable ingestors
Every data source implements one method:
```python
class BaseIngestor(ABC):
    @abstractmethod
    def ingest_impl(self) -> Dict[str, List]:
        """Return {'entities': List[Entity], 'relationships': List[Relationship]}."""
        ...
```

The GithubRepoIngestor pulls PRs, issues, and commits from the GitHub API and returns typed entities with full metadata (PR diffs, issue bodies, commit messages), all of which go into the node’s embedding later:
```python
class GithubRepoIngestor(BaseIngestor):
    def __init__(self, repo_url: str, github_token: str = None, days_back: int = 1):
        self.repo_url = repo_url
        self.github_api = GitHubAPI(token=github_token)
        self.days_back = days_back

    def ingest_impl(self) -> Dict[str, List]:
        entities_dict: Dict[str, Entity] = {}
        relationships: List[Relationship] = []
        # Fetch PRs, issues, commits from GitHub API
        # Create typed entities and relationships
        # ...
        return {"entities": list(entities_dict.values()),
                "relationships": relationships}
```

Because ingestors produce typed Entity and Relationship objects directly from API responses — not raw text — the entire LLM entity-extraction step, community detection, and community summarization become unnecessary. The type system carries the structure that MS GraphRAG has to infer. That’s where the 6x speedup comes from: typed ingestion eliminates three of the six pipeline stages.
Here’s the full data flow from source to index:
Knowledge graph and FAISS index
The graph sits on top of NetworkX’s MultiDiGraph, which gives us typed, directed multi-edges, so a user can have both an opened_pr and a merged_pr edge to the same PR:
```python
class KnowledgeGraph:
    def __init__(self, model: Optional[Model] = None):
        self.graph = nx.MultiDiGraph()
        self.model = model or Model("text-embedding-3-small")
        self._node_embeddings: Dict[str, np.ndarray] = {}
```

Node IDs are {type}:{name} strings (e.g., user:devalparikh, pull_request:org/repo#42). Human-readable, no collisions across types.
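The ID scheme is simple enough to sketch in one line; the helper name here is my own, and the repo may well inline it:

```python
def node_id(entity_type: str, name: str) -> str:
    # "{type}:{name}" keeps IDs human-readable and collision-free across types.
    return f"{entity_type}:{name}"
```

So node_id("user", "devalparikh") yields "user:devalparikh", and a PR and a user with the same name can never collide.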
The graph supports incremental merging with SHA-256 fingerprint deduplication. When new data comes in, we don’t rebuild. We upsert nodes, skip duplicate edges, and only regenerate embeddings for nodes that actually changed:
```python
def merge_datastore(self, datastore: DataStore) -> dict:
    merge_stats = {"nodes_added": 0, "nodes_updated": 0,
                   "edges_added": 0, "edges_skipped_duplicate": 0}
    for entity in datastore.get_all_entities():
        existed = self.graph.has_node(f"{entity.type.value}:{entity.name}")
        if self.upsert_node(entity):  # Fingerprint-based dedup
            merge_stats["nodes_updated" if existed else "nodes_added"] += 1
    existing_sigs = self.collect_existing_edge_signatures()
    for relationship in datastore.get_all_relationships():
        if self.add_edge_deduped(relationship, existing_sigs):
            merge_stats["edges_added"] += 1
        else:
            merge_stats["edges_skipped_duplicate"] += 1
    return merge_stats
```

After building the graph, we embed every node and drop the vectors into a FAISS index:
```python
class FAISSIndexer:
    def index_knowledge_graph(self, knowledge_graph: KnowledgeGraph):
        embeddings = knowledge_graph.get_embeddings()
        self.node_ids = list(embeddings.keys())
        self.node_types = [
            knowledge_graph.graph.nodes[nid].get("type", "unknown")
            for nid in self.node_ids
        ]
        vectors = np.array([embeddings[nid] for nid in self.node_ids])
        self.index = faiss.IndexFlatL2(vectors.shape[1])
        self.index.add(vectors.astype(np.float32))
```

One decision worth calling out: all node types go into the same vector space. A PR’s embedding includes its title, diff, and metadata. A user’s embedding has their username. So searching for “auth flow refactoring” can land on a PR node directly, and then graph traversal finds the connected users. We tried per-type indexes early on and this single-space approach worked better in practice.
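Since IndexFlatL2 is exact brute-force L2 search, the search phase over the single shared space is equivalent to this NumPy sketch (toy 3-dimensional vectors standing in for real embeddings):

```python
import numpy as np

# Toy embeddings for nodes of mixed types, all in one shared space.
node_ids = ["pull_request:org/repo#15", "user:alice", "issue:org/repo#7"]
vectors = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.9, 0.1, 0.0]], dtype=np.float32)

def search(query_vec, k=2):
    # Exact squared-L2 distances, the same semantics as faiss.IndexFlatL2.
    dists = np.sum((vectors - query_vec) ** 2, axis=1)
    order = np.argsort(dists)[:k]
    return [(node_ids[i], float(dists[i])) for i in order]

# A query close to the PR's embedding lands on the PR node directly.
seeds = search(np.array([1.0, 0.0, 0.0], dtype=np.float32))
```

Whatever node types come back become the seed set; the traversal phase takes it from there.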
Two-phase retrieval: search, then traverse
This is the part I’m most interested in. Standard RAG returns the K nearest neighbors and calls it a day. That works for “find me documents about authentication” but falls apart for “who should I hire to lead a payments refactor?” or “assemble a team that has experience across our API layer, database migrations, and frontend.” Those queries require following relationships, from topics to PRs to authors to the other systems they’ve touched, not just ranking documents by similarity. This is where an agentic AI system adds real value: it can intelligently decide which edges to traverse, how deep to go, and how to synthesize results across multiple hops in the graph.
We do two phases, loosely inspired by Edge et al., 2024:
Phase 1: Semantic search
Embed the query, search across all node types in FAISS. This gives us seed nodes, the entry points into the graph that are closest to the query semantically.
Phase 2: Graph traversal
From those seeds, BFS along typed edges to find entities of the target type. If someone asks “who worked on auth flow?”, Phase 1 finds PRs about auth flow; Phase 2 follows opened_pr and merged_pr edges to find the people.
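A toy version of that phase, over a plain adjacency map rather than the real NetworkX graph (node names and the helper are illustrative, not from the repo):

```python
from collections import deque

def traverse(adj, seeds, target_type, link_type=None, max_hops=2):
    """adj maps a node ID to [(neighbor ID, relationship type), ...]."""
    found, seen = [], set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for neighbor, rel in adj.get(node, []):
            if link_type and rel != link_type:
                continue  # Type-aware: only follow the requested edge type
            if neighbor in seen:
                continue
            seen.add(neighbor)
            if neighbor.split(":", 1)[0] == target_type:
                found.append(neighbor)
            queue.append((neighbor, depth + 1))
    return found

# Phase 1 seeded us with a PR about auth flow; follow merged_pr edges to people.
adj = {"pull_request:org/repo#15": [("user:alice", "opened_pr"),
                                    ("user:bob", "merged_pr")]}
people = traverse(adj, ["pull_request:org/repo#15"], "user", link_type="merged_pr")
```

With link_type set to merged_pr, alice’s opened_pr edge is skipped and only the merger comes back.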
The traversal is type-aware and can filter by link type. Before any of this runs, an LLM call decomposes the natural language query into structured parameters:
```python
@dataclass
class QueryIntent:
    search_query: str                         # Optimized for embedding search
    target_entity_type: Optional[EntityType]  # What to return (user, PR, issue...)
    link_type: Optional[RelationshipType]     # Filter traversal edges
    max_hops: int                             # Graph distance limit
    top_k: int                                # Number of results
```

“Who merged PRs about auth flow?” becomes target_entity_type=USER, link_type=MERGED_PR, search_query="auth flow". Phase 1 finds the relevant PRs, Phase 2 only follows merged_pr edges. The distinction between “who opened” and “who merged” falls out naturally from the type system.
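For that example query, the decomposition step boils down to parsing a small JSON object; a sketch of the shape (the max_hops and top_k values here are illustrative defaults, not the repo’s):

```python
import json

# What the query-understanding model might return for
# "Who merged PRs about auth flow?"
raw = '''{"search_query": "auth flow", "target_entity_type": "user",
          "link_type": "merged_pr", "max_hops": 2, "top_k": 5}'''

intent = json.loads(raw)
# Downstream, the string values are mapped onto the
# EntityType / RelationshipType enums before traversal.
```

If the JSON fails to parse or names an unknown enum value, falling back to a plain semantic search with no traversal filter is a reasonable default.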
Query understanding
The query understanding step is a single gpt-4o-mini call. The prompt describes the graph’s type system and explains how the two-phase retrieval works:
```python
QUERY_INTENT_SYSTEM_PROMPT = """You are a query-understanding assistant for a
graph-based retrieval system.

The knowledge graph has:
- Node types: user, document, slack_channel, github_repo, pull_request, issue.
- Relationship types: contributes_to, messages_in, opened_pr, merged_pr,
  opened_issue, belongs_to.

Retrieval works in two phases:
1. We embed a "search query" and do nearest-neighbor search over ALL node types.
2. From those seed nodes we traverse the graph to collect nodes of a
   "target type", optionally only along a given "link type", up to max_hops.

Output valid JSON with keys: search_query, target_entity_type, link_type,
max_hops, top_k."""
```

The model doesn’t need to know about FAISS or BFS. It just maps natural language to the right enum values. This keeps the LLM’s job simple and the latency under 500ms.
Semantic explanations
After retrieval, we can optionally generate per-result explanations of why each node was returned. For each result, we send the query, the target node summary, the seed node it connected through, and the relationship type to a lightweight LLM:
```python
def _build_semantic_explanation(self, model, query_text, result):
    prompt_payload = {
        "query": query_text,
        "target_node": summarize(result["node_data"]),
        "source_node": summarize(result["connected_via_node_data"]),
        "relationship_type": result["relationship_type"],
        "distance": result["distance"],
    }
    # Returns: {"role_description": "...", "why_good_fit": "..."}
```

The output is something like: “Alice opened PR #15 which refactors the OAuth token refresh flow.” These are generated in parallel, controlled by a per-request flag and a server-side toggle, so they don’t add latency unless you want them.
Where the 6x comes from
Here’s the breakdown on equivalent workloads:
| Stage | MS GraphRAG | Custom GraphRAG | Difference |
|---|---|---|---|
| Entity extraction | LLM-based NER over chunks (~60% of time) | Schema-based from API responses | No LLM calls needed |
| Graph construction | Build from extracted entities | Direct from typed dataclasses | No intermediate format |
| Community detection | Leiden algorithm | Not needed | Typed edges are the structure |
| Embedding | Embed community summaries | Embed node properties directly | One pass, no summarization |
| Serialization | Parquet artifacts | FAISS index + GraphML + NumPy | Fewer, simpler files |
| Total (small test) | ~180s | ~30s | ~6x faster |
Here’s a visual comparison of what each pipeline actually does:
The red stages are the ones we skip entirely. They exist to solve the problem of unstructured text: extracting entities that aren’t labeled, detecting communities that aren’t predefined, summarizing those communities for retrieval. Because our ingestors produce typed Entity and Relationship objects directly from API responses, those entities are already labeled, the relationships are already explicit, and the graph structure itself replaces community detection. Typed ingestion makes the entire middle of the MS pipeline redundant.
Note: This isn’t a universal win. If your data is actually unstructured (research papers, internal docs, support tickets as free text), MS GraphRAG’s extraction pipeline is doing real work. The speedup only applies when you have structured sources where the type system is already defined.
API and mechanism routing
Both mechanisms share a single FastAPI endpoint with SSE streaming:
```python
@router.post("/query")
async def query_graphrag(request: QueryRequest):
    return StreamingResponse(
        generate_sse_stream_unified(
            mechanism=request.mechanism,  # ms_graphrag | custom_graphrag
            query_text=request.query,
            ...
        ),
        media_type="text/event-stream",
    )
```

The stream emits start, progress, chunk, done, and error events. You can build both indexes in a single ingestion, or just one:
```python
class IngestMechanismMode(str, Enum):
    BOTH = "both"
    MS_GRAPHRAG = "ms_graphrag"
    CUSTOM_GRAPHRAG = "custom_graphrag"

    def to_mechanisms(self) -> tuple[Mechanism, ...]:
        if self == IngestMechanismMode.BOTH:
            return (Mechanism.MS_GRAPHRAG, Mechanism.CUSTOM_GRAPHRAG)
        ...
```

What I’d do differently
Start with the type system.
I spent time trying to make the graph generic before realizing that the types are the whole point. The EntityType and RelationshipType enums took five minutes to define and they drove every other design decision. If your data has structure, model it explicitly.
KNN is not enough for relationship queries.
This was obvious in retrospect, but I initially tried to make pure vector search work for questions like “which engineers bridge the auth and payments modules?” It can’t. That answer requires following edges from topics to PRs to authors across different parts of the graph, then intersecting the results. Two-phase retrieval handles both “find me X” and “who bridges X and Y” with one index.
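The bridging query reduces to a set intersection once each topic’s traversal has run; a toy illustration with hypothetical users:

```python
# Results of two independent search-then-traverse runs, one per topic subgraph.
auth_contributors = {"user:alice", "user:bob"}
payments_contributors = {"user:alice", "user:carol"}

# Engineers who bridge both modules: the intersection of the two traversals.
bridges = auth_contributors & payments_contributors
```

No single document embedding encodes this answer; it only exists across the two subgraphs.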
Single vector space, all types.
I tried separate FAISS indexes per entity type early on. It was more complex and performed worse. Putting everything in one space means the search phase can land on any node type, and the graph traversal phase filters from there.
Keep the layers independent.
The BaseIngestor to DataStore to KnowledgeGraph to FAISSIndexer pipeline has clear boundaries. Adding a Slack ingestor didn’t touch the graph code. Each layer serializes independently. This made iteration fast.
What’s next
- Replacing IndexFlatL2 with IVF or HNSW indexes for larger graphs. The flat index works fine up to maybe 100k nodes, but won't scale past that.
- Streaming ingestion that updates the FAISS index as new PRs land, without reindexing everything.
- Cross-tenant graph queries, where a search can traverse edges across different organizations' graphs.
Sub-agent swarm traversal
The most interesting direction is replacing the single BFS traversal with a swarm of sub-agents that explore the graph in parallel. Today, Phase 2 does a single breadth-first walk from the seed nodes. That works, but it’s sequential and treats every edge the same. A swarm-based approach would change both of those things.
The idea: after Phase 1 returns N seed nodes, spin up N sub-agents, one per seed (or per cluster of nearby seeds). Each sub-agent independently traverses its local subgraph, deciding which edges are worth following based on the query intent. A sub-agent exploring a seed PR about payment processing might prioritize reviewed_pr and merged_pr edges to find experienced reviewers, while ignoring opened_issue edges that lead to unrelated bug reports. Each sub-agent returns its local findings: ranked candidate nodes with the traversal path that led to them.
A coordinator agent then merges the results across all sub-agents: deduplicating nodes that were discovered by multiple paths, boosting candidates that were surfaced independently from different parts of the graph (a strong signal of relevance), and synthesizing cross-subgraph insights that no single traversal would find on its own. For example, one sub-agent might discover that Alice authored several payment PRs, while a separate sub-agent starting from a different seed finds that Alice also reviewed database migration PRs. Together, that paints a fuller picture of her expertise than either traversal alone.
This buys you three things: latency stays flat as the graph grows because traversals run concurrently, each sub-agent can adapt its depth and strategy to its local topology rather than applying a uniform BFS, and the merged results are richer because different seed nodes surface different perspectives on the same query.
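A rough shape for that coordinator/sub-agent split, with threads standing in for agents and a vote count standing in for the independent-discovery boost (all names and subgraphs hypothetical):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical local findings an agentic traversal might return per seed.
LOCAL_SUBGRAPHS = {
    "pull_request:pay#3": ["user:alice", "user:bob"],
    "pull_request:pay#9": ["user:alice"],
    "issue:db#7": ["user:carol"],
}

def explore(seed):
    # Stand-in for a sub-agent adaptively walking its local subgraph.
    return LOCAL_SUBGRAPHS.get(seed, [])

def swarm_retrieve(seeds):
    # Sub-agents run concurrently; latency tracks the slowest traversal,
    # not the sum of all of them.
    with ThreadPoolExecutor() as pool:
        local_results = list(pool.map(explore, seeds))
    # Coordinator: candidates surfaced independently by multiple seeds rank higher.
    votes = Counter(c for found in local_results for c in found)
    return [node for node, _ in votes.most_common()]

team = swarm_retrieve(list(LOCAL_SUBGRAPHS))
```

In the real version, explore would be an LLM-guided traversal and the coordinator would also merge traversal paths, but the concurrency and vote-merging skeleton is the same.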
The code is at github.com/devalparikh/GraphRag.
References
- Darren Edge, Ha Trinh, Newman Cheng, et al. “From Local to Global: A Graph RAG Approach to Query-Focused Summarization.” arXiv:2404.16130, 2024.
- Jeff Johnson, Matthijs Douze, Hervé Jégou. “Billion-scale similarity search with GPUs.” IEEE Transactions on Big Data, 2019. (FAISS)
- Aric A. Hagberg, Daniel A. Schult, Pieter J. Swart. “Exploring Network Structure, Dynamics, and Function using NetworkX.” Proceedings of the 7th Python in Science Conference, 2008.