In the previous post, we discussed Microsoft’s Graph RAG implementation. In this post, we will take a look at token consumption to query the knowledge graph, both for local and global queries.
Note: this test was performed with gpt-4o. A few days after this blog post was written, OpenAI released gpt-4o-mini. Initial tests with gpt-4o-mini show that index creation and querying work well at a significantly lower cost. You can replace gpt-4o with gpt-4o-mini in the setup below.
Setting up Langfuse logging
To make it easy to see the calls to the LLM, I used the following components:
- LiteLLM: configured as a proxy; we configure Graph RAG to use this proxy instead of talking to OpenAI or Azure OpenAI directly; see https://www.litellm.ai/
- Langfuse: an LLM engineering platform that can be used to trace LLM calls; see https://langfuse.com/
To set up LiteLLM, follow the instructions here: https://docs.litellm.ai/docs/proxy/quick_start. I created the following config.yaml for use with LiteLLM:
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
  - model_name: text-embedding-3-small
    litellm_params:
      model: text-embedding-3-small
litellm_settings:
  success_callback: ["langfuse"]
Before starting the proxy, set the following environment variables:
export OPENAI_API_KEY=my-api-key
export LANGFUSE_PUBLIC_KEY="pk_kk"
export LANGFUSE_SECRET_KEY="sk_ss"
You can obtain the values from both the OpenAI and Langfuse portals. Ensure you also install Langfuse with pip install langfuse.
Next, we can start the proxy with litellm --config config.yaml --debug.
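If you want to verify that the proxy works before wiring up Graph RAG, a quick test with the OpenAI Python SDK pointed at the proxy looks something like the sketch below. This is an illustration, assuming no LiteLLM master key was configured (the api_key value is then a dummy) and that the model name matches a model_name from config.yaml:

# Quick sanity check of the LiteLLM proxy before wiring up Graph RAG.
# Assumes the proxy from the config above is running on http://localhost:4000
# and that no LiteLLM master key was configured (the api_key below is a dummy).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",  # the LiteLLM proxy instead of api.openai.com
    api_key="dummy",                   # the proxy forwards its own OPENAI_API_KEY upstream
)

response = client.chat.completions.create(
    model="gpt-4o",  # must match a model_name from LiteLLM's config.yaml
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)

If the Langfuse callback is configured correctly, this call should also show up as a trace in the Langfuse UI.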
To make Graph RAG work with the proxy, open Graph RAG’s settings.yaml and set the following value under the llm settings:
api_base: http://localhost:4000
LiteLLM is listening for incoming OpenAI requests on that port.
Running a local query
A local query first creates an embedding of your question and runs a similarity search against the embeddings that were stored in LanceDB during indexing. The results of that similarity search serve as entry points into the knowledge graph.
That is also why you need to add the embedding model to LiteLLM's config.yaml; global queries do not require this setting.
Once the similar entities have been found in LanceDB, they are combined with related entities from the graph and placed in a prompt that answers your original question.
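Conceptually, the retrieval part of a local query looks something like the sketch below. This is an illustration, not Graph RAG's actual code; the LanceDB location and table name are assumptions that depend on your index output and vector store settings.

# Conceptual sketch of the retrieval step behind a local query (illustration only,
# not Graph RAG's actual implementation). The LanceDB path and table name are
# assumptions; check your own index output and vector store settings.
import lancedb
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="dummy")
question = "Who is Winston Smith?"

# 1. Embed the question with the same embedding model used during indexing.
vector = client.embeddings.create(
    model="text-embedding-3-small",
    input=question,
).data[0].embedding

# 2. Run a similarity search over the entity embeddings stored in LanceDB.
db = lancedb.connect("output/lancedb")                  # assumed location
table = db.open_table("entity_description_embeddings")  # assumed table name
entry_points = table.search(vector).limit(5).to_list()

# 3. Graph RAG uses these entities as entry points and adds related entities,
#    relationships and text units to a single prompt that answers the question.
print(entry_points)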
A local query can be handled with a single LLM call. Let’s look at the trace:

The query took about 10 seconds and consumed around 11,500 tokens. The system prompt starts as follows:

The actual data it works with (called data tables) is listed further down in the prompt. A few data points are shown below:


The prompt also contains sources from the book where the entities are mentioned. For example:

The response to this prompt is something like the response below:

The response contains references to both the entities and sources with their ids.
Note that you can influence the number of entities retrieved and the number of consumed tokens. In Graph RAG's settings.yaml, I modified the local search settings as follows:
local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  top_k_mapped_entities: 5
  top_k_relationships: 5
  max_tokens: 6000
The trace results are clear: both token consumption and latency are lower.

Of course, there will be a bit less detail in the answer. You will have to experiment with these values to see what works best in your scenario.
Global Queries
Global queries are great for broad questions about your dataset. For example: “What are the top themes in 1984?”. A global query is not a single LLM call and is more expensive than a local query.
Let’s take a look at the traces for a global query. Every trace is an LLM call to answer the global query:

The last one in the list is where it starts:

As you can probably tell, the response to the call above is not returned directly to the user. The system prompt does not contain entities from the graph but community reports, which are created during indexing: communities are first detected with the Leiden algorithm and then summarized. A dataset can contain many communities and summaries.
In this first trace, the LLM is asked to answer the question "What are the top themes in 1984?" against a first batch of community reports, producing intermediate answers. These intermediate answers are collected until a final call answers the question based on all of them. It is entirely possible that community reports that are not relevant to the query get used along the way.
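In other words, a global query follows a map-reduce pattern. The sketch below is a strong simplification of that flow, not Graph RAG's actual code; the report batches and prompts are placeholders for illustration.

# Rough sketch of the map-reduce pattern behind a global query (a simplification,
# not Graph RAG's actual code; the report batches and prompts are placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="dummy")
question = "What are the top themes in 1984?"

def ask(context: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

# Map step: each batch of community reports yields an intermediate answer.
report_batches = ["<community report batch 1>", "<community report batch 2>"]  # placeholders
intermediate_answers = [ask(batch) for batch in report_batches]

# Reduce step: one final call combines all intermediate answers into the response
# the user actually sees.
final_answer = ask("\n\n".join(intermediate_answers))
print(final_answer)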
Here is that last call:

I am not showing the whole prompt here. Above, you see the data that is fed to the final prompt: the intermediate answers from the community reports. This then results in the final answer:

Below is the list with all calls again:

In total, and based on the default settings, 12 LLM calls were made, consuming around 150K tokens. The total latency cannot be calculated from this list because the calls are made in parallel. The total cost was around 80 cents.
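As a back-of-the-envelope check, assuming gpt-4o's list prices at the time (5 dollars per million input tokens, 15 dollars per million output tokens) and an assumed split where most of the roughly 150K tokens are prompt tokens, the numbers land in the same ballpark:

# Back-of-the-envelope cost check. Prices are gpt-4o list prices at the time
# (USD 5 per 1M input tokens, USD 15 per 1M output tokens); the input/output
# split is a rough assumption for illustration.
INPUT_PRICE_PER_TOKEN = 5 / 1_000_000
OUTPUT_PRICE_PER_TOKEN = 15 / 1_000_000

input_tokens = 140_000   # assumed: most of the ~150K tokens are prompt tokens
output_tokens = 10_000   # assumed: intermediate and final answers

cost = input_tokens * INPUT_PRICE_PER_TOKEN + output_tokens * OUTPUT_PRICE_PER_TOKEN
print(f"Estimated cost: ${cost:.2f}")  # roughly $0.85, close to the 80 cents from the trace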
The number of calls and token cost can be reduced by tweaking the default parameters in settings.yaml. For example, I made the following changes:
global_search:
  max_tokens: 6000 # was 12000
  data_max_tokens: 500 # was 1000
  map_max_tokens: 500 # was 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
However, this resulted in more calls and around 140K tokens, so not a big reduction. I tried even lower values, but that led to Python errors and many more LLM calls due to retries. I would need to dig into this further to explain why it happens.
Conclusion
From the above, it is clear that local queries are less intensive and less costly than global queries. By tweaking the local query settings, you can get pretty close to the cost of baseline RAG, where you return 3-5 chunks of text of about 500 tokens each. Latency is pretty good as well. Of course, depending on your data, it is not guaranteed that local search responses will be better than baseline RAG.
Global queries are more costly but do allow you to ask broad questions about your dataset. I would not use global queries for every turn in a chat assistant scenario. However, you could start with a global query and then handle follow-up questions with a local query or baseline RAG.