High Token Usage with ThinkingTools – Guidance Needed for Agentic RAG with Custom Tooling

Hey Agno team,

I’m running into unexpectedly high token usage when enabling ThinkingTools, even on relatively simple agent runs. For example, I have an RFQ agent that looks up the relevant spreadsheets and calculates the correct quote.

I’m trying to design an agentic RAG loop, where the agent:

  1. Makes a custom retrieval tool call: fetch_file_chunks(query="JFK to LAX")
  2. Reflects on whether it has enough information: ThinkingTool
  3. If not, calls the tool again: fetch_file_chunks(query="fuel surcharges for JFK to LAX")
  4. Repeats until it has all the info it needs, then calculates the quote
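In plain Python, the loop I’m after looks roughly like this (a sketch with a stubbed retrieval function; the `has_enough_info` check and the sample queries are hypothetical stand-ins for the ThinkingTools reflection step):

```python
def fetch_file_chunks(query: str) -> str:
    """Stub standing in for the real hybrid-search retrieval tool."""
    corpus = {
        "JFK to LAX": "base rate: $2.10/kg",
        "fuel surcharges for JFK to LAX": "fuel surcharge: 12%",
    }
    return corpus.get(query, "")


def has_enough_info(context: list[str]) -> bool:
    """Hypothetical reflection step -- the real agent would use ThinkingTools here."""
    return any("surcharge" in chunk for chunk in context)


def rag_loop(queries: list[str]) -> list[str]:
    """Retrieve, reflect, and retry until enough context is gathered."""
    context: list[str] = []
    for query in queries:
        context.append(fetch_file_chunks(query))  # steps 1/3: retrieval call
        if has_enough_info(context):              # step 2: reflect
            break                                 # step 4: ready to quote
    return context
```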

To do this, I’m using a custom fetch_file_chunks tool instead of Agno’s built-in knowledge base, because we store and embed our data in a Postgres table with a pgvector vector_embedding column. So far this works — but once ThinkingTools is enabled, the token usage jumps massively and unpredictably.

A few questions:

  • Is ThinkingTools appending too much intermediate context each step (e.g. past tool call results)?
  • What’s the best practice to implement this type of loop without blowing the token budget?
  • Would love any advice for making this kind of agent work more efficiently using Agno.
  • Any plans on supporting custom RAG pipelines? I’d love to take advantage of Agno’s hybrid search capabilities, but I’m blocked because my pgvector column in Supabase doesn’t use the column name Agno hardcodes

Appreciate the help — this is such a powerful framework and I’m excited to push it further!

Agent Setup (simplified version):

from textwrap import dedent

from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.tools import tool
from agno.tools.thinking import ThinkingTools


@tool()
def fetch_chunks(query: str) -> str:
    """Keyword/hybrid search over the company's source file chunks."""
    # redacted -- it's a basic hybrid search RPC function on Supabase
    return result


def main():
    reasoning_agent = Agent(
        model=OpenAIChat(id="gpt-4o"),
        tools=[
            fetch_chunks,
            # uncommenting this line makes the token usage jump
            # ThinkingTools(add_instructions=True),
        ],
        instructions=dedent(
            """\
            You are a professional specializing in RFQs.
            Your task is to draft a concise, context-aware, and friendly professional response to the latest inbound email(s).

            You will be given the following information:
            1. System Prompt: This is the company's system prompt. Use this to understand the company's policies and procedures.
            2. Email Details: This is the latest inbound email(s) and any attachments.
            3. Email Thread History: This is the history of the email thread. Use this to understand the context of the conversation.
            
            The fetch_chunks tool can perform keyword/hybrid search on the source file chunks for you. The source file chunks are the company's knowledge base about its different rates and services.
            Only use information from this source when calculating rates for quotes or answering questions. Do not make up information; if unsure, make your best guess based on the source file chunks and include your assumption in the final quote email.
            When calling the fetch_chunks tool, provide a query that is relevant to the email details and email thread history.

            Here are the examples of the tool calls:
            - Tool call: fetch_chunks(query="FRA")
            - Tool call: fetch_chunks(query="HKG")
            

            After each tool call, read the result and see if you have enough information to calculate the rate. If not, you should make another tool call with a different query.
            \
        """
        ),
        add_datetime_to_instructions=True,
        stream_intermediate_steps=True,
        show_tool_calls=True,
        markdown=True,
        monitoring=True,
    )

    prompt = """

=== Email Details ===

Subject: RFQ: Shipping Quote Request for Container from HKG to FRA
From: test@example.com
Body: Hello,

I need a quote for shipping 200kg worth of goods from HKG to FRA.
Please provide your best rates and transit time.

Best regards,


    """
    reasoning_agent.print_response(prompt, stream=True)

if __name__ == "__main__":
    main()

Postgres table w/ pgvector setup:

create table public.source_file_chunks (
  id uuid not null default extensions.uuid_generate_v4 (),
  source_file_id uuid not null,
  content text not null, -- raw text content
  vector_embedding public.vector(1536) null -- 1536-dimension pgvector
) TABLESPACE pg_default;
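For context, the redacted fetch_chunks body builds roughly this kind of query: pgvector cosine distance (`<=>`) combined with a simple keyword filter over content. A minimal sketch (table and column names match the schema above; the parameter names, ILIKE filter, and ranking are assumptions, not the actual RPC):

```python
def build_hybrid_search_sql(limit: int = 5) -> str:
    """Parameterized hybrid-search query: vector similarity via pgvector's
    cosine-distance operator (<=>) plus an illustrative ILIKE keyword filter.
    Intended for psycopg-style %(name)s parameters."""
    return (
        "SELECT content, vector_embedding <=> %(query_embedding)s AS distance "
        "FROM public.source_file_chunks "
        "WHERE content ILIKE %(keyword_pattern)s "
        f"ORDER BY distance LIMIT {int(limit)}"
    )
```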

Hey @tylertaewook, thanks for reaching out and supporting Agno. I’ve shared this with the team; we’re working through all requests one by one and will get back to you soon.
If it’s urgent, please let us know. We appreciate your patience!

Hey @tylertaewook, to help us dig in further, would you be able to provide the debug logs?
If you set debug_mode=True in the Agent config you’ll see them in the terminal. I want to check how many tool calls the agent is making and whether any of them are repeated.
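For reference, the flag goes directly on the Agent constructor (a fragment of the setup above, same imports assumed):

```python
reasoning_agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    tools=[fetch_chunks],
    debug_mode=True,  # prints per-step debug logs, including tool calls and metrics
)
```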

Ideally, ThinkingTools shouldn’t consume this many tokens.

Also, where exactly are you monitoring the tokens?
In the debug logs we also log metrics (for both messages and tools). Can you cross-check whether the numbers you mentioned match the ones logged there?