I use the agent to run the same logic with the same prompt. Why is the agent so much slower than calling the LLM model directly?
Hey @shire
Thank you for reaching out.
Can you please share the agent config with us so we can assist you better? Are you using stream=True? And how exactly are you calling the LLM model directly?
from textwrap import dedent
import time

from agno.agent import Agent
from agno.models.openai import OpenAILike
from agno.utils.pprint import pprint_run_response

reasoning_agent = Agent(
    model=OpenAILike(
        id="",
        api_key="",
        base_url="",
    ),
    stream=True,
    debug_mode=True,
    instructions=dedent("""
        You're a text classifier. You need to categorize the user's questions into 2 categories, namely: simple/complex
        Here's a description of each category:
        --------------------
        Category: simple
        Description: This type of question is used for information query, search, retrieval, and obtaining details, commonly used for directly querying specific data or detailed content.
        --------------------
        Category: complex
        Description: This type of question is used for statistics, analysis, aggregation, comparison, trends, and summary operations, typically requiring processing of large amounts of data or multi-field, multi-condition analysis.
        You could learn from the following examples:
        - Question: Query transaction details for user Zhang San. Category: simple
        - Question: Please provide sales records for product A over the past year. Category: simple
        - Question: Analyze Zhang San's transaction trend changes over the past three months. Category: complex
        - Question: Compare sales changes across regions between 2023 and 2024. Category: complex
        Just mention the category name, no need for any additional words.
        """),
)
questions = [
    "What is the software number for SZ205728 station PRS-753A-DA-G (16th version)?",
    "What is the working power supply for SZ143036-2 station? What is the rated current of the protection device?",
    "What is the working power supply for SZ152195 station? What is the rated current of the protection device?",
    "What is the working power supply for SZ152323-1 station? What is the rated current of the protection device?",
    "What is the working power supply for SZ160273-2 station? What is the rated current of the protection device?",
]
for q in questions:
    t0 = time.time()
    response = reasoning_agent.run(q, stream=True)
    pprint_run_response(response, markdown=True)
    print(f"elapsed: {time.time() - t0:.2f}s")  # report per-question latency
Yes, I set stream=True as well. Here is a code example that calls the LLM directly:

import time

from openai import OpenAI

messages = [
    {"role": "system", "content": prompt},  # `prompt` holds the same classifier instructions used for the agent
    {"role": "user", "content": "What is the working power supply for SZ143036-2 station? What is the rated current of the protection device?"},
]

start = time.time()
client1 = OpenAI(api_key="", base_url="")
response = client1.chat.completions.create(
    model="",
    messages=messages,
    stream=False,
    n=3,
    temperature=0.7,
)
print(f"elapsed: {time.time() - start:.2f}s")
Hey @shire,
Our overhead is very minimal. It also depends on what exactly you're comparing. Our print_response might appear slower because it uses rich for pretty-printing, but that's just for debugging purposes.
I also noticed you're using the reasoning agent, while the native call returns a plain response directly; that could be one of the reasons for the speed difference.
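If you want to separate printing cost from model latency, one rough check is to time a run without pprint_run_response and just consume the streamed events. Here is a minimal sketch under that assumption; it reuses the reasoning_agent defined above, and the exact shape of the streamed events may vary with your agno version:

import time

question = "What is the working power supply for SZ152195 station? What is the rated current of the protection device?"

t0 = time.time()
first_chunk_at = None
chunks = []
# Consume the stream directly instead of pretty-printing it with rich.
for event in reasoning_agent.run(question, stream=True):
    if event.content:
        if first_chunk_at is None:
            first_chunk_at = time.time()
        chunks.append(event.content)

print("".join(chunks))
if first_chunk_at is not None:
    print(f"time to first chunk: {first_chunk_at - t0:.2f}s")
print(f"total time: {time.time() - t0:.2f}s")

That should tell you whether the extra seconds are spent before the first chunk arrives (agent setup and request overhead) or while rendering the output.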
Actually, I did measure the times: the agent run is more than 2 seconds slower than the direct call, and complex tasks are even slower.