I use the agent to run the same logic with the same prompt. Why is the agent so much slower than calling the LLM model directly?
Hey @shire
Thank you for reaching out.
Can you please share the agent config with us so we can assist you better? Are you using stream=True? And how exactly are you calling the LLM model directly?
from textwrap import dedent
import time

from agno.agent import Agent
from agno.models.openai import OpenAILike
from agno.utils.pprint import pprint_run_response

reasoning_agent = Agent(
    model=OpenAILike(
        id="",
        api_key="",
        base_url="",
    ),
    stream=True,
    debug_mode=True,
    instructions=dedent("""
        You're a text classifier. You need to categorize the user's questions into 2 categories, namely: simple/complex
        Here's a description of each category:
        --------------------
        Category: simple
        Description: This type of question is used for information query, search, retrieval, and obtaining details, commonly used for directly querying specific data or detailed content.
        --------------------
        Category: complex
        Description: This type of question is used for statistics, analysis, aggregation, comparison, trends, and summary operations, typically requiring processing of large amounts of data or multi-field, multi-condition analysis.
        You could learn from the following examples:
        - Question: Query transaction details for user Zhang San. Category: simple
        - Question: Please provide sales records for product A over the past year. Category: simple
        - Question: Analyze Zhang San's transaction trend changes over the past three months. Category: complex
        - Question: Compare sales changes across regions between 2023 and 2024. Category: complex
        Just mention the category name, no need for any additional words.
        """),
)
questions = [
    "What is the software number for SZ205728 station PRS-753A-DA-G (16th version)?",
    "What is the working power supply for SZ143036-2 station? What is the rated current of the protection device?",
    "What is the working power supply for SZ152195 station? What is the rated current of the protection device?",
    "What is the working power supply for SZ152323-1 station? What is the rated current of the protection device?",
    "What is the working power supply for SZ160273-2 station? What is the rated current of the protection device?",
]
for q in questions:
    t0 = time.time()
    response = reasoning_agent.run(q, stream=True)
    pprint_run_response(response, markdown=True)
    print(f"elapsed: {time.time() - t0:.2f}s")  # report per-question latency
Yes, I set stream=True as well. Here is a code example that calls the LLM directly:

import time

from openai import OpenAI

messages = [
    {"role": "system", "content": prompt},  # `prompt` holds the same classifier instructions used for the agent
    {"role": "user", "content": "What is the working power supply for SZ143036-2 station? What is the rated current of the protection device?"},
]

start = time.time()
client1 = OpenAI(api_key="", base_url="")
response = client1.chat.completions.create(
    model="",
    messages=messages,
    stream=False,
    n=3,
    temperature=0.7,
)
print(f"elapsed: {time.time() - start:.2f}s")
Hey @shire,
Our overhead is very minimal. It also depends on what exactly you're comparing. Our print_response might appear slower because it uses rich for pretty-printing, but that's just for debugging purposes.
I also noticed you're using the reasoning agent, while the native call returns a plain response directly; that could be one of the reasons for the speed difference.
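If you want to separate printing cost from model latency, one rough check is to time a run without pprint_run_response and just consume the streamed events. Here is a minimal sketch under that assumption; it reuses the reasoning_agent defined above, and the exact shape of the streamed events may vary with your agno version:

import time

question = "What is the working power supply for SZ152195 station? What is the rated current of the protection device?"

t0 = time.time()
first_chunk_at = None
chunks = []
# Consume the stream directly instead of pretty-printing it with rich.
for event in reasoning_agent.run(question, stream=True):
    if event.content:
        if first_chunk_at is None:
            first_chunk_at = time.time()
        chunks.append(event.content)

print("".join(chunks))
if first_chunk_at is not None:
    print(f"time to first chunk: {first_chunk_at - t0:.2f}s")
print(f"total time: {time.time() - t0:.2f}s")

That should tell you whether the extra seconds are spent before the first chunk arrives (agent setup and request overhead) or while rendering the output.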
Actually, I did measure the times: the agent run is more than 2 seconds slower than the direct call, and complex tasks are even slower.