While reviewing the Agno evaluation module, I noticed that when AccuracyEval and ReliabilityEval log results to the database, the stored model_id and model_provider correspond to the evaluated Agent/Team’s model, not the judge (evaluation) model.
Relevant source:
- @.venv/Lib/site-packages/agno/eval/reliability.py:264-292
    if self.db:
        if self.agent_response is not None:
            agent_id = self.agent_response.agent_id
            team_id = None
            model_id = self.agent_response.model
            model_provider = self.agent_response.model_provider
        elif self.team_response is not None:
            agent_id = None
            team_id = self.team_response.team_id
            model_id = self.team_response.model
            model_provider = self.team_response.model_provider
        eval_input = {
            "expected_tool_calls": self.expected_tool_calls,
        }
        log_eval_run(
            db=self.db,
            run_id=self.eval_id,  # type: ignore
            run_data=asdict(self.result),
            eval_type=EvalType.RELIABILITY,
            name=self.name if self.name is not None else None,
            agent_id=agent_id,
            team_id=team_id,
            model_id=model_id,
            model_provider=model_provider,
            eval_input=eval_input,
        )
- @.venv/Lib/site-packages/agno/eval/accuracy.py:567-602
    # Log results to the Agno DB if requested
    if self.agent is not None:
        agent_id = self.agent.id
        team_id = None
        model_id = self.agent.model.id if self.agent.model is not None else None
        model_provider = self.agent.model.provider if self.agent.model is not None else None
        evaluated_component_name = self.agent.name
    elif self.team is not None:
        agent_id = None
        team_id = self.team.id
        model_id = self.team.model.id if self.team.model is not None else None
        model_provider = self.team.model.provider if self.team.model is not None else None
        evaluated_component_name = self.team.name
    if self.db:
        log_eval_input = {
            "additional_guidelines": self.additional_guidelines,
            "additional_context": self.additional_context,
            "num_iterations": self.num_iterations,
            "expected_output": self.expected_output,
            "input": self.input,
        }
        log_eval_run(
            db=self.db,
            run_id=self.eval_id,  # type: ignore
            run_data=asdict(self.result),
            eval_type=EvalType.ACCURACY,
            agent_id=agent_id,
            team_id=team_id,
            model_id=model_id,
            model_provider=model_provider,
            name=self.name if self.name is not None else None,
            evaluated_component_name=evaluated_component_name,
            eval_input=log_eval_input,
        )
My understanding:
- The evaluation is performed by a judge model in some setups.
- But the DB fields record the evaluated model’s identity instead.
Questions:
- Is this an intentional design choice (i.e., eval results should be attributed to the evaluated model)?
- If so, where should the judge model metadata be captured, if at all?
- If not intentional, would it make sense to add judge model info to eval logs?
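To make the last question concrete, here is a minimal sketch of one option I can imagine: carrying the judge model's identity inside the eval_input dict that is already passed to log_eval_run. This is not Agno's API. JudgeModel, add_judge_metadata, and the judge_model_id / judge_model_provider keys are hypothetical names I made up for illustration, and the model id and provider strings are placeholders.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class JudgeModel:
    """Hypothetical stand-in for the evaluator (judge) model; not an Agno class."""

    id: str
    provider: str


def add_judge_metadata(
    eval_input: Dict[str, Any], judge: Optional[JudgeModel]
) -> Dict[str, Any]:
    """Return a copy of eval_input, with judge model fields added if a judge is set."""
    enriched = dict(eval_input)  # shallow copy so the caller's dict is not mutated
    if judge is not None:
        # Hypothetical keys; dedicated DB columns might be a better fit.
        enriched["judge_model_id"] = judge.id
        enriched["judge_model_provider"] = judge.provider
    return enriched


# Placeholder values, only to show the shape of the enriched payload.
log_eval_input = add_judge_metadata(
    {"input": "What is 2 + 2?", "expected_output": "4", "num_iterations": 1},
    JudgeModel(id="judge-model-id", provider="judge-provider"),
)
```

With something like this, model_id and model_provider would keep describing the evaluated Agent/Team, and the judge would be recorded next to the other eval inputs. Whether that belongs in eval_input or in separate fields is exactly what I am unsure about.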
Thanks for any clarification!