How to implement a text2image model

I wanted to implement text2image generation with Agno.

At least both Cloudflare and Together support the FLUX.1-schnell text2image model.

The only example of text2image is DALL·E, but I don’t understand why it is implemented as a tool.
For me, text2image (and also text2audio and text2video) are full AI models in their own right, not LLM tools.

In the base class (`class Model(ABC)`) I see some fields for an image response:
```python
class MessageData:
    response_role: Optional[Literal["system", "user", "assistant", "tool"]] = None
    response_content: Any = ""
    response_thinking: Any = ""
    response_redacted_thinking: Any = ""
    response_citations: Optional[Citations] = None
    response_tool_calls: List[Dict[str, Any]] = field(default_factory=list)

    response_audio: Optional[AudioResponse] = None
    response_image: Optional[ImageArtifact] = None  # <- this one
```

But searching the code, I did not find any place where they are actually used.

Any advice on how to implement text2image in Agno?

Kind regards

Neuromancien

Hi @neuromancien-net, thanks for reaching out and supporting Agno. I’ve shared this with the team; we’re working through all requests one by one and will get back to you soon. If it’s urgent, please let us know. We appreciate your patience!

Wednesday, 9 April

Hi! Great question — you’re absolutely right that text-to-image (as well as text-to-audio/video) are model capabilities in their own right, and not just tools in the traditional sense.

Why it’s implemented as a tool (for now)

The Agno agent framework is centered around chat models as the core driver. Everything else (like image generation) is modeled as a tool that the agent can invoke. This simplifies the control flow — the chat model reasons about the task and then delegates specific parts (like generating an image) to tools like generate_image.
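
For context, here is roughly what that looks like today with the built-in DALL·E tool. This is a minimal sketch based on the cookbook pattern; it assumes `agno.tools.dalle.DalleTools` and `agno.models.openai.OpenAIChat` are available in your version and that `OPENAI_API_KEY` is set:

```python
# Minimal sketch: a chat model delegating image generation to a tool.
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.tools.dalle import DalleTools

agent = Agent(
    model=OpenAIChat(id="gpt-4o"),   # the chat model that decides when to call the tool
    tools=[DalleTools()],            # DALL·E image generation exposed as a tool
    markdown=True,
)

# The chat model reasons about the request, then invokes the image tool.
agent.print_response("Generate an image of a white siamese cat")
```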

We do use `response_image` and `response_audio` internally to store images and audio generated by the model; they end up in `ModelResponse` and can be accessed:

```python
audio: Optional[AudioResponse] = None
image: Optional[ImageArtifact] = None
```
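
As a rough illustration of how those artifacts can be read back after a run (a sketch only; the exact attribute names, e.g. `run_response.images` and `ImageArtifact.url`, may differ between Agno versions, so treat them as assumptions):

```python
# Hedged sketch: reading generated image artifacts back from a run.
run_response = agent.run("Generate an image of a lighthouse at sunset")

# `images` is assumed to be a list of ImageArtifact objects (may be None or empty).
if run_response.images:
    for image in run_response.images:
        print(image.url)  # ImageArtifact is assumed to expose a URL (or raw bytes)
```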

If you want to implement a proper text2image model:

You’re thinking in the right direction. Here’s a path forward:

Wrap your model as a Tool (recommended today)

  • Create a tool that wraps the FLUX.1-schnell text2image model (like generate_image does for DALL·E); a sketch follows below.
  • That tool can then be used by an agent via tool calling.

This aligns with how most multimodal functionality is currently handled. But obviously, feel free to explore other approaches.
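
Here is a rough sketch of what such a tool could look like. Assumptions: the Together image endpoint `https://api.together.xyz/v1/images/generations`, its request/response shape, and that a plain Python function can be registered in `tools=[...]`; please verify these against the Together and Agno docs.

```python
# Hedged sketch: wrapping FLUX.1-schnell (via Together) as a custom Agno tool.
# The endpoint URL, payload fields and response shape are assumptions based on
# Together's OpenAI-compatible image API; check their documentation.
import os
import requests

from agno.agent import Agent
from agno.models.openai import OpenAIChat  # any chat model with tool calling


def generate_flux_image(prompt: str) -> str:
    """Generate an image from a text prompt with FLUX.1-schnell and return its URL."""
    response = requests.post(
        "https://api.together.xyz/v1/images/generations",  # assumed endpoint
        headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
        json={
            "model": "black-forest-labs/FLUX.1-schnell",
            "prompt": prompt,
            "width": 1024,
            "height": 768,
            "steps": 4,
            "n": 1,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["data"][0]["url"]  # assumed response shape


agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    tools=[generate_flux_image],  # plain functions are assumed to be accepted as tools
    markdown=True,
)
agent.print_response("Create an image of a lighthouse at sunset")
```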

Thanks for the answer.

I will have a look and will try to replicate the DALL·E code.

But if I understand correctly, this assumes that the call will be made by the LLM (and therefore the model must support tool calls).

This may not always be the case, depending on the model/platform.

Kind regards

neuromancien

Yes, you’re right: it does depend on the model. If the model does not support function calling well, it won’t be able to use tools properly. For example, Gemini 2.5 Flash, Llama 7B, and some other smaller open-source models are weak at function calling, so tools will be a bit finicky with them.