RAG multimodal (TEXT, IMAGE, TABLES)

Hi Team,
I am trying to build a multimodal RAG pipeline using Ollama scout and the Ollama nomic text embedder.
Results are not good when many images are involved. Any suggestions on which open-source embedder to use to get better results?

Regards
Sathish

Hey @sathishkumar.chin, thanks for reaching out and supporting Agno. I’ve shared this with the team; we’re working through all requests one by one and will get back to you soon.
If it’s urgent, please let us know. We appreciate your patience!

Hi @sathishkumar.chin, you can try these options instead:

  • CLIP via SentenceTransformerEmbedder
    Great for image–text alignment. CLIP maps images and text into the same vector space, so a model like sentence-transformers/clip-ViT-B-32 can embed images (or their captions) alongside your text queries.
  • BLIP-2 + Text Embedder (Two-stage)
    Use BLIP-2 or GritCaption to generate detailed captions for images, then embed those captions using a strong text model like bge-small-en-v1.5 via FastEmbedEmbedder or SentenceTransformerEmbedder.
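
Here is a rough sketch of how either option could be wired up for retrieval. It is not Agno's API. It calls the sentence-transformers library directly, and the helper names (build_index, search, load_clip_embedder) are made up for illustration. It assumes the image captions have already been produced by a captioning model such as BLIP-2, and that sentence-transformers is installed (the model is downloaded the first time load_clip_embedder is called).

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Embedder = Callable[[str], Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(
    items: List[Tuple[str, str]], embed: Embedder
) -> List[Tuple[str, str, Vector]]:
    """Embed (doc_id, text) pairs; text is a text chunk or an image caption."""
    return [(doc_id, text, embed(text)) for doc_id, text in items]


def search(
    index: List[Tuple[str, str, Vector]],
    query: str,
    embed: Embedder,
    top_k: int = 3,
) -> List[Tuple[str, str, float]]:
    """Return the top_k (doc_id, text, score) hits ranked by cosine similarity."""
    q = embed(query)
    scored = [(doc_id, text, cosine(q, vec)) for doc_id, text, vec in index]
    scored.sort(key=lambda hit: hit[2], reverse=True)
    return scored[:top_k]


def load_clip_embedder(model_name: str = "clip-ViT-B-32"):
    """Hypothetical helper: wraps a sentence-transformers CLIP model.

    The returned function accepts a string or a PIL image, because CLIP
    places both in the same vector space.
    """
    # Imported lazily so the rest of the sketch works without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)

    def embed(item):
        return model.encode(item).tolist()

    return embed
```

For the two-stage route, pass the generated captions to build_index together with your text chunks and swap in any text embedder (for example a bge model) for the embed argument. For the CLIP route, embed = load_clip_embedder() can also be called on PIL.Image.open(path) to embed the image itself, and those vectors can be stored in the same index as the text vectors.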

Thanks