Request for Advice on Improving Accuracy of Multi-source RAG Implementation

Hello Agno Team,

I hope this message finds you well. I am currently building a Retrieval-Augmented Generation (RAG) system using Agno and would like your expert advice on improving the accuracy of my implementation. Here is a detailed overview of my current setup:

Overview:

  • I have two separate Knowledge Bases:
    1. PDF Knowledge Base: Approximately 1,000 PDF documents, each about 10 pages in length.
    2. CSV Knowledge Base: A single CSV file with approximately 10,000 rows of 5 columns each, spanning diverse content types.
  • These two sources are combined using CombinedKnowledgeBase.
  • Vector embeddings are generated with OpenAIEmbedder using the text-embedding-3-large model.
  • Retrieval uses PgVector with hybrid search.
  • Reranking is performed with Cohere’s rerank-v3.5.
  • The agent uses the OpenAIChat model gpt-4o.

Despite this setup, retrieval accuracy, especially for the CSV data, falls short of what I expected. I suspect the heterogeneous nature of the CSV content is a major factor, and the combined retrieval across the PDF and CSV sources also leaves room for improvement.
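To make the CSV problem concrete, here is a simplified, synthetic sketch (column names and values are invented, not my real data) of how each row ends up as one flat text chunk before embedding:

```python
import csv
import io

# Synthetic sample mimicking my heterogeneous CSV: every row is
# serialized and embedded the same way, whatever its content type.
SAMPLE = """id,title,category,body
1,Refund policy,policy,Customers may request a refund within 30 days.
2,Q3 revenue,finance,Revenue grew 12% quarter over quarter.
"""

def row_to_chunk(row: dict) -> str:
    """Flatten a CSV row into the single text chunk that gets embedded."""
    return ", ".join(f"{key}: {value}" for key, value in row.items())

chunks = [row_to_chunk(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
for chunk in chunks:
    print(chunk)
```

Rows about completely different topics share the same flat "key: value" shape, which I suspect dilutes the embedding signal and hurts retrieval.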

Here is the key part of my code (sensitive info masked):

# Imports (module paths per my installed Agno version)
import os
from pathlib import Path

from agno.agent import Agent
from agno.document.chunking.semantic import SemanticChunking
from agno.embedder.openai import OpenAIEmbedder
from agno.knowledge.combined import CombinedKnowledgeBase
from agno.knowledge.csv import CSVKnowledgeBase
from agno.knowledge.pdf import PDFKnowledgeBase, PDFReader
from agno.models.openai import OpenAIChat
from agno.vectordb.pgvector import PgVector, SearchType

DATABASE_URL = "postgresql+psycopg://<user>:<password>@localhost:5432/<database>"
COHERE_API_KEY = os.getenv("COHERE_API_KEY", "<masked>")

# PDF Knowledge Base
pdf_kb = PDFKnowledgeBase(
    path="/path/to/pdf_files",
    vector_db=PgVector(
        table_name="pdf_documents_semantic_chunking",
        db_url=DATABASE_URL,
        embedder=OpenAIEmbedder(id="text-embedding-3-large"),
    ),
    reader=PDFReader(chunk=True),
    chunking_strategy=SemanticChunking()
)

# CSV Knowledge Base
csv_kb = CSVKnowledgeBase(
    path=Path("/path/to/csv_file.csv"),
    vector_db=PgVector(
        table_name="csv_documents_semantic_chunking",
        db_url=DATABASE_URL,
        embedder=OpenAIEmbedder(id="text-embedding-3-large"),
    ),
    chunking_strategy=SemanticChunking()
)

# Combined Knowledge Base
knowledge_base = CombinedKnowledgeBase(
    sources=[pdf_kb, csv_kb],
    vector_db=PgVector(
        table_name="combined_documents_semantic_chunking",
        db_url=DATABASE_URL,
        search_type=SearchType.hybrid,
        embedder=OpenAIEmbedder(id="text-embedding-3-large"),
    ),
    chunking_strategy=SemanticChunking()
)

# Agent initialization
agent = Agent(
    name="my_agent",
    model=OpenAIChat(id="gpt-4o"),
    description="Internal document retrieval agent.",
    knowledge=knowledge_base,
    tools=[],
    show_tool_calls=True,
    markdown=True
)

Specific questions:

  1. Given the diverse and large-scale nature of my CSV data, what strategies or modifications would you recommend to enhance retrieval accuracy?
  2. Are there alternative chunking or embedding strategies within Agno or externally that might better handle heterogeneous CSV content?
  3. Any additional recommendations for improving overall precision when combining multiple knowledge sources?
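For context on question 3, one direction I have been considering (but have not implemented) is routing each query to a single source before retrieval, instead of always searching the combined table. A toy sketch of the idea, with placeholder hint keywords rather than my real schema:

```python
# Toy sketch of per-source query routing: decide which knowledge base
# to search based on simple query features. In practice this could be
# an LLM classifier; the hint keywords here are placeholders.
CSV_HINTS = {"row", "column", "price", "sku", "table"}

def route(query: str) -> str:
    """Return 'csv' for tabular-looking queries, else 'pdf'."""
    words = set(query.lower().split())
    return "csv" if words & CSV_HINTS else "pdf"

print(route("What is the price of SKU 1234?"))
print(route("Summarize the onboarding handbook"))
```

I would be glad to hear whether a routing step like this fits Agno's design, or whether there is a built-in alternative I should use instead.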

Your guidance would be greatly appreciated. Thank you very much for your support.