Request for Advice on Improving Accuracy of Multi-source RAG Implementation

Dear Agno Support Team,

I’m currently developing a RAG-based search application tailored for store staff to quickly retrieve specific information from internal documents (mainly PDFs converted from PowerPoint slides, plus CSV files). The goal is high retrieval accuracy: staff enter a query and receive the five most relevant documents, each paired with the precise excerpt that matched.

Context:

  • Cost Target: Maximum of ¥0.5 per query. Current costs (using OpenAI embeddings and Cohere re-ranking) are around ¥0.3 per query.
  • Document Details:
    • PDFs: 1,000 files (around 10 pages each), originally from PowerPoint, so the extracted text is often fragmented into single words or short phrases.
    • CSVs: Approximately 10,000 rows across 5 columns.
  • Accuracy Measurement: Evaluated by a spreadsheet containing expected input-output pairs (queries vs. correct files). If the expected document appears in the top five results, it counts as correct.
  • Here is the key part of my code (sensitive info masked):
# NOTE: import paths assume a recent Agno module layout; adjust to your installed version
import os
from pathlib import Path

import cohere

from agno.agent import Agent
from agno.document.chunking.agentic import AgenticChunking
from agno.document.reader.pdf_reader import PDFImageReader
from agno.embedder.openai import OpenAIEmbedder
from agno.knowledge.combined import CombinedKnowledgeBase
from agno.knowledge.csv import CSVKnowledgeBase
from agno.knowledge.pdf import PDFKnowledgeBase
from agno.models.openai import OpenAIChat
from agno.vectordb.pgvector import PgVector, SearchType

DATABASE_URL = "postgresql+psycopg://USERNAME:PASSWORD@localhost:5432/DBNAME"
COHERE_API_KEY = os.getenv("COHERE_API_KEY", "MASKED_API_KEY")

co = cohere.ClientV2(api_key=COHERE_API_KEY)

agent_description = "You are an agent for searching internal documents at Ryohin Keikaku."

# PDF Knowledge Base (PDF files are loaded from a local directory)
pdf_kb = PDFKnowledgeBase(
    # Path to the PDF files (please adjust as needed)
    path="/path/to/your/pdf_files_directory",
    vector_db=PgVector(
        table_name="pdf_documents_agentic_chunking",
        db_url=DATABASE_URL
    ),
    reader=PDFImageReader(chunk=True),
    chunking_strategy=AgenticChunking()
)

# CSV Knowledge Base (CSV files are loaded from a separate directory)
csv_kb = CSVKnowledgeBase(
    # Path to the CSV file (please adjust as needed)
    path=Path("/path/to/your/csv_file_directory/filtered_qast.csv"),
    vector_db=PgVector(
        table_name="csv_documents_agentic_chunking",
        db_url=DATABASE_URL
    ),
    chunking_strategy=AgenticChunking()
)

# Combined Knowledge Base: integrates both PDF and CSV sources
knowledge_base = CombinedKnowledgeBase(
    sources=[
        pdf_kb,
        csv_kb,
    ],
    vector_db=PgVector(
        table_name="combined_documents_agentic_chunking",
        db_url=DATABASE_URL,
        search_type=SearchType.hybrid,
        embedder=OpenAIEmbedder(id="text-embedding-3-small"),
    ),
    chunking_strategy=AgenticChunking()
)

# Create the Agent
agent = Agent(
    name="mujigram_agent",
    model=OpenAIChat(id="gpt-4o"),
    description=agent_description,
    knowledge=knowledge_base,
    tools=[],
    show_tool_calls=True,
    markdown=True
)

# Load the knowledge base (recreate=True drops and re-embeds everything,
# so run it only on the initial load or after changing chunking/embeddings)
if agent.knowledge is not None:
    agent.knowledge.load(recreate=True)


def rerank_with_cohere(query, docs, top_n=None):
    """
    Uses Cohere's rerank API to re-evaluate and reorder candidate documents.
    For documents with multiple pages from the same file name, only the highest score is retained,
    and finally, the top 5 documents are displayed.
    """
    documents_texts = [doc.content for doc in docs]
    if top_n is None:
        top_n = len(documents_texts)

    rerank_result = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=documents_texts,
        top_n=top_n,
        return_documents=True
    )

    reranked_docs_with_scores = []
    for res in rerank_result.results:
        idx = res.index
        if idx < len(docs):
            doc = docs[idx]
            score = res.relevance_score
            reranked_docs_with_scores.append((doc, score))

    # For the same file name, retain only the document with the highest score
    unique_docs = {}
    for doc, score in reranked_docs_with_scores:
        file_key = doc.name  # Assuming documents with the same file name belong together
        if file_key not in unique_docs or score > unique_docs[file_key][1]:
            unique_docs[file_key] = (doc, score)

    # Sort in descending order of score and extract the top 5 documents
    unique_docs_list = list(unique_docs.values())
    unique_docs_list.sort(key=lambda x: x[1], reverse=True)
    final_docs = [doc for doc, score in unique_docs_list[:5]]

    print("=== Top 5 Documents ===")
    for doc, score in unique_docs_list[:5]:
        print(f"{doc.name} (score: {score:.3f})")
        print("Content:", doc.content[:600])

    return final_docs
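For reference, the evaluation loop behind the accuracy numbers below can be sketched as follows; `evaluate_top5` and the stub retriever are illustrative names I made up, not part of the production code:

```python
# Sketch of the spreadsheet-based evaluation: each test case pairs a query
# with the expected file; a case counts as correct if that file appears in
# the top five retrieved file names.

def evaluate_top5(test_cases, retrieve):
    """test_cases: list of (query, expected_file); retrieve: query -> ranked file names."""
    correct = 0
    for query, expected_file in test_cases:
        if expected_file in retrieve(query)[:5]:
            correct += 1
    return correct, len(test_cases)

# Example with a stubbed retriever:
cases = [("return policy", "policy.pdf"), ("store hours", "hours.pdf")]
stub = lambda q: ["policy.pdf", "other.pdf"] if "policy" in q else ["misc.pdf"]
print(evaluate_top5(cases, stub))  # (1, 2)
```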

Current Issues:

  1. Low Accuracy: Currently, accuracy is below 50% (16/33 correct).
  2. Misaligned Chunks: Correct documents sometimes appear in results, but the specific chunk content is slightly off from the intended reference, raising concerns about coincidental correctness.
  3. Poor CSV Performance: CSV file search accuracy is extremely low (0%), often referencing completely irrelevant data.

Potential Solutions under Consideration:

  • Evaluating alternative chunking methods (particularly Document, Semantic, and Recursive, since Fixed Size and Agentic both reproduced the same issues).
  • Testing different embedding models available via OpenAI.
  • Improving data quality through thorough data cleaning (currently underway).
  • Possibly summarizing content at the page level prior to embedding to reduce noise and enhance relevance.
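To make the last bullet concrete, the page-level preprocessing I have in mind would look roughly like this; `page_level_passages` is a name I made up, and the `summarize` hook stands in for an LLM call rather than any library API:

```python
# Join each page's fragmented slide text into one passage before embedding;
# optionally pass the joined text through an LLM summarizer to cut noise.

def page_level_passages(pages, summarize=None):
    """pages: list of per-page fragment lists -> one passage string per page."""
    passages = []
    for fragments in pages:
        text = " ".join(f.strip() for f in fragments if f.strip())
        if summarize is not None:
            text = summarize(text)  # placeholder for an LLM summarization call
        passages.append(text)
    return passages

pages = [["Returns", "", "within 30 days", "with receipt"],
         ["Store hours", "10:00-21:00"]]
print(page_level_passages(pages))
# ['Returns within 30 days with receipt', 'Store hours 10:00-21:00']
```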

Given these points, I’d like your advice on:

  • The effectiveness of each chunking strategy, specifically for slide-based PDFs and tabular CSV data.
  • Recommended embedding models or methods particularly suitable for fragmented content (e.g., slides).
  • Any experience or recommendations regarding pre-summarization of content as a preprocessing step.
  • General best practices or alternative approaches to enhance overall retrieval accuracy, particularly for CSV data.

Your guidance would be greatly appreciated. Thank you very much for your support.

@Monali Hello Monali, could you please assist us with this matter as well? It’s a bit urgent; apologies for rushing you, and thank you in advance.

Hi @Kenniferm,
Apologies for the late reply. I’ve looped in the right engineers to help with your question. We will try to resolve your query as soon as possible.

@Monali
Hi, I have been waiting for your support for two days now. Please assist as soon as possible. Thanks.

Hi @Kenniferm,
Apologies for the delay.
I will get an engineer on this ASAP.

Hello @Kenniferm !

Since we’re working with a large dataset, I recommend testing each knowledge base individually. This approach will help us determine whether we’re reaching the context limit for RAG or if our data needs cleaning. By isolating each knowledge base, we can better assess its performance, identify bottlenecks, and optimize accordingly.
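As a rough sketch of what I mean by isolating each source (this assumes the pdf_kb/csv_kb objects from your code above; the exact search signature may differ slightly between Agno versions):

```python
# Query each knowledge base on its own and inspect the raw hits before
# adding the combined layer and the re-ranker on top.
for kb in (pdf_kb, csv_kb):
    results = kb.search("How do I process a return?", num_documents=10)
    for doc in results:
        print(doc.name, "->", doc.content[:120])
```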

Additionally,

  • Evaluating alternative chunking methods (particularly Document, Semantic, and Recursive, since Fixed Size and Agentic both reproduced the same issues).

I would recommend trying strategies such as FixedSizeChunking and SemanticChunking. This is a demanding use case, so I believe SemanticChunking in particular should work well.
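A sketch of wiring SemanticChunking in (the module path and parameters are from memory of recent Agno releases, so please verify them against your installed version):

```python
from agno.document.chunking.semantic import SemanticChunking
from agno.embedder.openai import OpenAIEmbedder

chunking_strategy = SemanticChunking(
    embedder=OpenAIEmbedder(id="text-embedding-3-small"),
    similarity_threshold=0.5,  # lower values merge more text into each chunk
)
```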

  • Recommended embedding models or methods particularly suitable for fragmented content (e.g., slides).

text-embedding-3-small is great for most use cases, but I would suggest text-embedding-3-large here. It will increase your cost per query, but the potential accuracy gain is worth testing.
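Swapping the embedder is a small change on the vector db config, sketched below; note that you must re-embed everything (load(recreate=True)), and since text-embedding-3-large produces 3072-dimension vectors by default, the pgvector table needs recreating (the table name here is just an illustrative placeholder):

```python
from agno.embedder.openai import OpenAIEmbedder
from agno.vectordb.pgvector import PgVector, SearchType

vector_db = PgVector(
    table_name="combined_documents_large_embed",  # new table for the new dimensions
    db_url=DATABASE_URL,  # defined earlier in your script
    search_type=SearchType.hybrid,
    embedder=OpenAIEmbedder(id="text-embedding-3-large", dimensions=3072),
)
```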

  • Any experience or recommendations regarding pre-summarization of content as a preprocessing step.

Since we are dealing with a large amount of data, I would recommend a general cleaning pass: dropping unused columns, removing duplicates, and making sure only the relevant data is loaded.
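For example, a minimal cleaning pass over the CSV could look like this (the column names are invented for illustration; adapt them to your schema):

```python
import csv, io

def clean_rows(reader, keep_columns):
    """Keep only the retrieval-relevant columns; drop empty and duplicate rows."""
    seen, cleaned = set(), []
    for row in reader:  # reader: iterable of dicts, e.g. csv.DictReader
        slim = tuple(row[c].strip() for c in keep_columns)
        if all(slim) and slim not in seen:
            seen.add(slim)
            cleaned.append(dict(zip(keep_columns, slim)))
    return cleaned

raw = "question,answer,author\nQ1,A1,x\nQ1,A1,y\nQ2,A2,z\n"
print(clean_rows(csv.DictReader(io.StringIO(raw)), ["question", "answer"]))
# [{'question': 'Q1', 'answer': 'A1'}, {'question': 'Q2', 'answer': 'A2'}]
```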

  • General best practices or alternative approaches to enhance overall retrieval accuracy, particularly for CSV data.

This is a demanding use case for CSVs; let me do some testing on my own and get back to you. For PDFs, your approach is spot on, especially using PDFImageReader.

Please do let us know if you have any other questions; looking forward to hearing back from you.

Hi @yash, thank you so much for your reply, I appreciate your time in this.

If possible, I have a few more questions that I’d love to ask you.

I have two different data sources: PDF files and CSV files. As you can see from the code, I am combining both sources to create a single knowledge base in the end.

Currently, I am using either Fixed Size Chunking or Agentic Chunking. However, when applying these strategies to CSV files, chunks often break in the middle of a row, which is not ideal for my use case. Instead, I want to ensure that each row in the CSV file is treated as a separate chunk.

I have two main questions:

1. How can I enforce row-level chunking for CSV files?

Are there any recommended methods, references, or code examples I can refer to in order to achieve this?
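To make this concrete, here is the kind of row-level split I mean; the plain-Python part runs as-is, while the commented Agno wiring is my guess at the ChunkingStrategy interface, not a verified API:

```python
import csv, io

def rows_as_chunks(csv_text, delimiter=","):
    """Turn each CSV data row into one 'header: value' text chunk."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return ["; ".join(f"{k}: {v}" for k, v in row.items()) for row in reader]

# Hypothetical Agno wiring (unverified; check your version's base class):
# from agno.document import Document
# from agno.document.chunking.strategy import ChunkingStrategy
#
# class RowChunking(ChunkingStrategy):
#     def chunk(self, document):
#         return [Document(content=c, name=document.name)
#                 for c in rows_as_chunks(document.content)]

print(rows_as_chunks("q,a\nhours?,10-21\nreturns?,30 days\n"))
# ['q: hours?; a: 10-21', 'q: returns?; a: 30 days']
```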

2. Can I combine different chunking strategies into a single knowledge base?

I would like to apply Agentic or Semantic Chunking for PDF files, while using a different chunking approach (row-based) for CSV files. Eventually, I want to enable users to search across both sources in a unified way, retrieving the most relevant information regardless of its origin.

From my understanding, one approach would be to first process PDFs and CSVs separately with their respective chunking methods, and then merge them into a single knowledge base using CombinedKnowledgeBase. However, I am not necessarily attached to this approach.

If there’s a better way to structure the system—such as maintaining separate chunking methods but ensuring they can be queried seamlessly together—I would love to hear your suggestions.

Looking forward to your insights!

Best regards,

Hey @Kenniferm! You can use semantic chunking for your use case as it splits documents into smaller chunks by analyzing semantic similarity between text segments using embeddings.

Yes, you can use different chunking strategies across your knowledge bases. The chunking strategy is defined at the knowledge-base level, so if you have two knowledge bases, PDFKnowledgeBase and CSVKnowledgeBase, each can use its own chunking strategy.
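As a sketch of that layout (table names and paths are placeholders, the row-based strategy for CSVs is assumed to be custom or version-provided, and module paths should be checked against your installed Agno version):

```python
from agno.document.chunking.semantic import SemanticChunking
from agno.embedder.openai import OpenAIEmbedder
from agno.knowledge.combined import CombinedKnowledgeBase
from agno.knowledge.csv import CSVKnowledgeBase
from agno.knowledge.pdf import PDFKnowledgeBase
from agno.vectordb.pgvector import PgVector, SearchType

DATABASE_URL = "postgresql+psycopg://USERNAME:PASSWORD@localhost:5432/DBNAME"

pdf_kb = PDFKnowledgeBase(
    path="/path/to/your/pdf_files_directory",
    vector_db=PgVector(table_name="pdf_semantic", db_url=DATABASE_URL),
    chunking_strategy=SemanticChunking(),  # semantic chunks for slide-based PDFs
)

csv_kb = CSVKnowledgeBase(
    path="/path/to/your/csv_file_directory/filtered_qast.csv",
    vector_db=PgVector(table_name="csv_rows", db_url=DATABASE_URL),
    chunking_strategy=row_based_strategy,  # placeholder: your row-level strategy
)

# The combined layer only merges the sources; each keeps its own chunking.
knowledge_base = CombinedKnowledgeBase(
    sources=[pdf_kb, csv_kb],
    vector_db=PgVector(
        table_name="combined_docs",
        db_url=DATABASE_URL,
        search_type=SearchType.hybrid,
        embedder=OpenAIEmbedder(id="text-embedding-3-small"),
    ),
)
```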