Dear Agno Support Team,
I’m currently developing a RAG-based search application tailored for store staff to quickly retrieve specific information from internal documents (mainly PDFs converted from PowerPoint slides, plus CSV files). The goal is high accuracy: staff enter a query and receive the five most relevant documents, each paired with the exact content excerpt it was matched on.
Context:
- Cost Target: Maximum of ¥0.5 per query. Current costs (using OpenAI embeddings and Cohere re-ranking) are around ¥0.3 per query.
- Document Details:
  - PDFs: 1,000 files (around 10 pages each), originally exported from PowerPoint, so the text is often fragmented into single words or short phrases.
  - CSVs: approximately 10,000 rows across 5 columns.
- Accuracy Measurement: Evaluated against a spreadsheet of expected input-output pairs (query vs. correct file); if the expected document appears in the top five results, the query counts as correct.
- Here is the key part of my code (sensitive info masked):
```python
import os
from pathlib import Path

import cohere

# These imports assume Agno's module layout; adjust if your version differs.
from agno.agent import Agent
from agno.document.chunking.agentic import AgenticChunking
from agno.document.reader.pdf_reader import PDFImageReader
from agno.embedder.openai import OpenAIEmbedder
from agno.knowledge.combined import CombinedKnowledgeBase
from agno.knowledge.csv import CSVKnowledgeBase
from agno.knowledge.pdf import PDFKnowledgeBase
from agno.models.openai import OpenAIChat
from agno.vectordb.pgvector import PgVector, SearchType

DATABASE_URL = "postgresql+psycopg://USERNAME:PASSWORD@localhost:5432/DBNAME"
COHERE_API_KEY = os.getenv("COHERE_API_KEY", "MASKED_API_KEY")
co = cohere.ClientV2(api_key=COHERE_API_KEY)

agent_description = "You are an agent for searching internal documents at Ryohin Keikaku."

# PDF Knowledge Base (PDF files are loaded from a local directory)
pdf_kb = PDFKnowledgeBase(
    # Path to the PDF files (please adjust as needed)
    path="/path/to/your/pdf_files_directory",
    vector_db=PgVector(
        table_name="pdf_documents_agentic_chunking",
        db_url=DATABASE_URL,
    ),
    reader=PDFImageReader(chunk=True),
    chunking_strategy=AgenticChunking(),
)

# CSV Knowledge Base (loads a single CSV file)
csv_kb = CSVKnowledgeBase(
    # Path to the CSV file (please adjust as needed)
    path=Path("/path/to/your/csv_file_directory/filtered_qast.csv"),
    vector_db=PgVector(
        table_name="csv_documents_agentic_chunking",
        db_url=DATABASE_URL,
    ),
    chunking_strategy=AgenticChunking(),
)

# Combined Knowledge Base: integrates both PDF and CSV sources
knowledge_base = CombinedKnowledgeBase(
    sources=[
        pdf_kb,
        csv_kb,
    ],
    vector_db=PgVector(
        table_name="combined_documents_agentic_chunking",
        db_url=DATABASE_URL,
        search_type=SearchType.hybrid,
        embedder=OpenAIEmbedder(id="text-embedding-3-small"),
    ),
    chunking_strategy=AgenticChunking(),
)

# Create the Agent
agent = Agent(
    name="mujigram_agent",
    model=OpenAIChat(id="gpt-4o"),
    description=agent_description,
    knowledge=knowledge_base,
    tools=[],
    show_tool_calls=True,
    markdown=True,
)

# Load the Knowledge Base if needed (e.g., during the initial run)
if agent.knowledge is not None:
    agent.knowledge.load(recreate=True)


def rerank_with_cohere(query, docs, top_n=None):
    """
    Uses Cohere's rerank API to re-evaluate and reorder candidate documents.
    When multiple pages/chunks come from the same file, only the highest score
    is retained, and finally the top 5 documents are displayed.
    """
    documents_texts = [doc.content for doc in docs]
    if top_n is None:
        top_n = len(documents_texts)
    # The v2 rerank response carries (index, relevance_score) pairs, so the
    # original documents are looked up by index below.
    rerank_result = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=documents_texts,
        top_n=top_n,
    )
    reranked_docs_with_scores = []
    for res in rerank_result.results:
        idx = res.index
        if idx < len(docs):
            doc = docs[idx]
            score = res.relevance_score
            reranked_docs_with_scores.append((doc, score))
    # For the same file name, retain only the document with the highest score
    unique_docs = {}
    for doc, score in reranked_docs_with_scores:
        file_key = doc.name  # Chunks sharing a file name belong to the same file
        if file_key not in unique_docs or score > unique_docs[file_key][1]:
            unique_docs[file_key] = (doc, score)
    # Sort in descending order of score and extract the top 5 documents
    unique_docs_list = list(unique_docs.values())
    unique_docs_list.sort(key=lambda x: x[1], reverse=True)
    final_docs = [doc for doc, score in unique_docs_list[:5]]
    print("=== Top 5 Documents ===")
    for doc, score in unique_docs_list[:5]:
        print(doc.name)
        print("Content:", doc.content[:600])
    return final_docs
```
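For completeness, a single query runs end to end roughly like this (simplified; I'm assuming `agent.knowledge.search` as the candidate-recall call, and the query text and candidate count here are illustrative):

```python
# End-to-end flow for one staff query (values illustrative).
query = "How do I process a return without a receipt?"
candidates = agent.knowledge.search(query=query, num_documents=20)  # hybrid recall
top5 = rerank_with_cohere(query, candidates)  # Cohere re-rank + per-file dedupe
```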
Current Issues:
- Low Accuracy: Overall accuracy is below 50% (16/33 queries correct; see the recall@5 sketch after this list).
- Misaligned Chunks: The correct document sometimes appears in the results, but the retrieved chunk is slightly off from the intended passage, which makes me suspect some hits are coincidental.
- Poor CSV Performance: CSV search accuracy is extremely low (0%), and results often reference completely irrelevant rows.
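For reference, the check behind these numbers is essentially a recall@5 loop like the one below (simplified sketch; `eval_pairs` is loaded from the spreadsheet, and `retrieve_top5` stands in for the search-plus-rerank pipeline shown above):

```python
# Recall@5 evaluation loop (sketch; helper names are placeholders).
def evaluate(eval_pairs, retrieve_top5):
    """eval_pairs: list of (query, expected_file_name) rows from the spreadsheet."""
    hits = 0
    for query, expected_file in eval_pairs:
        top5_names = [doc.name for doc in retrieve_top5(query)]
        if expected_file in top5_names:
            hits += 1
    print(f"recall@5: {hits}/{len(eval_pairs)} = {hits / len(eval_pairs):.1%}")
    return hits / len(eval_pairs)
```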
Potential Solutions under Consideration:
- Evaluating alternative chunking methods, particularly Document, Semantic, and Recursive, since Fixed Size and Agentic both reproduced the issues above (see the first sketch after this list).
- Testing different embedding models available via OpenAI (also shown in that sketch).
- Improving data quality through thorough data cleaning (currently underway).
- Possibly summarizing content at the page level prior to embedding, to reduce noise and enhance relevance (see the second sketch below).
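To make the first two bullets concrete, the variants I plan to A/B test look roughly like this (module paths and constructor parameters are my assumptions about Agno's layout; please correct me if they differ):

```python
# Candidate chunking/embedding variants to A/B test (paths and params assumed).
from agno.document.chunking.document import DocumentChunking
from agno.document.chunking.recursive import RecursiveChunking
from agno.document.chunking.semantic import SemanticChunking
from agno.embedder.openai import OpenAIEmbedder

variants = {
    "document": DocumentChunking(),
    "semantic": SemanticChunking(),
    "recursive": RecursiveChunking(chunk_size=1000, overlap=100),
}
embedder_large = OpenAIEmbedder(id="text-embedding-3-large")  # higher-dimension axis

# Each variant gets its own table, reusing PDFKnowledgeBase / PgVector /
# DATABASE_URL from the snippet above; the 33-query sheet is then rerun per variant.
pdf_kb_recursive = PDFKnowledgeBase(
    path="/path/to/your/pdf_files_directory",
    vector_db=PgVector(table_name="pdf_documents_recursive", db_url=DATABASE_URL),
    chunking_strategy=variants["recursive"],
)
```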
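For the last bullet, the page-level pre-summarization I have in mind is roughly the following (sketch only; `summarize_page` is a hypothetical helper, and its output, rather than the raw slide fragments, would be what gets embedded):

```python
from openai import OpenAI

client = OpenAI()

def summarize_page(page_text: str) -> str:
    """Condense one slide page's fragmented bullets into a dense, searchable paragraph."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # a cheaper model keeps the one-off indexing cost down
        messages=[
            {
                "role": "system",
                "content": (
                    "Summarize this slide into one self-contained paragraph, "
                    "keeping all product names, numbers, and procedures."
                ),
            },
            {"role": "user", "content": page_text},
        ],
    )
    return response.choices[0].message.content
```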
Given these points, I’d like your advice on:
- The effectiveness of each chunking strategy, specifically for slide-based PDFs and tabular CSV data.
- Recommended embedding models or methods particularly suitable for fragmented content (e.g., slides).
- Any experience or recommendations regarding pre-summarization of content as a preprocessing step.
- General best practices or alternative approaches to enhance overall retrieval accuracy, particularly for CSV data.
Your guidance would be greatly appreciated. Thank you very much for your support.