Token overflow error for DocumentChunking on Japanese PDF and CSV documents

Hello Agno Support Team,

I am experiencing issues when running a search agent that integrates PDF and CSV documents using Agno’s knowledge bases with DocumentChunking. When DocumentChunking is applied, I receive token-limit errors as well as SQL insertion errors.
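For context, my setup is roughly the sketch below. The paths, table names, and connection string are simplified placeholders, and the class names are my best reading of the standard Agno knowledge-base API, so please treat it as an approximation of my actual code:

from agno.document.chunking.document import DocumentChunking
from agno.embedder.openai import OpenAIEmbedder
from agno.knowledge.combined import CombinedKnowledgeBase
from agno.knowledge.csv import CSVKnowledgeBase
from agno.knowledge.pdf import PDFKnowledgeBase
from agno.vectordb.pgvector import PgVector

db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"  # placeholder connection string

# The embedder whose 8192-token context limit shows up in the error below
embedder = OpenAIEmbedder(id="text-embedding-ada-002")

pdf_kb = PDFKnowledgeBase(
    path="data/pdfs",  # placeholder directory of Japanese PDFs
    vector_db=PgVector(table_name="pdf_documents", db_url=db_url, embedder=embedder),
    chunking_strategy=DocumentChunking(),
)

csv_kb = CSVKnowledgeBase(
    path="data/filtered_qast.csv",  # the CSV named in the error below
    vector_db=PgVector(table_name="csv_documents", db_url=db_url, embedder=embedder),
    chunking_strategy=DocumentChunking(),
)

knowledge_base = CombinedKnowledgeBase(
    sources=[pdf_kb, csv_kb],
    vector_db=PgVector(
        table_name="combined_documents_document_chunking",  # table from the SQL error below
        db_url=db_url,
        embedder=embedder,
    ),
)

knowledge_base.load(recreate=False)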

Token Overflow Error
The system produces the following error when processing documents (e.g., a CSV file named ‘filtered_qast’):

Error processing document 'filtered_qast': Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 5390428 tokens (5390428 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

It appears that the chunking mechanism is generating document chunks that far exceed the model’s maximum context length, causing the API request to fail.
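To narrow this down, a rough diagnostic like the sketch below can flag oversized chunks before any API call is made. It assumes tiktoken’s cl100k_base encoding (the tokenizer used by text-embedding-ada-002) and reads the CSV as a single raw document, which is my understanding of how the knowledge base sees it:

import tiktoken
from agno.document import Document
from agno.document.chunking.document import DocumentChunking

# cl100k_base is the tokenizer used by text-embedding-ada-002
encoding = tiktoken.get_encoding("cl100k_base")

# Read the raw CSV as one document and chunk it the same way the knowledge base would
with open("filtered_qast.csv", encoding="utf-8") as f:
    doc = Document(name="filtered_qast", content=f.read())

for chunk in DocumentChunking().chunk(doc):
    n_tokens = len(encoding.encode(chunk.content))
    if n_tokens > 8191:
        print(f"Oversized chunk: {n_tokens} tokens (embedder limit is 8191)")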

SQL Insertion Error
Additionally, I encounter SQL errors such as:

ERROR Error with batch starting at index 0: (psycopg.errors.NotNullViolation) null value in column "id" of relation "combined_documents_document_chunking" violates not-null constraint

This suggests that when the combined document chunks are inserted into the vector database, a primary key value is missing. I am using text-embedding-ada-002 as the embedding model.
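One workaround I have been considering, though I have not verified that it avoids the constraint violation, is to wrap DocumentChunking so that every chunk carries an explicit id before insertion:

import uuid
from typing import List

from agno.document import Document
from agno.document.chunking.document import DocumentChunking

class IdAssigningDocumentChunking(DocumentChunking):
    """DocumentChunking, but guarantees every chunk has a non-null id."""

    def chunk(self, document: Document) -> List[Document]:
        chunks = super().chunk(document)
        for c in chunks:
            if not c.id:
                c.id = str(uuid.uuid4())
        return chunks

If the table’s primary key is derived from Document.id, passing chunking_strategy=IdAssigningDocumentChunking() to the knowledge bases should prevent the null value, but please correct me if the id is supposed to be generated elsewhere.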

CSV Example for Reference
To provide additional context, here is an example of one of the CSV files I am using:

question,answer,qa_status,resolved,title,comment_datetime
"このシステムの基本的な機能は何ですか?","このシステムは、全スタッフが情報を共有し、知識を蓄積するためのツールです。",解決済,◯,"システム機能について",2023-03-01 10:00:00
"業務効率を向上させるための改善点は何ですか?","業務プロセスの見直しと手順の標準化により、作業効率が向上しました。",解決済,◯,"業務改善提案",2023-03-02 11:30:00
"新しいツール導入後の効果はどのように評価されていますか?","導入後、作業時間が短縮され、生産性の向上が確認されています。",解決済,◯,"ツール導入効果",2023-03-03 14:45:00

I’ve also tried semantic chunking with embedding models such as “text-embedding-3-small” and “text-embedding-3-large”, but it still raises a warning: “UserWarning: Text has 9843 tokens which exceeds the model’s limit of 8191. It will be truncated.”
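For reference, the semantic chunking attempt looked roughly like this (parameter values are illustrative, not my exact configuration):

from agno.document.chunking.semantic import SemanticChunking
from agno.embedder.openai import OpenAIEmbedder

# Swapping in text-embedding-3-large here produces the same truncation warning
semantic_chunking = SemanticChunking(
    embedder=OpenAIEmbedder(id="text-embedding-3-small"),
    chunk_size=5000,
)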

Given these errors, I would appreciate any guidance on the following:

  • How to configure or modify the DocumentChunking strategy to prevent generating chunks that exceed the model’s token limit.
  • Recommended best practices for processing CSV documents to avoid similar issues.
  • Any potential workarounds for the SQL insert error regarding the missing primary key values.

Thank you very much for your assistance. I look forward to your guidance.

@Monali Hello Monali, could you please assist us with this matter? It’s a bit urgent. Apologies for rushing you, and thank you in advance.

Hey @Kenniferm, Sure.

We will try to respond as soon as possible.

Hey @Kenniferm! We don’t fully support Japanese characters as of now, so this will most likely not give you the best output. We are working hard to keep improving our SDK and will make a point of adding support for Japanese characters.

Hi @manthanguptaa

Thank you for your response. However, I believe there may have been a misunderstanding regarding the core issues I’ve reported. To clarify, the problems I’m facing don’t appear to be specifically related to Japanese characters but rather to the following:

  1. Token overflow errors: DocumentChunking generates chunks significantly exceeding the token limit (e.g., requesting 5,390,428 tokens against the 8,192-token limit).
  2. SQL insertion errors: Null primary key values (‘id’) are generated when inserting chunks into the vector database.

Could you please provide guidance on:

  • Adjusting DocumentChunking parameters to respect the model’s maximum token limits.
  • Resolving the primary key issue during SQL insertion.
  • Best practices or recommended configurations for processing CSV documents.

I appreciate your assistance and look forward to your clarification.

Best regards,

@manthanguptaa
Hi, could you please help with this? It is an urgent matter. As far as I can tell, Agno’s lack of Japanese support doesn’t explain the token overflow errors. I’d really appreciate your proper support on this.

Hey @Kenniferm! Can you please try other chunking strategies and let us know if you are still facing the issue?
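For example, FixedSizeChunking or RecursiveChunking with a smaller chunk size may keep the chunks under your embedder’s limit. A rough sketch (adjust chunk_size and overlap to your data):

from agno.document.chunking.fixed import FixedSizeChunking
from agno.document.chunking.recursive import RecursiveChunking

# Keep chunks small enough to stay comfortably under the 8191-token embedding limit
fixed_chunking = FixedSizeChunking(chunk_size=3000, overlap=200)
recursive_chunking = RecursiveChunking(chunk_size=3000, overlap=200)

Then pass either one as chunking_strategy to your knowledge bases and reload them.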