Techniques to Speed Up Ingesting 3 Million Records

Kenniferm · June 9, 2025, 8:43am

Hi agno Team,

I’m in the process of loading a JSON dataset of roughly 3 million documents into our PostgreSQL/vector store via JSONKnowledgeBase and PgVector. At this scale, the ingestion is taking much longer than is practical.

If you have any methods, best practices, or configuration tweaks that can significantly accelerate this bulk import—whether through parallelization, specialized bulk‐load paths, client settings, or other optimizations—I’d greatly appreciate any pointers or examples you can share.

Thank you for your help!

Kenniferm · June 10, 2025, 1:09am

@Monali
Hello Monali, would it be possible for you to transfer this ticket to your engineering team? Thanks in advance for your assistance.

Monali · June 10, 2025, 6:23am

Hey @Kenniferm, thanks for reaching out and supporting Agno. I’ve shared this with the team. We’re working through all requests one by one and will get back to you soon.
If it’s urgent, please let us know. We appreciate your patience!

mustafa · June 10, 2025, 3:20pm

Hi @Kenniferm! The simplest speed-up would be to use async ingestion. Here is a small example snippet:

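    import asyncio

    from agno.knowledge.json import JSONKnowledgeBase
    from agno.vectordb.pgvector import PgVector, SearchType


    async def main():
        pg = PgVector(
            table_name="big_corpus",
            db_url="",  # your PostgreSQL connection string
            search_type=SearchType.vector,
            vector_index=None,  # no vector index configured for the bulk load
        )

        kb = JSONKnowledgeBase(
            path="/data/large_corpus/",  # directory containing the JSON files
            vector_db=pg,
            num_documents=5,  # documents returned per search, not an ingestion setting
        )

        # aload is the async counterpart of load
        await kb.aload(
            recreate=True,  # drop and recreate the table before loading
            skip_existing=True,  # skip documents that are already in the table
        )


    if __name__ == "__main__":
        asyncio.run(main())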

Hope this helps!

Kenniferm · June 11, 2025, 3:15am

@mustafa @Monali

Hi,

Frankly, it feels like you’re just running ChatGPT on autopilot and tossing out examples without ever verifying they work. It’s painfully obvious there is no async_read method in your JSONReader, so your “async ingestion” snippet is fundamentally broken. Could you point out a clear solution, please?

mustafa · June 11, 2025, 11:38am

Hi @Kenniferm! I verified in the codebase, as well as with the team, that we do have “async ingestion”. You can refer to this example: agno/cookbook/agent_concepts/knowledge/json_kb_async.py at main · agno-agi/agno · GitHub

Another general tip: try updating to the latest version of agno and then running this example.

You could combine this with the techniques I mentioned above to speed up the ingestion!
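
On the parallelization part of the question, here is a rough, library-agnostic sketch of batching records and keeping only a few batches in flight at once. It is not an agno API: `ingest_concurrently`, `batched`, and the `insert_batch` callback are hypothetical names, and `insert_batch` stands in for whatever async call actually writes one batch to your store. Treat the batch size and concurrency limit as starting points to tune, not recommendations.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, Iterator, List


def batched(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `size` records."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


async def ingest_concurrently(
    records: Iterable[dict],
    insert_batch: Callable[[List[dict]], Awaitable[None]],  # hypothetical writer
    batch_size: int = 1000,
    max_in_flight: int = 4,
) -> int:
    """Insert records in batches, with at most `max_in_flight` batches running."""
    semaphore = asyncio.Semaphore(max_in_flight)
    tasks = []

    async def run(batch: List[dict]) -> int:
        try:
            await insert_batch(batch)
            return len(batch)
        finally:
            # Free a slot so the producer loop can schedule the next batch.
            semaphore.release()

    for batch in batched(records, batch_size):
        # Wait for a free slot before reading further, which bounds memory use.
        await semaphore.acquire()
        tasks.append(asyncio.create_task(run(batch)))

    # Propagates the first exception raised by any batch, if there is one.
    return sum(await asyncio.gather(*tasks))
```

You would call it as `await ingest_concurrently(my_records, my_insert_batch)` from inside an async function, where `my_records` can be a generator that streams the JSON documents so the full 3 million never sit in memory together. Whether this actually helps depends on where the time goes: if embedding calls dominate, concurrency on the client side usually matters more than the database writes.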
