We're working on a project to make video content searchable and conversational, and I wanted to share some of the technical challenges we've hit so far. The idea is to have a multi-agent LLM system that can "watch" a video, understand the context, and make it fully interactive. The biggest hurdle hasn't been the LLM itself, but the orchestration and data pipeline around it. For example:

Cost Control: A one-hour lecture can generate a massive number of tokens for transcription, summarization, and indexing. We've had to get creative with prompt engineering and chunking to keep our OpenAI/Anthropic costs from spiraling out of control (rough chunking sketch below).

State Management: How do you maintain conversational context across a 10-hour conference? You can't just feed the entire transcript back into the LLM for every query. We've had to design a system that intelligently retrieves and summarizes key segments based on the user's question before it ever reaches the final LLM (second sketch below).

Latency: The "multi-agent" part means multiple API calls. Keeping the user experience fast and responsive while the system is working in the background has been a major engineering challenge. We're running into issues where a complex user query triggers a chain of sub-queries, and it's a constant battle to optimize for speed (third sketch below).

I'd love to hear from anyone else who's building in this space. What are your biggest technical bottlenecks, and what solutions have you found for managing costs, state, and latency with complex LLM applications?
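For anyone who wants something concrete, here are three deliberately simplified sketches of what I mean. None of this is our production code; function names, models, and numbers are placeholders. First, cost control: splitting a transcript into overlapping, token-bounded chunks so each summarization/indexing call stays small. This assumes tiktoken for token counting, and the 800/100 limits are arbitrary.

```python
# Rough sketch: split a long transcript into overlapping, token-bounded chunks
# so each summarization/indexing call stays well under the model's context
# window (and budget). Numbers are placeholders, not tuned values.
import tiktoken

def chunk_transcript(text: str, max_tokens: int = 800, overlap: int = 100) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(enc.decode(window))
        if start + max_tokens >= len(tokens):
            break
    return chunks
```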
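Second, state management: instead of replaying the whole transcript, embed the chunks once at index time and pull only the segments relevant to the current question before building the prompt. embed() here is a stand-in for whatever embedding call you use; in our pipeline the selected segments would also get summarized before the final model sees them.

```python
# Rough sketch: pick only the transcript chunks relevant to the user's question,
# then build a compact context instead of feeding the whole transcript back in.
# embed() is a placeholder for an embeddings API call; chunk_vecs would be
# precomputed at index time, one row per chunk.
import numpy as np

def top_k_segments(question: str, chunks: list[str],
                   chunk_vecs: np.ndarray, embed, k: int = 5) -> list[str]:
    q = np.asarray(embed(question), dtype=float)
    # cosine similarity between the question and every chunk
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    best = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in best]

def build_context(question: str, chunks: list[str],
                  chunk_vecs: np.ndarray, embed) -> str:
    segments = top_k_segments(question, chunks, chunk_vecs, embed)
    # In practice these segments would then be summarized/compressed
    # before being passed to the final answering model.
    return "\n---\n".join(segments)
```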
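Third, latency: where the sub-queries a planner produces are actually independent, firing them concurrently bounds the wait at the slowest call rather than the sum. call_llm() is a placeholder for an async client call; truly chained sub-queries (where one depends on another's output) obviously can't be collapsed this way.

```python
# Rough sketch: run independent sub-queries concurrently so total latency is
# roughly the slowest call, not the sum of all of them.
import asyncio

async def call_llm(prompt: str) -> str:
    # placeholder: swap in your actual async OpenAI/Anthropic client call
    await asyncio.sleep(0.5)
    return f"answer to: {prompt}"

async def answer_complex_query(sub_queries: list[str]) -> list[str]:
    # one task per sub-query, awaited together
    return await asyncio.gather(*(call_llm(q) for q in sub_queries))

# usage:
# results = asyncio.run(answer_complex_query(["who is speaking at 1:02:00?",
#                                             "summarize the Q&A session"]))
```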