LLUMO AI is a SaaS tool that evaluates, debugs and optimises LLM workflows at scale. In an interview with BrandSutra, Co-Founder Akshat Anand tells us how industries with zero tolerance for slip-ups are raising the bar for the accuracy and reliability of AI.
He also shares his view of where LLM workflows are headed next and offers advice for building in this whirlwind space.
Edited excerpts…
Take us back — how did the idea for LLUMO AI come about?
To understand LLUMO, you have to rewind to before ChatGPT blew up. Back then, we were building Instaminutes—a conversational intelligence tool that generated meeting summaries, action items, and insights. We were doing this in the pre-GPT era, when good language models were rare.
Things were going well—we had a stable business making over $10k in MRR. But then came GPT and dozens of LLMs. Overnight, our tech moat vanished. Suddenly, everyone could do decent summaries and action items. The market got flooded with “good enough” options.
Our sales team struggled, not because our product was bad, but because we couldn't prove it was better. We searched for ways to benchmark outputs, but traditional metrics like BLEU need a ground truth, which doesn't exist for open-ended tasks like meeting summaries.
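(Editor's note: for readers unfamiliar with the constraint Anand describes, here is a minimal illustration. Reference-based metrics such as BLEU can only score a candidate against a known "correct" answer; the example sentences below are invented, and for an open-ended task like a sales-call summary no such reference exists.)

```python
# BLEU scores a hypothesis against a known-good reference.
# For open-ended tasks like meeting summaries, no such reference exists.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "team", "agreed", "to", "ship", "on", "friday"]]
hypothesis = ["shipping", "was", "agreed", "for", "friday"]

# Without a trusted reference, this number is meaningless --
# and there is no single "right" summary to use as one.
score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```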
That’s when it hit us: there were no good ways to evaluate LLMs in production. And it wasn’t just our problem—every AI team we spoke to was flying blind. So we pivoted and built LLUMO: a way to evaluate, debug and optimise LLM workflows at scale.
How have you seen AI evolve in the last few years?
It’s been wild. A couple of years ago, it was all about fine-tuning—people took BERT or GPT-2, added data, and got a niche model. Today, it’s about composability. Instead of one prompt to one model, people are building LLM pipelines: retrieval, reasoning, function calling—all in one workflow.
That’s powerful, but chaotic. Each step can fail differently, so observability and debugging are huge pain points now. Another shift is the explosion of open-source models—teams aren’t just defaulting to OpenAI anymore. They’re mixing APIs, open weights, and private fine-tunes. So while we’ve moved from playgrounds to production, the tools to monitor these systems are lagging—that’s where LLUMO fits in.
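(Editor's note: a minimal sketch of the composability and observability pattern Anand describes. Everything here is hypothetical; the step functions are stubs standing in for real retrieval, reasoning and function-calling calls, and the tracing simply prints where a production system would emit to an observability backend.)

```python
import time
from typing import Any, Callable

def traced(step: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Wrap a pipeline step so its input, output, and latency are recorded."""
    def wrapper(payload: Any) -> Any:
        start = time.perf_counter()
        result = step(payload)
        elapsed = time.perf_counter() - start
        # A real system would ship this trace to a monitoring backend.
        print(f"[trace] {step.__name__}: {elapsed:.4f}s "
              f"in={payload!r} out={result!r}")
        return result
    return wrapper

# Stub steps standing in for real LLM and retrieval calls.
@traced
def retrieve(query: str) -> str:
    return f"docs for '{query}'"

@traced
def reason(context: str) -> str:
    return f"answer based on {context}"

@traced
def call_function(plan: str) -> str:
    return f"executed: {plan}"

def pipeline(query: str) -> str:
    # Each step can fail differently, so each is traced independently.
    return call_function(reason(retrieve(query)))

print(pipeline("Q3 revenue"))
```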
Which industries are your biggest focus and how has your GTM evolved?
At first, we thought LLM evaluation was a horizontal problem. Technically, anyone with LLMs needs it. But urgency was highest in industries where AI errors cost big—legal, finance, and healthcare. A hallucination in a legal clause can mean hours of rework or real legal risk. In healthcare, bad output can put patient safety at risk.
Conversation intelligence is big for us too, since we came from that world. Sales teams told us, “Our model sometimes makes up objections that were never said!” That’s when they realised they needed a scoring layer.
So our go-to-market shifted. We started dev-first — simple SDK, plug in and see your scores. But enterprise teams with complex stacks wanted more. So now, we do more consultative onboarding: we look at their LLM stack and help them plug LLUMO into the right checkpoints. It’s not just software—it’s reliability as a service.
What was the hardest part about educating the market?
One challenge was clarifying what “evaluation” even means for LLMs. People thought it was about academic benchmarks—BLEU, ROUGE—which all need ground truths. But in real life, you don’t always have that. What’s the “right” summary of a sales call? There isn’t one.
So we reframed it: LLM evaluation is like a code review for AI outputs. It’s about relevance, factuality, tone and completeness. We open-sourced recipes for tasks like summarisation or chat QA, so teams could see it in action. That made it real—they could see their model slip up and fix it faster.
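(Editor's note: a sketch of what such a "code review" layer might look like when no ground truth exists: score each output against a rubric rather than a reference. The heuristics below are deliberately naive placeholders; in practice each dimension would be scored by a judge model or a calibrated classifier.)

```python
from dataclasses import dataclass

@dataclass
class Scores:
    relevance: float      # does the output stay on the source's topic?
    factuality: float     # is the output grounded in the source?
    tone: float           # crude proxy: no shouting or filler
    completeness: float   # how much of the source is covered?

def review_output(source: str, output: str) -> Scores:
    """Rubric-style review of an LLM output, no reference answer needed.
    These heuristics are naive stand-ins: relevance and factuality both
    collapse to word-level grounding here, which a real judge model
    would separate."""
    src = set(source.lower().split())
    out = set(output.lower().split())
    grounded = len(out & src) / max(len(out), 1)
    covered = len(out & src) / max(len(src), 1)
    shouty = output.count("!") / max(len(output.split()), 1)
    return Scores(
        relevance=grounded,
        factuality=grounded,
        tone=max(0.0, 1.0 - 5 * shouty),
        completeness=covered,
    )

call = "client asked about pricing and raised a timeline objection"
summary = "client raised a timeline objection and asked about pricing"
print(review_output(call, summary))
```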
What AI trends will define the next few years?
I think we’ll see more multi-agent workflows. Instead of one giant model, people will chain specialised agents for extraction, validation, and planning. But more agents mean more points of failure—so observability will be key.
Also, smaller, domain-specific models are getting popular. Instead of just using the biggest LLM, teams fine-tune lighter models on their own data and often outperform bigger ones. And LLMOps is maturing: CI/CD, prompt versioning, and eval gates will become standard practice.
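(Editor's note: a sketch of what an eval gate inside CI could look like: run an eval suite, fail the build when any metric regresses past a threshold. The metric names, scores and thresholds here are all invented for illustration.)

```python
import sys

# Hypothetical results from an eval run over a fixed prompt suite.
eval_results = {"relevance": 0.91, "factuality": 0.84, "completeness": 0.88}

# Thresholds a change must clear before it can ship (values invented).
thresholds = {"relevance": 0.90, "factuality": 0.85, "completeness": 0.80}

failures = [
    f"{metric}: {eval_results[metric]:.2f} < {minimum:.2f}"
    for metric, minimum in thresholds.items()
    if eval_results[metric] < minimum
]

if failures:
    # A non-zero exit code fails the CI job, blocking the deploy.
    print("Eval gate failed:", "; ".join(failures))
    sys.exit(1)
print("Eval gate passed.")
```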
One piece of advice for AI founders?
Don’t just chase what’s shiny—solve a painful problem. We felt the pain ourselves when we couldn’t prove our old product was better. That gave us the conviction to pivot.
And remember—AI moves fast, but trust builds slow. Be the partner that helps your customers ship better and sleep better. That’s what lasts.