How We Optimized an AI WhatsApp Chatbot to Use 40x Fewer Tokens
A real production case study with lessons we wish we knew earlier

Sven didn’t start as an experiment. It started from a real operational need.
At Publyo, we work with thousands of publishers and run dozens of campaigns in parallel. Clients expect fast answers: pricing, recommendations, and full media plans tailored to their budget. Most of these requests happen outside working hours.
That’s how the idea of a WhatsApp AI assistant was born.
An account manager that never sleeps, never delays, and never forgets.
In theory, everything was simple. In practice, things became interesting right after launch.
First signs something was wrong
After the first week in production, Sven was working exactly as intended:
• accurate responses
• relevant media plans
• natural conversations
But behind the scenes, costs were growing in a way that didn’t make sense.
We didn’t have high traffic. Fewer than 1,000 conversations.
Yet token consumption was extremely high.
Not slightly over expectations.
But the kind of gap where you immediately know: something is fundamentally wrong.
The core mistake: treating AI like a black box
The initial architecture was based on a simple assumption:
The more context you provide, the better the model performs.
So we gave it everything.
Sven was running on Claude Sonnet, and for every single message it received:
• full system instructions
• the entire publisher database
• the user message
The problem was the size of the dataset.
Over 3000 websites, each containing:
• pricing
• categories
• SEO metrics
• special placements
In total: ~222,000 characters
≈ 75,000 tokens per request
Even for a simple “Hi”.
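The character-to-token ratio above can be turned into a quick budgeting heuristic. This is a rough sketch, not a real tokenizer: the ~3 characters-per-token ratio is derived from the article's own numbers (222,000 characters ≈ 75,000 tokens) and varies by model and language.

```python
# Rough token estimate from character count. This is a heuristic only:
# the ~3 chars/token ratio comes from this article's own measurements
# (222,000 chars ≈ 75,000 tokens); real tokenizers vary by language.
CHARS_PER_TOKEN = 3

def estimate_tokens(text: str) -> int:
    """Ballpark token count for cost budgeting, not an exact tokenizer."""
    return len(text) // CHARS_PER_TOKEN

# Simulate the original per-message payload: full database + instructions.
prompt_chars = 222_000
print(estimate_tokens("x" * prompt_chars))  # 74000
```

A sanity check like this, run before launch, would have flagged the problem: every "Hi" was carrying a ~74,000-token payload.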
When we calculated the impact, the picture became clear:
tens of millions of tokens consumed in just one week.
Not because users were writing too much.
But because we were sending too much.
The turning point: stop sending data, start accessing it
This was the key architectural shift.
Instead of treating the prompt as a container for all knowledge, we treated it as an entry point.
We removed the entire database from the prompt and introduced a tool-calling layer (MCP-based architecture).
In practice:
• the model does NOT receive all data upfront
• it requests information only when needed
• the system returns only relevant results
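The server side of that tool-calling layer can be sketched as a search function the model invokes on demand. Everything below is illustrative, not Publyo's actual schema: the `Publisher` fields, the `search_publishers` name, and the sample records are assumptions, showing only the pattern of returning a small relevant slice instead of the full dataset.

```python
# Hypothetical tool handler: instead of inlining 3,000+ publisher records
# into every prompt, the model calls this tool and receives only matches.
# All names, fields, and records here are illustrative placeholders.
from dataclasses import dataclass, asdict

@dataclass
class Publisher:
    domain: str
    category: str
    price_eur: int

PUBLISHERS = [
    Publisher("tech-daily.example", "tech", 120),
    Publisher("finance-hub.example", "finance", 300),
    Publisher("tech-weekly.example", "tech", 90),
]

def search_publishers(category: str, max_price: int, limit: int = 5) -> list[dict]:
    """Return only the records relevant to the model's current query."""
    matches = [
        p for p in PUBLISHERS
        if p.category == category and p.price_eur <= max_price
    ]
    matches.sort(key=lambda p: p.price_eur)  # cheapest first
    return [asdict(p) for p in matches[:limit]]

print(search_publishers("tech", max_price=100))
# [{'domain': 'tech-weekly.example', 'category': 'tech', 'price_eur': 90}]
```

The design point: the model's context now grows with the size of the *answer*, not the size of the database.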
The impact was immediate.
Instructions dropped from 222,000 characters to ~2,800.
A simple message no longer carries unnecessary overhead.
The paradox: less context, same quality
We expected a drop in quality.
It never happened.
Claude Sonnet continued to deliver:
• accurate recommendations
• correct media plans
• natural responses
In fact, the behavior improved.
Instead of guessing from a huge dataset, the model:
• understood the intent
• requested only relevant data
• built responses on clean, minimal context
Attempting full cost removal: local models
After optimizing the architecture, we tested whether we could remove cloud dependency entirely.
We experimented with multiple local models.
Gemma 2 27B
• strong reasoning capability
• 20+ second latency per response
• unpredictable behavior in production
Qwen 2.5 14B
• strong tool calling
• weak Romanian/European language performance
• unnatural user experience
Gemma 3 27B
• decent language quality
• heavy resource usage
• difficult to scale
The conclusion was clear
Local models are not yet ready for production systems requiring:
• natural language fluency
• precise tool calling
• fast response times
The lesson: cheaper doesn’t mean better
We also tested Claude Haiku.
On paper, it looked like a great cost optimization.
In practice:
• it ignored important parameters
• produced inconsistent outputs
• reduced media plan quality
That’s when we accepted a simple truth:
The model doesn’t need to be cheaper. The system needs to be more efficient.
Hidden optimizations that made a huge difference
After the architecture change, we refined multiple layers:
• drastically reduced system prompt size
• limited output length per response
• introduced prompt caching
• added real-time usage monitoring
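The last item on that list, usage monitoring, can be as simple as an accumulator that converts token counts into money per request. This is a minimal sketch: the class name and the per-million-token rates are placeholders, not the team's actual tooling or current provider pricing.

```python
# Minimal real-time usage monitor, a sketch of the idea above.
# The class and the prices are illustrative; plug in your provider's
# actual per-million-token rates.
class UsageMonitor:
    def __init__(self, input_price_per_mtok: float, output_price_per_mtok: float):
        self.input_tokens = 0
        self.output_tokens = 0
        self.in_price = input_price_per_mtok
        self.out_price = output_price_per_mtok

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Call after every model response with the reported usage counts."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    @property
    def cost_usd(self) -> float:
        return (self.input_tokens * self.in_price
                + self.output_tokens * self.out_price) / 1_000_000

monitor = UsageMonitor(input_price_per_mtok=3.0, output_price_per_mtok=15.0)
monitor.record(input_tokens=950, output_tokens=400)  # one optimized request
print(round(monitor.cost_usd, 5))  # 0.00885
```

Tracking cost per conversation from day one is what turns "the bill looks wrong" into a same-day alert instead of a week-end surprise.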
Each improvement looked small individually.
Together, they completely changed the system's efficiency.
Final results
Same conversations. Same logic. Same user experience.
But now:
• 40x lower token consumption
• predictable operational costs
• improved performance
Without compromising quality.
Key lessons learned
1. Never treat prompts like a database
It is the most expensive place to store information.
2. Use tools for dynamic or large datasets
This is the difference between scalable systems and expensive failures.
3. Real-world testing beats benchmarks
Benchmarks don’t reflect production behavior.
4. Monitor from day one
Cost issues scale faster than expected.
5. Architecture matters more than model choice
Optimization starts with system design, not model selection.
Final thought
Looking back, the biggest shift was not technical.
It was conceptual.
We moved from:
“How do we make AI know more?”
to:
“How do we make AI use what it knows efficiently?”
And from that moment on, everything became simpler — and significantly cheaper.

Cristian Ionita
Founder & CEO
Founder and CEO of Oblyo Digital, with over 10 years of experience in digital marketing and technology. He built Publyo from scratch, turning a simple idea into a complete ecosystem of SaaS platforms for content marketing. Passionate about automation, APIs, and making technology accessible to everyone.
