Most SME "chatbot" projects fail the same way.

You install a chatbot platform, paste in your FAQs, set a welcome message, and ship it. Six weeks later, conversion is flat, the team is still copy-pasting the same replies into WhatsApp, and someone on the finance call asks whether the £49/mo was worth it.

The problem isn't the platform. It's the architecture. One bot trying to do everything is the AI equivalent of hiring one person to cover reception, sales, and billing. It kind of works until it doesn't.

What we actually ship

On every inbound automation we've built for UK SMEs in the last nine months — dental groups, estate agencies, accountancy firms, fitness studios — the shape has converged on the same three-agent pattern:

  1. Router. Reads the message, classifies intent, hands off.
  2. Specialist. Does the actual work for that intent.
  3. Escalator. Watches for anything outside scope and gets a human involved.

Nothing exotic. Each agent has a narrow job and a clear tool belt. Together they cover 80–90% of inbound volume reliably — and the handover for the remaining 10–20% is explicit, not "the bot got confused."

Router

The router agent reads the first message and picks one of a small, fixed set of intents. For a dental practice that's:

  • book_appointment
  • cancel_or_reschedule
  • pricing_question
  • clinical_question
  • other

That's it. No free-form classification. The router returns a JSON object: { "intent": "book_appointment", "confidence": 0.91, "extracted": { "preferred_day": "Friday" } }. If confidence is below 0.7, the router hands to the escalator.
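That hand-off logic fits in a few lines. A minimal sketch in Python, assuming the model's raw reply is the JSON object shown above; the intent names match the dental example and the 0.7 floor is the same threshold, but everything else (function and variable names) is illustrative:

```python
import json

CONFIDENCE_FLOOR = 0.7  # below this, hand off to the escalator

INTENTS = {
    "book_appointment", "cancel_or_reschedule",
    "pricing_question", "clinical_question", "other",
}

def route(raw_model_output: str) -> str:
    """Parse the router's JSON reply and decide which agent runs next.

    Returns the name of the next agent, not a reply to the customer.
    """
    result = json.loads(raw_model_output)
    intent = result.get("intent")
    confidence = result.get("confidence", 0.0)
    # Unknown intents and low confidence both go to a human.
    if intent not in INTENTS or confidence < CONFIDENCE_FLOOR:
        return "escalator"
    return intent  # each intent maps to exactly one specialist

print(route('{"intent": "book_appointment", "confidence": 0.91}'))
# → book_appointment
```

The key design choice is that the router never answers the customer itself; it only names the next agent, which keeps its failure mode to exactly one thing (wrong intent).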

Two practical things we've learned:

  • Give the router examples, not rules. Five or six labelled examples per intent in the system prompt outperform two pages of instructions. Claude- and GPT-4-class models both handle this well.
  • Log every classification for the first two weeks. Most of your quality problems are here — a wrong classification poisons everything downstream. We review 100–200 of these together with the client before calling the system "shipped."
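The review log doesn't need to be clever. A sketch of the shape we mean, assuming one JSON line per classification; the file name and field names here are ours, not a fixed format:

```python
import datetime
import json

def log_classification(message: str, result: dict,
                       path: str = "router_log.jsonl") -> None:
    """Append one router decision per line, so the first two weeks of
    classifications can be pulled into a spreadsheet and reviewed
    side by side with the client."""
    entry = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "message": message,
        "intent": result.get("intent"),
        "confidence": result.get("confidence"),
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

One line per decision is enough: you're reviewing 100–200 of these by hand, not building analytics.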

Specialist

Once intent is set, the specialist takes over. It has access to exactly the tools it needs for that intent and nothing else.

For book_appointment in the dental stack, the specialist can:

  • Query the practice calendar (read-only)
  • Propose 3 slots
  • Write a booking back to the calendar
  • Send a confirmation via the practice's own Twilio / SMS gateway

It can't access billing. It can't see other patients' records. It can't answer clinical questions. That scope limitation is the whole point — it's what keeps the agent trustworthy and auditable.

For a multi-intent platform, you end up with 4–5 specialist agents. Don't share state between them unless you have to; the router passes forward only what's relevant.
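One simple way to enforce that scoping is a static intent-to-tools map, so a specialist can only ever see its own tool belt. The tool names below are hypothetical placeholders, not a real API; the point is the mapping, not the implementations:

```python
# Hypothetical tool names for the dental example above.
TOOLBELTS = {
    "book_appointment": ["calendar_read", "calendar_write", "send_sms"],
    "cancel_or_reschedule": ["calendar_read", "calendar_write", "send_sms"],
    "pricing_question": ["price_list_lookup"],
}

def tools_for(intent: str) -> list[str]:
    """Return the only tools the specialist for this intent may use.

    Anything outside the map gets no tools at all, so it has nothing
    to do except fall through to the escalator.
    """
    return TOOLBELTS.get(intent, [])
```

Notice that clinical_question deliberately has no entry: there is no specialist for it, which is how "can't answer clinical questions" gets enforced structurally rather than by prompt wording.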

Escalator

The escalator watches for three signals:

  1. Low confidence from the router or specialist.
  2. Keywords that legally shouldn't go through AI — clinical advice, medication questions, safeguarding, anything sensitive to your sector.
  3. Explicit user requests like "I want to speak to a real person."

On trigger, the escalator writes a clean summary to Slack or the practice WhatsApp: who the person is, what they've said, what the agents tried, and why it's being escalated. The human replies from Slack/WhatsApp; the response goes back to the customer. The agent stays out of the way.

This is the piece most first-time AI projects skip, and it's the reason they feel flaky. The escalator is the safety net that makes the rest of the system shippable.

Why not one giant agent?

Because every extra capability you give an agent increases its failure surface. A router with one job fails in one way (wrong intent). A booking specialist with three tools fails in at most three ways. A single mega-agent with twelve tools fails in combinations you can't enumerate.

There's a second reason: you want different prompts for different jobs. The router benefits from being short, snappy, and classification-focused. The specialist benefits from detailed domain knowledge. The escalator benefits from being paranoid. Jam them into one prompt and you get the worst of all three.

The build

In practice we ship this on n8n or Make for orchestration, Claude Sonnet for the agents, and whatever messaging platform the client already uses. A typical 3-agent stack takes 3–5 days to build and test for a single-location SME, 7–10 days for multi-location with real integrations.

Cost at runtime: £10–£30 per 1,000 conversations on current Anthropic API pricing. That usually pays for itself inside week one.

What we won't do

We won't "train an LLM on your data." That phrase is a red flag from a vendor — modern agents use retrieval, not training. Your FAQs, procedures, and tone samples live in a vector store the agent reads from at query time. Nothing gets baked into the model. You can update it in a spreadsheet, and the next message uses the new version.

We also won't build a single-agent stack, even if the client asks. The 3-agent pattern is more code on day one but dramatically less debugging cost over the lifetime of the system. Ship the boring architecture. Your future self will thank you.


If you're running inbound through a single chatbot that "almost works", we'd audit your setup, and the answer is usually the architecture, not the prompt. When it is, the rebuild is typically a 3–5 day Pro-tier engagement.