The demo is always clean. Agent receives alert, reasons through it, calls the right tool, closes the ticket. Twelve seconds.

Production is not the demo.

What actually breaks

The first thing that surprised me was how often the problem isn’t the LLM — it’s the contract between the agent and the systems around it. Tools that return inconsistent schemas. APIs that time out silently. Upstream data that’s stale by the time the agent reads it.
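To make "contract" concrete, here is a minimal sketch of checking a tool response at the boundary, before the agent ever sees it. The field names, the five-minute freshness budget, and the `check_contract` helper are all assumptions for illustration, not our actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical contract: field names and types are illustrative only.
REQUIRED_FIELDS = {"device_id": str, "status": str, "observed_at": str}
MAX_AGE = timedelta(minutes=5)  # assumed freshness budget


@dataclass
class ContractViolation:
    reason: str  # short machine-readable code
    detail: str  # human-readable context


def check_contract(payload: dict, now: datetime) -> list[ContractViolation]:
    """Return every way the payload breaks the contract (empty list = OK)."""
    violations = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            violations.append(ContractViolation("missing_field", name))
        elif not isinstance(payload[name], expected):
            violations.append(ContractViolation("wrong_type", name))
    if violations:
        return violations  # no point checking freshness on a broken payload

    try:
        observed = datetime.fromisoformat(payload["observed_at"])
    except ValueError:
        return [ContractViolation("bad_timestamp", payload["observed_at"])]
    if observed.tzinfo is None:
        return [ContractViolation("bad_timestamp", "timezone missing")]

    age = now - observed
    if age > MAX_AGE:
        violations.append(ContractViolation("stale_data", f"age={age}"))
    return violations
```

Calling `check_contract(payload, datetime.now(timezone.utc))` on every tool response gives the agent either clean data or a named reason it was rejected, instead of a payload that merely looks plausible.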

In a telco NOC context, this is especially sharp. Network management systems weren’t designed with agents in mind. They were designed for humans who can tolerate ambiguity, ask follow-up questions, and know when to escalate.

Agents don’t know when they don’t know.

The retry trap

Before we added proper observability, our agent was silently retrying a single network query eleven times. We had no idea. From the outside it looked like a slow response. Inside, it was a loop.

This isn’t an LLM problem. It’s a systems design problem. And it’s the kind of thing that only shows up when you move beyond the sandbox.
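One way to keep a retry loop from hiding is to give it a hard budget and log every attempt. The sketch below is an assumption about how that could look, not the wrapper we run: `call_with_budget`, `RetryBudgetExceeded`, the three-attempt limit, and the backoff values are all hypothetical.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("agent.tools")


class RetryBudgetExceeded(Exception):
    """Raised when a tool call has used up its attempts."""

    def __init__(self, tool: str, attempts: int, last_error: Exception):
        super().__init__(f"{tool} failed after {attempts} attempts: {last_error}")
        self.tool = tool
        self.attempts = attempts
        self.last_error = last_error


def call_with_budget(
    tool_name: str,
    fn: Callable[[], T],
    max_attempts: int = 3,      # assumed budget
    base_delay: float = 0.5,    # assumed backoff base, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff, logging each attempt."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
            log.info("tool=%s attempt=%d outcome=ok", tool_name, attempt)
            return result
        except Exception as exc:  # broad on purpose: every failure is logged
            last_error = exc
            log.warning("tool=%s attempt=%d outcome=error error=%r",
                        tool_name, attempt, exc)
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    # Budget spent: fail loudly instead of looping.
    raise RetryBudgetExceeded(tool_name, max_attempts, last_error)
```

With something like this in place, the eleventh retry never happens: the third failure raises, and the log already shows three attempts against the same tool.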

What we changed

Three things made the biggest difference:

  1. Explicit failure modes: every tool now returns a structured error with a reason field the agent can act on, not just an HTTP 500.
  2. Observability first: we instrumented agent steps before we cared about accuracy. You can’t improve what you can’t see.
  3. Escalation as a first-class action: the agent can and should say “I don’t know, here’s what I found, routing to human.” That’s not a failure. That’s the design.
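Points 1 and 3 fit together naturally, so here is a rough sketch of both in one place: a tool result that carries a reason, and a decision step where escalation is a normal return value rather than an exception. The `Reason` values, `ToolResult`, `Escalation`, and `next_action` are hypothetical names for illustration; our production types differ.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Union


class Reason(str, Enum):
    # Illustrative failure reasons; a real system would define its own.
    TIMEOUT = "timeout"
    STALE_DATA = "stale_data"
    NOT_FOUND = "not_found"
    UPSTREAM_ERROR = "upstream_error"


@dataclass
class ToolResult:
    ok: bool
    data: Optional[dict] = None
    reason: Optional[Reason] = None  # set whenever ok is False
    detail: str = ""


@dataclass
class Escalation:
    summary: str
    findings: list = field(default_factory=list)


# Assumption: only timeouts are worth retrying automatically.
RETRYABLE = {Reason.TIMEOUT}


def next_action(
    result: ToolResult, findings: list, retries_left: int
) -> Union[str, Escalation]:
    """Decide what the agent does next: continue, retry, or hand off."""
    if result.ok:
        return "continue"
    if result.reason in RETRYABLE and retries_left > 0:
        return "retry"
    # Anything else goes to a human, along with what the agent learned.
    reason_text = result.reason.value if result.reason else "unknown"
    return Escalation(
        summary=f"Could not proceed: {reason_text} ({result.detail})",
        findings=list(findings),
    )
```

The design choice worth noting is that `Escalation` sits alongside "continue" and "retry" as an ordinary outcome, so the hand-off carries the agent's findings with it instead of arriving as a bare failure.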

The actual lesson

Agentic AI in production is less about the intelligence of the agent and more about the quality of the environment you build around it. The agent is almost never the bottleneck.

If you’re planning to put an agent into a complex operational environment — telco, logistics, finance — spend twice as long on tool contracts and observability as you do on prompt engineering. The prompt is the easy part.