Building Resilient Long-Running AI Agents: A Guide to Durable Sessions

Introduction

As AI agents evolve from simple chatbots into long-running processes that reason, call tools, and maintain context over hours, the traditional HTTP request-response model starts to break down. You have likely seen the symptoms: an agent that drops its state when you switch tabs, or a tool call that times out because the connection was not designed for sustained interaction. This guide walks through the core problem and shows how to implement a durable session layer, the approach used by platforms like Ably, to keep your agents alive, responsive, and in sync across devices.

Source: thenewstack.io

What You Need

Step 1: Recognize HTTP Limitations for Long-Running Agents

HTTP works well for quick, one-shot completions: ask a chatbot a question, get an answer. But when your agent needs to perform dozens of tool calls across multiple reasoning steps, HTTP’s stateless nature becomes a liability. Connections drop, users switch tabs, or they interrupt the agent mid-stream, and the standard request/response flow was not designed for any of these scenarios. Jump to Step 2 for the solution.

As Matthew O’Riordan, CEO of Ably, puts it: “HTTP is exactly what you need to get up and running. But expectations have shifted because we’re all engaging with ChatGPT and Claude.” Users now expect seamless continuity across tabs and devices—something HTTP alone cannot guarantee.

Step 2: Identify Requirements for Durable Sessions

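To solve the HTTP problem, you need a durable session layer. This goes beyond simple streaming. At a minimum, a durable session must cover the capabilities the later steps rely on: presence, persisted session state, automatic reconnection, and sync across tabs and devices.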

The term “durable sessions” was first popularized by EMQX (the MQTT broker) and later by ElectricSQL for AI use cases. It’s preferred over “durable streams” because streams are only one piece of the puzzle.

Step 3: Choose a Durable Session Layer

You can build your own, but it’s complex. Platforms like Ably were originally designed for human collaboration (real-time presence, ordering, reconnection) and now naturally extend to AI agents. See Step 4 for implementation details.

When evaluating a solution, look for:

Step 4: Implement Presence and State Management

Once you have a durable session layer, start by declaring the agent’s presence. For example, when an agent begins a long reasoning task, publish a presence event so all subscribed clients know it’s active. Store intermediate states (e.g., tool call results) in a shared key-value store tied to the session ID. Use the platform’s pub/sub channels to broadcast updates to all listening tabs.


Example flow:

  1. The agent starts a task and announces its presence on the channel agent:session123.
  2. The agent makes a tool call and stores the result in shared state.
  3. If the user closes the tab and opens a new one, the new client subscribes to the same channel, retrieves the current state, and resumes without data loss.
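
The flow above can be sketched with a toy, in-memory stand-in for the session layer. This is only an illustration under stated assumptions: the class InMemorySessionLayer, its method names, and the event names "presence.enter" and "state.update" are hypothetical and do not correspond to Ably's SDK or any other real library. A real platform would also persist this data outside the process.

```python
from collections import defaultdict
from typing import Any, Callable

Listener = Callable[[str, Any], None]


class InMemorySessionLayer:
    """Toy stand-in for a durable session layer (hypothetical, not a real SDK)."""

    def __init__(self) -> None:
        self._presence = defaultdict(set)    # channel -> members currently present
        self._state = defaultdict(dict)      # channel -> shared key-value state
        self._listeners = defaultdict(list)  # channel -> subscribed callbacks

    def enter_presence(self, channel: str, member: str) -> None:
        """Record that a member is active and tell every subscriber."""
        self._presence[channel].add(member)
        self._publish(channel, "presence.enter", member)

    def put_state(self, channel: str, key: str, value: Any) -> None:
        """Store an intermediate result and broadcast the update."""
        self._state[channel][key] = value
        self._publish(channel, "state.update", {key: value})

    def subscribe(self, channel: str, listener: Listener) -> dict:
        """Attach a listener and return a snapshot so a new tab can catch up."""
        self._listeners[channel].append(listener)
        return {
            "presence": sorted(self._presence[channel]),
            "state": dict(self._state[channel]),
        }

    def _publish(self, channel: str, event: str, data: Any) -> None:
        for listener in list(self._listeners[channel]):
            listener(event, data)


layer = InMemorySessionLayer()
channel = "agent:session123"

# Tab A is open while the agent works and receives live events.
tab_a_events = []
layer.subscribe(channel, lambda event, data: tab_a_events.append((event, data)))

layer.enter_presence(channel, "agent")                                    # step 1
layer.put_state(channel, "tool:search", {"status": "done", "hits": 3})    # step 2

# Step 3: the user opens a new tab, which subscribes and reads the snapshot.
snapshot = layer.subscribe(channel, lambda event, data: None)
print(snapshot)
```

The design point is that subscribe returns the current snapshot as well as attaching a listener, so a late-joining tab does not depend on having seen earlier events.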

Step 5: Handle Reconnection and Multi-Device Sync

When a connection drops (e.g., network outage or user switches devices), the durable session layer must automatically reconnect and restore the agent’s context. This is where HTTP’s lack of state hurts most. Your implementation should:

Test with scenarios like: “Open agent in Tab A, start reasoning, switch to Tab B, interrupt agent, go back to Tab A – does state sync?”
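
One common way to restore context after a dropped connection is to number every session event and let each client resume from the last number it applied. The sketch below shows that idea with the Tab A / Tab B scenario. SessionLog, Client, and the serial field are hypothetical names chosen for illustration, and the log lives in memory here, whereas a real session layer would persist it.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class SessionLog:
    """Append-only, ordered history of one session (hypothetical helper)."""

    events: list[tuple[int, Any]] = field(default_factory=list)

    def append(self, payload: Any) -> int:
        serial = len(self.events) + 1  # serials start at 1 and never repeat
        self.events.append((serial, payload))
        return serial

    def since(self, last_serial: int) -> list[tuple[int, Any]]:
        """Return every event the caller has not applied yet, in order."""
        return [event for event in self.events if event[0] > last_serial]


@dataclass
class Client:
    name: str
    last_serial: int = 0
    seen: list[Any] = field(default_factory=list)

    def sync(self, log: SessionLog) -> int:
        """Apply missed events; safe to call again after every reconnect."""
        missed = log.since(self.last_serial)
        for serial, payload in missed:
            self.seen.append(payload)
            self.last_serial = serial
        return len(missed)


log = SessionLog()
tab_a, tab_b = Client("tab_a"), Client("tab_b")

log.append("reasoning step 1")
log.append("tool result")
tab_a.sync(log)                 # Tab A is up to date, then loses its connection

tab_b.sync(log)                 # Tab B opens and catches up from serial 0
log.append("user interrupt")    # the user interrupts the agent from Tab B
log.append("agent stopped")
tab_b.sync(log)

tab_a.sync(log)                 # Tab A reconnects and replays only serials 3 and 4
replayed_again = tab_a.sync(log)  # a second sync applies nothing new
```

Because each client tracks its own last_serial, repeated syncs do not duplicate events, and both tabs end with the same ordered history.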

Step 6: Test and Iterate

Simulate real-world conditions: slow networks, rapid tab switching, mid-stream user interruptions. Use tools like Ably’s debug console or MQTT test clients. Verify that presence updates, state persistence, and reconnection all work without data loss. Iterate based on your findings.
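
Before testing against a real platform, a small simulation can check the resume logic itself. The sketch below is a hypothetical harness, not part of any vendor tooling: run_flaky_session, drop_rate, and the tab names are assumptions made for illustration. It randomly takes clients offline while events are produced, then verifies that every client converges on the full, ordered history once the network recovers.

```python
import random


def run_flaky_session(seed: int, events: int = 50, drop_rate: float = 0.4) -> dict:
    """Simulate two tabs on an unreliable network and return their final views."""
    rng = random.Random(seed)            # seeded so a failing run can be replayed
    log = []                             # authoritative, ordered session history
    cursors = {"tab_a": 0, "tab_b": 0}   # number of events each client has applied
    views = {"tab_a": [], "tab_b": []}   # what each client has actually seen

    def sync(client: str) -> None:
        # Resume from the client's cursor so nothing is skipped or duplicated.
        views[client].extend(log[cursors[client]:])
        cursors[client] = len(log)

    for n in range(events):
        log.append(n)
        for client in cursors:
            if rng.random() >= drop_rate:  # the client happens to be online
                sync(client)

    for client in cursors:                 # the network recovers, everyone resyncs
        sync(client)

    return {"log": log, **views}


for seed in range(20):
    result = run_flaky_session(seed)
    assert result["tab_a"] == result["log"], f"tab_a diverged with seed {seed}"
    assert result["tab_b"] == result["log"], f"tab_b diverged with seed {seed}"
```

To exercise a real session layer, the in-process sync function would be replaced with calls to that layer's client, while the convergence assertions stay the same.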

Tips for Success

By following these steps, you’ll move beyond HTTP’s limitations and deliver AI agents that feel as robust and seamless as the best chat experiences your users expect.
