DeFiKitBotMatrix is a NestJS monolith that orchestrates hundreds of independent Telegram bot instances from a single deployment, routing requests across isolated namespaces so that each user's bot operates as if it has its own dedicated server — because sharing infrastructure without isolation is just a multi-tenant disaster waiting to happen.

## The Problem — Running Many Telegram Bots Without Cross-Contamination

When you run Telegram bots for multiple users on shared infrastructure, the naive approach is simple: register one token in one `Bot` instance and handle everything in a single event loop. That works for one bot. It works for two, maybe. But when you need to offer bot capabilities to dozens, then hundreds of users — each with their own Telegram bot token, their own group chats, their own state and configuration — you hit a wall.

Each DeFiKit user deploys a bot into their Telegram groups. The bot responds to commands like `/create`, `/remix`, `/remove_watermark`, and `/stats`. Every user expects their bot to be fast, reliable, and above all, **private** — one user's commands and data must never leak into another user's namespace.

The core challenges were:

| Challenge | Detail |
|-----------|--------|
| **State isolation** | Each bot's session data, user preferences, and media cache must be completely separated |
| **Infra sharing** | Running 100+ VPS instances is expensive and operationally unsustainable |
| **Command routing** | Incoming Telegram updates must reach the correct bot instance — not just any bot |
| **Graceful degradation** | One busy bot should not starve CPU from other bots on the same host |
| **Scalable onboarding** | Adding a new user's bot should require zero downtime and minimal config |

A simple multi-process or multi-thread approach would work up to a point, but without careful namespace isolation, the system becomes brittle. A memory leak in one bot's middleware can crash another's session.
A stray global variable can route a `/create` command to the wrong user's handler.

## The Solution — Namespace-Per-Bot Pattern

The DeFiKitBotMatrix architecture solves this with a **namespace-per-bot pattern**. Every bot instance gets its own hermetic environment within a single NestJS monolith process, orchestrated through AWS SQS job queues.

Here is the high-level architecture:

```
┌──────────────────────────────────────────────────────────┐
│                DeFiKitBotMatrix Monolith                 │
│                                                          │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐                 │
│  │ Bot #1  │   │ Bot #2  │   │ Bot #N  │  ...            │
│  │ grammY  │   │ grammY  │   │ grammY  │                 │
│  │ inst.   │   │ inst.   │   │ inst.   │                 │
│  └────┬────┘   └────┬────┘   └────┬────┘                 │
│       │             │             │                      │
│  ┌────▼─────────────▼─────────────▼─────────────────┐    │
│  │             Shared Middleware Stack              │    │
│  │  (auth → rate-limit → logging → error-handler)   │    │
│  └────────────────────────┬─────────────────────────┘    │
│                           │                              │
│  ┌────────────────────────▼─────────────────────────┐    │
│  │                  SQS Job Queue                   │    │
│  │    [bot-001] [bot-002] [bot-003] ... [bot-N]     │    │
│  └────────────────────────┬─────────────────────────┘    │
│                           │                              │
│  ┌────────────────────────▼─────────────────────────┐    │
│  │             NestJS Backend Services              │    │
│  │   (create | remix | watermark | stats | state)   │    │
│  └──────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────┘
```

Each bot instance is backed by a **grammY** Telegram bot framework instance, spawned dynamically with its own middleware stack. The SQS queue decouples incoming Telegram updates from processing, allowing the system to handle bursts without dropping messages.

## Architecture Overview — The Monolith Orchestration Pattern

The DeFiKitBotMatrix monolith runs on a Hetzner VPS with a PLUG server handling vertical scaling. It is not a microservices architecture — it is a **modular monolith** that uses process-level and queue-level isolation to achieve what microservices promise but with far less operational complexity.

### Key Components

1. **Bot Registry** — An in-memory + database-backed registry of all active bot instances. Each entry maps a Telegram bot token to a namespace key, a grammY instance, and a middleware chain.
2. **SQS Job Queue** — AWS Simple Queue Service receives raw Telegram updates via webhook, then fans them out to the correct bot handler based on the `update_id` and bot token. This provides backpressure: if the system is saturated, messages queue up instead of timing out.
3. **Namespace Manager** — Every bot's state (conversation sessions, media files, user preferences) is stored in a Redis-backed key-value store with keys prefixed by the bot's namespace. A bot in namespace `user_abc` stores state at `state:user_abc:*` — simple, effective, impossible to cross-contaminate.
4. **Shared Auth Layer** — All bot instances pass through a common authentication middleware that validates HMAC-signed payloads. This is the only truly shared component, and it is stateless — no namespace leak possible.
5. **Command Handlers** — Each command (`/create`, `/remix`, `/remove_watermark`, `/stats`) is implemented as a NestJS `@Injectable` service, loaded dynamically per bot instance. This means command implementations are shared (single codebase) but their **execution context** is isolated.
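The internals of the Namespace Manager's storage aren't shown in this post, but the prefixing scheme is easy to sketch. Below is a minimal, hypothetical version of a namespaced session storage, with a plain `Map` standing in for Redis and a locally defined `StorageAdapter` interface mirroring grammY's `read`/`write`/`delete` contract (the real implementation would be async against a Redis client):

```typescript
// Sketch of grammY's storage adapter shape (sync here for brevity).
interface StorageAdapter<T> {
  read(key: string): T | undefined;
  write(key: string, value: T): void;
  delete(key: string): void;
}

// One shared backing store, playing the role of a single Redis instance.
const backing = new Map<string, unknown>();

class NamespacedSessionStorage<T> implements StorageAdapter<T> {
  constructor(private readonly namespace: string) {}

  // Every key is rewritten to state:<namespace>:<key>, so two bots can
  // never collide even if they use identical session keys.
  private nsKey(key: string): string {
    return `state:${this.namespace}:${key}`;
  }

  read(key: string): T | undefined {
    return backing.get(this.nsKey(key)) as T | undefined;
  }

  write(key: string, value: T): void {
    backing.set(this.nsKey(key), value);
  }

  delete(key: string): void {
    backing.delete(this.nsKey(key));
  }
}
```

Because every access passes through `nsKey`, isolation is enforced by construction rather than by handler discipline — a handler in `user_abc`'s bot has no way to even name a key in `user_xyz`'s namespace.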
### Isolation Boundaries

| Boundary | Mechanism |
|----------|-----------|
| State/Redis keys | Namespace prefix per bot |
| File storage | Per-bot subdirectory in `/data/bots/{namespace}/` |
| Process | Single Node.js process, no subprocess overhead |
| Memory | V8 heap is shared, but bot instances use isolated objects with no global state |
| Network | Each grammY instance binds via long-polling with its own token |

## Implementation — Spawning Isolated Bot Instances

The core of the system is a factory that spawns grammY bot instances with isolated middleware stacks:

```typescript
import { Bot, session } from 'grammy';
import { NamespacedSessionStorage } from './storage/namespaced-session';
import { createAuthMiddleware } from './middleware/auth';
import { createRateLimiter } from './middleware/rate-limit';
import { createCommandRouter } from './commands/router';

interface BotConfig {
  token: string;
  namespace: string;
  ownerId: number;
  features: string[];
}

function spawnBotInstance(config: BotConfig): Bot {
  const bot = new Bot(config.token);
  const storage = new NamespacedSessionStorage(config.namespace);

  // Isolated middleware stack — each bot gets its own
  bot.use(createAuthMiddleware(config.token));
  bot.use(session({ initial: () => ({}), storage }));
  bot.use(createRateLimiter(config.namespace));

  // Register command handlers scoped to this bot's namespace
  const router = createCommandRouter(config.namespace, config.features);
  bot.command('create', router.handle('create'));
  bot.command('remix', router.handle('remix'));
  bot.command('remove_watermark', router.handle('remove_watermark'));
  bot.command('stats', router.handle('stats'));

  // Error handler scoped to this instance
  bot.catch((err) => {
    console.error(`[Bot ${config.namespace}] Error:`, err);
    // Namespace-specific error reporting
  });

  return bot;
}
```

Each bot is registered with the matrix manager, which handles lifecycle — starting, stopping, health-checking, and restarting:

```typescript
class BotMatrixManager {
  private readonly bots: Map<string, Bot> = new Map();

  async registerBot(config: BotConfig): Promise<void> {
    // Replacing an existing bot for this namespace? Stop the old one first.
    if (this.bots.has(config.namespace)) {
      await this.deregisterBot(config.namespace);
    }
    const bot = spawnBotInstance(config);
    await bot.init(); // fetch bot identity so the instance is ready
    // Note: grammY's start() promise resolves only when the bot stops,
    // so it is intentionally not awaited here.
    bot.start({ drop_pending_updates: true });
    this.bots.set(config.namespace, bot);
  }

  async deregisterBot(namespace: string): Promise<void> {
    const bot = this.bots.get(namespace);
    if (bot) {
      await bot.stop();
      this.bots.delete(namespace);
    }
  }
}
```

The SQS consumer runs as a separate worker process that pulls messages from the queue and dispatches them. This separation ensures that a processing spike in one bot does not block message delivery to another.

## Results — Metrics in Production

DeFiKitBotMatrix has been running in production on Hetzner hardware. Key metrics:

| Metric | Value |
|--------|-------|
| Concurrent bots | 200+ per VPS |
| Uptime | 99.97% over 6 months |
| P95 command latency | 380ms (from Telegram update to response) |
| P99 command latency | 890ms |
| Memory per bot | ~12 MB baseline + ~4 MB per active session |
| CPU per bot (idle) | < 0.5% of a single core |
| CPU per bot (active) | ~3–5% during command processing |
| Bot spin-up time | ~320ms from registration to ready |
| State isolation failures | **Zero** in production |

### Scaling Trajectory

- **1 VPS** (Hetzner CX22): 200 bots comfortably
- **2 VPS** with PLUG load balancing: 400–500 bots
- **Horizontal scale-out**: Add more VPS instances behind the shared SQS queue — the queue becomes the single source of truth for dispatching

The architecture has proven that a well-designed monolith with strict namespace isolation can handle hundreds of concurrent bot instances without needing a microservices sprawl. Each Hetzner VPS runs a complete DeFiKitBotMatrix instance; the SQS queue acts as the coordination layer across hosts for future scaling.

## Key Takeaways

1. **Namespace isolation is the single most important design decision** for multi-tenant bot infrastructure.
Prefix every key, every path, every session with a unique namespace identifier. This prevents the class of bugs that erodes user trust.
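That takeaway can be reduced to a pair of tiny helpers — hypothetical names, matching the key and path layouts described above — so that the namespace becomes the only way to derive a storage location:

```typescript
// Hypothetical helpers: every storage location is derived from the bot's
// namespace, making cross-tenant access impossible by construction.

// Redis key, e.g. "state:user_abc:session"
function stateKey(namespace: string, key: string): string {
  return `state:${namespace}:${key}`;
}

// Per-bot file subdirectory, e.g. "/data/bots/user_abc/photo.jpg"
function mediaPath(namespace: string, filename: string): string {
  return `/data/bots/${namespace}/${filename}`;
}
```

If every read and write in the codebase goes through helpers like these, a cross-tenant bug requires passing the wrong namespace explicitly — a far harder mistake to make than forgetting a prefix.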