The Optimization Problem

Trading bot strategies drift. Market regimes shift from trending to ranging. Volatility regimes change. Slippage patterns evolve as liquidity pools grow or shrink. A strategy that crushed it in January can bleed capital in February.

Most bot operators handle this by manually reviewing performance every few weeks and tweaking parameters. For DeFiKit's multi-agent system, that approach doesn't scale. When you have 15+ bots running across Solana, KuCoin, and HyperLiquid, each with its own strategy configuration, manual tuning becomes impossible.

This post covers how DeFiKit built an LLM-powered optimization pipeline that automatically analyzes bot performance data and proposes strategy adjustments — closing the feedback loop between analytics and execution.

The Optimization Pipeline

The pipeline runs as a weekly cron job on Cloudflare Workers. It has four stages:

1. **Data Collection** — Pull performance metrics from the analytics D1 database

2. **Analysis** — Feed the data to an LLM that detects patterns and anomalies and proposes concrete parameter adjustments

3. **Validation** — Run every suggested adjustment through a hardcoded risk filter

4. **Review** — Present the surviving recommendations to the operator via Telegram for approval

```
[D1 Analytics] → Data Collection → LLM Analysis → Recommendations → Telegram Review → Approval → Apply
```
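
For orientation, here is a minimal sketch of how the cron entry point could be wired as a module Worker. The `Env` bindings and stage helper names are illustrative, not DeFiKit's actual code; the ambient types come from `@cloudflare/workers-types`:

```typescript
// wrangler.toml cron trigger (Sundays 02:00 UTC):
//   [triggers]
//   crons = ["0 2 * * 0"]

export interface Env {
  DB: D1Database;           // analytics D1 binding (illustrative name)
  TELEGRAM_TOKEN: string;   // bot token secret
  DEEPSEEK_API_KEY: string; // LLM API key
}

// Stage helpers are assumed to live elsewhere in the Worker.
declare function collectWeeklyData(db: D1Database): Promise<unknown[]>;
declare function runLLMAnalysis(data: unknown[], env: Env): Promise<unknown>;
declare function validateRecommendations(analysis: unknown): unknown;
declare function sendTelegramReport(report: unknown, env: Env): Promise<void>;

export default {
  async scheduled(controller: ScheduledController, env: Env, ctx: ExecutionContext) {
    const weekly = await collectWeeklyData(env.DB);      // Stage 1
    const analysis = await runLLMAnalysis(weekly, env);  // Stage 2
    const vetted = validateRecommendations(analysis);    // Stage 3
    ctx.waitUntil(sendTelegramReport(vetted, env));      // Stage 4
  },
};
```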

The key insight is that the LLM never executes changes automatically. It generates recommendations; a human reviews and approves them. That keeps a safety layer in place while automating the time-consuming analysis work.

Stage 1: Data Collection

Every Sunday at 02:00 UTC, the pipeline queries D1 for the past 7 days of performance data per bot:

```typescript
const weeklyData = await db.prepare(`
  SELECT
    bot_id,
    strategy,
    pair,
    COUNT(*) as total_signals,
    SUM(CASE WHEN event_type = 'trade_open' THEN 1 ELSE 0 END) as trades,
    SUM(CASE WHEN event_type = 'trade_close' THEN 1 ELSE 0 END) as closed_trades,
    AVG(json_extract(metadata, '$.pnl_usd')) as avg_pnl,
    SUM(json_extract(metadata, '$.pnl_usd')) as total_pnl,
    AVG(json_extract(metadata, '$.latency_ms')) as avg_latency,
    SUM(CASE WHEN event_type = 'error' THEN 1 ELSE 0 END) as error_count
  FROM bot_events
  WHERE timestamp > datetime('now', '-7 days')
  GROUP BY bot_id, strategy, pair
  ORDER BY total_pnl DESC
`).all();
```

The result is a structured dataset grouped by bot, including trade volume, P&L, latency, and error metrics.
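
For downstream type safety, the row shape can be captured in an interface whose fields mirror the query aliases. A sketch (the nullable fields reflect that SQLite aggregates return NULL when no matching rows carry the JSON key):

```typescript
// One aggregated row from the weekly query above.
// Usable as the type argument to D1's .all<T>().
interface WeeklyBotMetrics {
  bot_id: string;
  strategy: string;
  pair: string;
  total_signals: number;
  trades: number;
  closed_trades: number;
  avg_pnl: number | null;     // NULL if no events carried $.pnl_usd
  total_pnl: number | null;
  avg_latency: number | null; // NULL if no events carried $.latency_ms
  error_count: number;
}
```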

Stage 2: LLM Analysis Prompt

The collected data is formatted into a structured prompt for the LLM. We use DeepSeek V4 for this analysis, mainly because it is cost-effective:

```
You are a quantitative trading analyst. Review the following bot performance data from the past 7 days.

For each bot:
1. Identify whether performance is improving, declining, or stable
2. Detect anomalies (unusual patterns in trade frequency, PnL, or latency)
3. Suggest parameter adjustments with specific values
4. Flag any bots that should be paused for manual review

Bot Data:
[weekly_data_json]

Previous week baseline:
[baseline_json]

Historical warnings:
[warnings_json]

Respond in this exact JSON format:
{
  "bots": [
    {
      "bot_id": "solana_sniper_01",
      "assessment": "declining",
      "confidence": 0.85,
      "findings": ["Trade frequency dropped 40% vs last week", "Slippage increased from 0.3% to 1.2%"],
      "recommendations": [
        {"parameter": "max_slippage_bps", "current": 50, "suggested": 100, "rationale": "Increasing to accommodate higher volatility in memecoin pairs"},
        {"parameter": "min_liquidity_usd", "current": 50000, "suggested": 100000, "rationale": "Higher threshold to avoid low-liquidity traps"}
      ],
      "action": "review",
      "urgency": "high"
    }
  ],
  "summary": "Overall portfolio down 3.2% vs last week. Two bots flagged for high-urgency review.",
  "warnings": ["solana_sniper_01 error rate up 300% — possible RPC issues"]
}
```

The structured JSON output is critical. It lets the pipeline parse the LLM's analysis programmatically without needing a second LLM call to extract meaning.
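
A sketch of what the call-and-parse step could look like, using DeepSeek's OpenAI-compatible chat endpoint. The `deepseek-chat` model id, the JSON-mode flag, and the `requestAnalysis` helper name are assumptions to verify against current docs, not confirmed details of DeFiKit's implementation:

```typescript
// Expected shape of the LLM's JSON reply (mirrors the prompt's schema).
interface AnalysisResult {
  bots: {
    bot_id: string;
    assessment: 'improving' | 'declining' | 'stable';
    confidence: number;
    findings: string[];
    recommendations: { parameter: string; current: number; suggested: number; rationale: string }[];
    action: string;
    urgency: string;
  }[];
  summary: string;
  warnings: string[];
}

async function requestAnalysis(prompt: string, apiKey: string): Promise<AnalysisResult | null> {
  // DeepSeek exposes an OpenAI-compatible chat completions endpoint.
  const res = await fetch('https://api.deepseek.com/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({
      model: 'deepseek-chat', // assumed model id; check current docs
      messages: [{ role: 'user', content: prompt }],
      response_format: { type: 'json_object' }, // ask for strict JSON output
    }),
  });
  if (!res.ok) return null;
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  try {
    return JSON.parse(data.choices[0].message.content) as AnalysisResult;
  } catch {
    return null; // malformed output is treated as "no analysis this week"
  }
}
```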

Stage 3: Risk Fallback Validation

Before recommendations reach the operator, they pass through a hardcoded risk filter. This is a safety net — no matter what the LLM suggests, certain changes are blocked:

```typescript
// Hard bounds on any parameter the LLM is allowed to adjust.
const SAFETY_LIMITS: Record<string, { min: number; max: number }> = {
  max_slippage_bps: { min: 10, max: 300 },
  position_size_usd: { min: 10, max: 1000 },
  stop_loss_pct: { min: 1, max: 50 },
  take_profit_pct: { min: 1, max: 200 },
  min_liquidity_usd: { min: 1000, max: 500000 },
};

interface Recommendation {
  parameter: string;
  current: number;
  suggested: number;
  rationale: string;
}

function validateRecommendation(rec: Recommendation): boolean {
  const limits = SAFETY_LIMITS[rec.parameter];
  if (!limits) return true; // unknown parameter, let human decide
  return rec.suggested >= limits.min && rec.suggested <= limits.max;
}
```

Any recommendation outside these bounds is dropped and logged rather than presented to the operator. This prevents a hallucinated extreme value from accidentally getting approved.
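
Applied over the LLM output, the filter could look like this (a sketch; `filterRecommendations` is an illustrative name, and `AnalysisResult` matches the JSON schema shown in Stage 2):

```typescript
// Strip out-of-bounds suggestions; log them instead of surfacing them.
function filterRecommendations(result: AnalysisResult): AnalysisResult {
  for (const bot of result.bots) {
    bot.recommendations = bot.recommendations.filter((rec) => {
      if (validateRecommendation(rec)) return true;
      console.warn(
        `[safety] dropped ${bot.bot_id}/${rec.parameter}: ` +
        `suggested ${rec.suggested} is outside the hardcoded limits`
      );
      return false;
    });
  }
  return result;
}
```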

Stage 4: Telegram Review Workflow

The final stage delivers a concise summary to the operator's Telegram:

```
📊 DeFiKit Weekly Optimization Report

📉 SOL-Sniper-01 — Assessment: DECLINING (confidence: 85%)
• Trade frequency dropped 40% vs last week
• Slippage increased from 0.3% to 1.2%
• 🟡 Suggest: max_slippage_bps 50 → 100
• 🟡 Suggest: min_liquidity_usd 50K → 100K

📈 XRP-Ichimoku-01 — Assessment: STABLE (confidence: 92%)
• Performance within expected range
• No adjustments recommended

⚠️ Actions needed:
/approve SOL-Sniper-01
/pause SOL-Sniper-01
```

The operator can tap `/approve SOL-Sniper-01` to apply the recommendations, or `/pause SOL-Sniper-01` to halt trading instead. Every action must be confirmed with a second message to prevent fat-finger errors.
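
A sketch of how the two-step confirmation could be handled against the raw Telegram Bot API. The in-memory pending store, the 5-minute window, and the `applyRecommendations` helper are simplifying assumptions:

```typescript
// Pending approvals kept in memory for illustration only; a production
// version would persist them in KV or D1, since Worker instances are ephemeral.
const pendingApprovals = new Map<string, number>(); // bot_id -> first-tap timestamp

declare function applyRecommendations(botId: string): Promise<void>; // assumed helper

async function sendMessage(token: string, chatId: number, text: string): Promise<void> {
  await fetch(`https://api.telegram.org/bot${token}/sendMessage`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ chat_id: chatId, text }),
  });
}

async function handleCommand(token: string, chatId: number, text: string): Promise<void> {
  const [cmd, botId] = text.trim().split(/\s+/);
  if (cmd !== '/approve' || !botId) return;

  const firstTap = pendingApprovals.get(botId);
  if (firstTap && Date.now() - firstTap < 5 * 60_000) {
    // Second /approve within 5 minutes: actually apply the change.
    pendingApprovals.delete(botId);
    await applyRecommendations(botId);
    await sendMessage(token, chatId, `✅ Recommendations applied for ${botId}`);
  } else {
    pendingApprovals.set(botId, Date.now());
    await sendMessage(token, chatId, `Repeat /approve ${botId} within 5 minutes to confirm.`);
  }
}
```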

Real-World Results

After running this pipeline for 8 weeks on DeFiKit's production bots:

| Metric | Before Pipeline | After Pipeline |
|--------|----------------|----------------|
| Weekly review time | 3+ hours (manual) | 4 minutes (approval only) |
| Strategy adjustments | Every 2-3 weeks | Every week, data-driven |
| Drawdown events caught early | 2 of 7 (29%) | 6 of 8 (75%) |
| Average weekly P&L change | -1.2% | +2.1% |
| Operator satisfaction | "I hate doing these reviews" | "This is actually useful" |

The pipeline caught a critical issue in week 3: the Solana sniper bot's error rate spiked because an RPC provider had silently deprecated its WebSocket endpoint. The LLM flagged the pattern (increased connection errors alongside stable trade volume), which a manual review might have taken days to notice.

Lessons Learned

LLM Bias Toward Action

We noticed the LLM wanted to adjust parameters even when the data showed stable performance. Adding an explicit "STABLE" assessment category with a confidence threshold (above 80%) reduced unnecessary recommendation noise by 60%.
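
One way to implement that gate, as a sketch (`suppressStableNoise` is an illustrative name, reusing the `AnalysisResult` shape from Stage 2):

```typescript
// Drop suggestions for bots the LLM confidently rates as stable.
function suppressStableNoise(result: AnalysisResult): AnalysisResult {
  for (const bot of result.bots) {
    if (bot.assessment === 'stable' && bot.confidence > 0.8) {
      bot.recommendations = []; // nothing to review for this bot
    }
  }
  return result;
}
```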

Context Length Matters

The full 7-day dataset for 15 bots is ~8,000 tokens. Including 4 weeks of historical baselines pushes it to ~12,000 tokens. DeepSeek V4 handles this easily, but we compress historical data by only including the past 4 weekly summary rows per bot, not the raw events.
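
The compression step could be as simple as querying pre-aggregated rows. A sketch that assumes a `weekly_summaries` table the pipeline writes after each run, which is not shown elsewhere in this post:

```typescript
// Fetch compact baselines: the last 4 weekly summary rows per bot,
// instead of re-sending tens of thousands of raw events.
const baselines = await db.prepare(`
  SELECT bot_id, week_start, total_pnl, trades, error_count
  FROM weekly_summaries
  WHERE week_start > datetime('now', '-28 days')
  ORDER BY bot_id, week_start DESC
`).all();
```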

Operator Trust is Earned

In the first two weeks, operators overrode 73% of LLM recommendations; by week 8, the acceptance rate had risen to 61%. The improvement came from showing the LLM's reasoning and confidence scores, not just the final recommendation. When operators could see *why* the LLM suggested something, they trusted it more.

Next Steps

We're building a backtesting integration: before presenting a recommendation, the pipeline simulates the proposed change against historical data using the existing Freqtrade backtesting engine. If the backtest shows expected improvement >5%, the recommendation gets an "AI verified" badge. If the backtest shows regression, the recommendation is automatically rejected with the evidence presented to the operator.
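
A sketch of how that gate might look. This is a design illustration only, since the feature isn't built yet; `runBacktest` stands in for the planned Freqtrade integration:

```typescript
declare function runBacktest(
  botId: string,
  override: Record<string, number>
): Promise<{ pnlChangePct: number }>; // stand-in for the Freqtrade integration

async function gateRecommendation(
  botId: string,
  rec: Recommendation
): Promise<'verified' | 'unverified' | 'rejected'> {
  const result = await runBacktest(botId, { [rec.parameter]: rec.suggested });
  if (result.pnlChangePct > 5) return 'verified';  // earns the "AI verified" badge
  if (result.pnlChangePct < 0) return 'rejected';  // auto-rejected, evidence shown to operator
  return 'unverified';                             // presented without a badge
}
```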

This closes the loop completely: data → analyze → simulate → recommend → approve → execute → measure → repeat.