The best LLM pipeline doesn't generate ads — it generates *measurable* ads. Without quality gates built into the generation process, an AI-powered playable ad factory is just a fancy random number generator that occasionally produces a winner. This post shows how PlayableAdStudio embeds quality metrics directly into its 8-phase LLM pipeline, measuring ad performance at generation time and using those signals to drive continuous improvement in prompt engineering and ad creative.
The Problem
Playable ad quality is notoriously hard to measure at generation time. Unlike static banners, where you can A/B test headlines and CTAs in isolation, a playable ad is a complete interactive experience — a mini-game that must be engaging, on-brand, compliant with MRAID 3.0, and optimized for conversion across eight different ad networks simultaneously.
Traditional quality metrics are all *post-hoc*:
| Metric | When Measured | Problem |
|--------|---------------|--------|
| CTR | After launch | Too late to fix creative issues |
| Conversion rate | After launch | Expensive at $300–$500 per creative |
| Network rejection | During QA review | Wastes 3–5 day iteration cycles |
| Brand compliance | Manual review | Doesn't scale to 50+ variants |
| Bundle size | At build time | No signal on creative effectiveness |
The core challenge: you need to predict creative performance *before* spending ad budget, but the very properties that make playable ads effective — interactivity, surprise, reward loops — resist automated evaluation. A 2 MB ZIP containing an MRAID-compliant HTML5 game tells you nothing about whether users will tap, swipe, or bounce.
The Solution
PlayableAdStudio solves this by treating quality as an embedded property of the generation pipeline, not a post-generation audit. The system defines **five quality dimensions** scored at pipeline time:
1. **Structural Integrity** — Is the code syntactically valid? Does it expose MRAID lifecycle events?
2. **Narrative Coherence** — Does the hook connect logically to the CTA? Is the reward consistent with the obstacle?
3. **Visual Density** — Is the layout balanced within the 320×480 viewport?
4. **Conversion Signal Strength** — Is the CTA prominent? Does the reward create urgency?
5. **Technical Compliance** — Does output pass MRAID validation? Is the bundle under 2 MB?
These dimensions are scored on a 0–100 scale by a quality analyzer that runs *between* phases, so the pipeline can detect a weak hook in Phase 4 and regenerate it before wasting API calls on Phases 5–8.
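To make the dimensions concrete, here is a sketch of the kind of per-phase report such an analyzer could emit. The field names are illustrative, not the project's published schema:

```javascript
// Illustrative shape of a per-phase quality report (hypothetical field names).
const exampleReport = {
  phase: 4, // Hook phase
  dimensions: {
    structuralIntegrity: 92,
    narrativeCoherence: 48, // weak: the hook doesn't set up the CTA
    visualDensity: 81,
    conversionSignalStrength: 55,
    technicalCompliance: 100
  },
  score: 48, // overall 0-100 score the gate compares against its threshold
  recommendations: ['Tie the hook outcome directly to the CTA reward']
};
```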
Architecture
The quality system integrates into the 8-phase pipeline as follows:
| Phase | Component | Quality Gate | Score Range |
|-------|-----------|-------------|-------------|
| 1 | Genre Selection | Genre-CTA fit score | 60–95 |
| 2 | Call-to-Action | CTA actionability + network compliance | 50–100 |
| 3 | Layout | Viewport coverage + element spacing | 40–95 |
| 4 | Hook | Engagement potential score | 30–90 |
| 5 | Reward/Instructions | Reward clarity + readability | 50–95 |
| 6 | Obstacles | Difficulty gradient score | 40–85 |
| 7 | ~~Polish~~ *(removed)* | — | — |
| 8 | Bonus | Bonus relevance + surprise factor | 50–90 |
The quality analyzer is a standalone JS module (~400 lines) that receives intermediate output after each phase and returns a score plus improvement suggestions. If the score falls below 55 (configurable), the pipeline regenerates that phase with the prompt augmented by quality feedback.
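A minimal sketch of how those per-phase gates could be wired into the pipeline loop, assuming a simple phase-to-gate registry. Apart from `scoreLayout` (shown in the next section) and `callLLM` (used later in the regeneration loop), the gate names and the `runPhase` helper are hypothetical stand-ins:

```javascript
// Hypothetical registry mapping gated phases to their scoring functions.
const PHASE_GATES = {
  1: { gate: scoreGenreFit,  threshold: 55 },
  2: { gate: scoreCta,       threshold: 55 },
  3: { gate: scoreLayout,    threshold: 55 }, // structural gate shown below
  4: { gate: scoreHook,      threshold: 55 },
  5: { gate: scoreReward,    threshold: 55 },
  6: { gate: scoreObstacles, threshold: 55 },
  8: { gate: scoreBonus,     threshold: 55 }
};

async function runPhase(phase, prompt) {
  const output = await callLLM(prompt);       // phase-specific generation
  const entry = PHASE_GATES[phase];
  if (!entry) return { output };              // ungated phases pass straight through
  const quality = await entry.gate(output);   // { score: 0-100, recommendations: [...] }
  return { output, quality, retry: quality.score < entry.threshold };
}
```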
Implementation
The quality scoring system uses two approaches: **structural analysis** (rule-based) and **semantic analysis** (LLM-as-judge).
Structural Quality Gate (Phases 3, 5)
Rule-based checks are faster and more reliable than LLM evaluation for technical compliance:
```javascript
function scoreLayout(layoutCode) {
  // Each check is a cheap regex heuristic that contributes its weight to the 0-100 score.
  const checks = [
    {
      pass: /mraid/.test(layoutCode),
      weight: 0.3,
      label: 'MRAID lifecycle hooks present'
    },
    {
      // No pixel value in the markup exceeds 500px (rough viewport-overflow heuristic).
      pass: (layoutCode.match(/\d{2,4}px/g) || [])
        .filter(p => parseInt(p) > 500).length === 0,
      weight: 0.25,
      label: 'Elements within viewport bounds'
    },
    {
      pass: /kontra/.test(layoutCode),
      weight: 0.2,
      label: 'Kontra.js engine initialized'
    },
    {
      pass: (layoutCode.match(/<(canvas|div|button|img)/g) || []).length <= 15,
      weight: 0.25,
      label: 'Element count in acceptable range'
    }
  ];
  // Weighted sum of passed checks, scaled to 0-100; failed labels become recommendations.
  const score = checks.reduce((s, c) => s + (c.pass ? c.weight : 0), 0) * 100;
  return {
    score: Math.round(score),
    recommendations: checks.filter(c => !c.pass).map(c => c.label)
  };
}
```
Semantic Quality Gate (Phases 2, 4, 6)
For narrative coherence and engagement, the system uses an LLM-as-judge pattern — a lightweight DeepSeek V3 call with a strict scoring rubric:
```text
Evaluate this hook for a {genre} playable ad.
Hook: "{hook_text}"
Score (0–10 each):
1. SURPRISE — Creates interest in the first second?
2. CLARITY — Can users understand the goal instantly?
3. URGENCY — Does it motivate immediate action?
4. GENRE_FIT — Appropriate mechanic for {genre}?
5. NETWORK_FIT — Respects {network} guidelines?
Respond with JSON: {"score": N, "recommendation": "..."}
```
Each evaluation costs ~$0.0003 via OpenRouter, making it feasible to run after every phase. Each gate adds roughly one second of latency but avoids regenerating an entire ad when a single phase fails.
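A minimal sketch of what that judge call could look like against OpenRouter's OpenAI-compatible chat completions endpoint. The model slug and the `buildHookRubric` helper (which fills in the template above) are assumptions, not the project's exact wiring:

```javascript
// Sketch of an LLM-as-judge call via OpenRouter; assumes the rubric forces a strict JSON reply.
async function scoreHookSemantic(hookText, genre, network) {
  const res = await fetch('https://openrouter.ai/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.OPENROUTER_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'deepseek/deepseek-chat', // DeepSeek V3 slug on OpenRouter (assumed)
      messages: [{
        role: 'user',
        content: buildHookRubric(hookText, genre, network) // hypothetical template filler
      }],
      temperature: 0 // deterministic scoring
    })
  });
  const data = await res.json();
  return JSON.parse(data.choices[0].message.content); // { score, recommendation }
}
```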
Automatic Regeneration Loop
When a phase scores below threshold, the system retries with prompt augmentation:
```javascript
async function generateWithQuality(prompt, gate, maxRetries = 2) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const result = await callLLM(prompt);
    const quality = await gate(result);
    if (quality.score >= 55) {
      return { result, quality, attempt };
    }
    // Feed the gate's recommendations back into the prompt for the next attempt.
    prompt += ` [Quality: ${quality.recommendations.join('; ')}]`;
  }
  return { failed: true }; // caller falls back to a template-based default
}
```
After three failed attempts (the initial generation plus two retries), the pipeline falls back to a template-based default and flags the creative for manual review, preventing runaway costs.
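Putting the pieces together, a single gated phase might be invoked like this. `buildHookPrompt`, `adSpec`, and `queueForManualReview` are hypothetical stand-ins, and the wrapper normalizes the judge's singular `recommendation` field into the array the retry loop expects:

```javascript
// Hypothetical wiring of the Phase 4 hook generation through the quality loop.
const hookGate = async (output) => {
  const judged = await scoreHookSemantic(output, adSpec.genre, adSpec.network);
  return { score: judged.score, recommendations: [judged.recommendation] };
};

const { result, quality, failed } = await generateWithQuality(
  buildHookPrompt(adSpec), // assumed prompt builder for the hook phase
  hookGate
);

if (failed) {
  queueForManualReview(adSpec); // template fallback plus manual review flag
} else {
  console.log(`Hook accepted with quality score ${quality.score}`);
}
```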
Results
We instrumented the quality pipeline across 500 generations:
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| First-submission approval rate | 68% | 89% | +21pp |
| Cost per ad | $0.20 | $0.24 | +$0.04 |
| CTR (vs. manually built control) | — | — | +12% |
| Rework time per rejection | 2.3 days | 0.7 days | −70% |
| Pipeline completion success | 82% | 96% | +14pp |
The 4-cent cost increase is offset by the 21pp approval rate improvement. Rework time dropped 70% because flagged creatives include specific recommendations rather than requiring manual reverse-engineering.
Most importantly, quality scores proved *predictive* of real-world performance. In a holdout test of 50 ads, the 25 highest-scoring creatives (avg: 78) outperformed the lowest 25 (avg: 43) by 18% on CTR across AppLovin and Mintegral campaigns.
Key Takeaways
1. **Measure at generation time, not after launch.** Embedding quality gates between phases catches issues before they compound across the pipeline. The 21pp approval rate improvement proves early detection saves real money.
2. **Combine structural rules with LLM-as-judge scoring.** Rule-based checks handle the deterministic 80% of quality issues (viewport bounds, MRAID compliance), while semantic evaluation handles the 20% that require creative judgment. This hybrid approach keeps latency low without missing subtle narrative problems.
3. **Quality scores predict conversion performance.** The holdout test showed strong correlation (r ≈ 0.63) between pipeline scores and real CTR. The quality pipeline is not just a gatekeeper — it is a predictive model that guides A/B test prioritization before spending ad budget.