When you build an LLM-powered playable ad generator, every prompt you send, every phase your pipeline runs, and every artifact it produces is marketing signal. The question is what you do with that signal.

PlayableAd Studio (playableadstudio.com) was designed as a Cloudflare Pages application that generates MRAID-compliant playable ads from natural language prompts. It uses a Bring Your Own Key (BYOK) architecture, streams responses via Server-Sent Events (SSE), and targets 8+ ad networks including Vungle, TikTok/Pangle, Google, Meta, Unity, and AppLovin. But somewhere between the scaffolding and the sandbox runtime verification, something interesting happened: the LLM pipeline itself became a marketing analytics engine.

The Problem: The Black Box of Playable Ad Performance

Playable ads generate 30-70% higher conversion rates than video ads in gaming campaigns, but they suffer from a fundamental analytics problem. A video ad is a flat file - you measure views, CTR, and installs. A playable ad is an interactive HTML/JS bundle - it has a game loop, level progression, user interaction patterns, and rendering performance characteristics that directly impact user acquisition costs. Ad networks report installs and impressions, but they don't tell you whether your game loop mechanic engages users, whether the polish phase reduced early-exit rates, or whether the sandbox caught a rendering bug that cratered FPS on low-end devices.

Marketers look at CPI. Developers look at build logs. The two views never converge. There is no feedback loop between generation decisions and outcomes - just hope and re-runs.

The Solution: The Pipeline as Instrumentation

PlayableAd Studio's architecture solves this by treating every phase of the LLM generation pipeline as an instrumented data collection point. The pipeline:

```
plan -> scaffold -> gameplay -> polish -> critique -> sandbox-verify -> repair -> variation -> finish-genre -> bundle
```

...is not just a code generation sequence. It's a measurement framework. Each phase produces structured output that feeds both the next phase *and* an evaluation layer that scores quality, tracks regressions, and informs rollout decisions.

The key insight is the **Evaluation Layer** - a bounded context separate from the execution orchestrator. Its job isn't to help generate ads; it's to score them, compare them against baselines, maintain a quality history, and determine whether a template version is ready for production. This is a marketing analytics system embedded inside a developer tool.
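To make the separation concrete, here is a minimal sketch of a phase runner that passes artifacts forward while copying evidence into a separate log for evaluation. The phase names come from the pipeline above; `runPipeline`, the handler shape, and the evidence fields are assumptions for illustration, not the project's actual orchestrator.

```javascript
// Hypothetical sketch: phase names come from the pipeline above; function
// names, handler shape, and evidence fields are illustrative assumptions.
const PHASES = [
  "plan", "scaffold", "gameplay", "polish", "critique",
  "sandbox-verify", "repair", "variation", "finish-genre", "bundle",
];

// Runs each phase in order. A handler receives the previous artifact and
// returns { artifact, evidence }. The artifact flows to the next phase;
// the evidence is appended to a log that only the evaluation side reads.
function runPipeline(handlers, initialArtifact) {
  const evidenceLog = [];
  let artifact = initialArtifact;
  for (const phase of PHASES) {
    const handler = handlers[phase];
    if (!handler) continue; // phases without a handler are skipped in this sketch
    const startedAt = Date.now();
    const result = handler(artifact);
    artifact = result.artifact;
    evidenceLog.push({
      phase,
      durationMs: Date.now() - startedAt,
      evidence: result.evidence,
    });
  }
  return { artifact, evidenceLog };
}

// Stub handlers standing in for LLM calls (invented values).
const demoHandlers = {
  plan: (brief) => ({
    artifact: { brief, levels: ["intro", "core", "cta"] },
    evidence: { levelCount: 3 },
  }),
  scaffold: (plan) => ({
    artifact: { ...plan, html: "<canvas></canvas>" },
    evidence: { bytes: "<canvas></canvas>".length },
  }),
};

const demoRun = runPipeline(demoHandlers, "Block puzzle game with space theme");
console.log(demoRun.evidenceLog.map((entry) => entry.phase)); // [ 'plan', 'scaffold' ]
```

Because the evidence log is built independently of the artifact, a scoring component can consume it without touching generation code.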

Architecture Overview: Five Bounded Contexts

PlayableAd Studio decomposes into five bounded contexts:

| Context | Role | Marketing Function |
|---|---|---|
| **Experience Layer** | User input, run initiation, result display | Campaign brief capture |
| **Execution Layer** | Phase orchestration, retry/abort | Generation funnel analytics |
| **Template Registry** | Template versions, quality gates | Creative template A/B testing |
| **Evaluation Layer** | Score, baseline comparison, regression | Primary marketing analytics |
| **Artifact & Evidence** | Output storage, audit trail | Creative performance correlation |

The Marketing-Feedback Loop

The Evaluation Layer creates the closed feedback loop:

1. **Run begins:** Template version, prompt version, and scoring policy are pinned to the snapshot - analytics instrumentation from step one.

2. **Phase execution:** Each phase produces structured evidence that doubles as marketing data: `plan.levels.length` indicates ad complexity, and `sandbox.fps_average` correlates with user experience.

3. **Quality gates:** Static and runtime gates produce scored, versioned results feeding evaluation history.

4. **Scoring formula:** `0.20*plan + 0.15*scaffold + 0.15*gameplay + 0.15*polish + 0.10*critique + 0.20*sandbox + 0.05*bundle`. The weights sum to 1.0, and every component is a marketing-relevant metric.

5. **Rollout decision:** Only template versions with a positive quality trend reach production.
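The scoring and rollout steps can be sketched as follows. The weights are taken from the formula in step 4. The 0-100 score scale, the sample scores, and the specific trend rule in `hasPositiveTrend` are assumptions, since the article only states that a "positive quality trend" is required.

```javascript
// Weights copied from the scoring formula; they sum to 1.0.
const WEIGHTS = {
  plan: 0.20, scaffold: 0.15, gameplay: 0.15, polish: 0.15,
  critique: 0.10, sandbox: 0.20, bundle: 0.05,
};

// Computes the weighted composite. Assumes each phase score is on a 0-100
// scale and treats a missing score as an error rather than a silent zero.
function compositeScore(phaseScores) {
  let total = 0;
  for (const [phase, weight] of Object.entries(WEIGHTS)) {
    const score = phaseScores[phase];
    if (typeof score !== "number") {
      throw new Error(`missing score for phase "${phase}"`);
    }
    total += weight * score;
  }
  return total;
}

// Hypothetical rollout rule: the latest composite must beat the mean of all
// earlier composites. The real policy is not specified in the article.
function hasPositiveTrend(history) {
  if (history.length < 2) return false;
  const earlier = history.slice(0, -1);
  const mean = earlier.reduce((sum, value) => sum + value, 0) / earlier.length;
  return history[history.length - 1] > mean;
}

// Invented sample scores for one run.
const exampleComposite = compositeScore({
  plan: 92, scaffold: 90, gameplay: 94, polish: 88,
  critique: 85, sandbox: 95, bundle: 100,
});
console.log(exampleComposite.toFixed(2)); // 91.70
console.log(hasPositiveTrend([80, 82, 90])); // true
```

Throwing on a missing phase score is a deliberate choice in this sketch: a silently zeroed phase would look like a quality regression in the trend data.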

Implementation Details

The analytics feedback loop is wired directly into the architecture.

Snapshot as Analytics Record

Every run creates a snapshot that captures marketing-critical data:

```json
{
  "brief": "Block puzzle game with space theme, 30 second play time",
  "template_version": "path-routing@1.3.0",
  "phase_outputs": {
    "plan": { "score": 92 },
    "gameplay": { "score": 94 },
    "sandbox-verify": { "score": 95, "fps_avg": 58 }
  }
}
```

This answers: Which template version performs best? Which phase introduces the most errors? What is average FPS across all generated ads?
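As an illustration of how those questions might be answered, the sketch below aggregates a list of snapshots. Field names follow the snapshot example above; the helper functions and the sample data are hypothetical.

```javascript
// Groups snapshots by template_version and averages one phase's score.
// Field names follow the snapshot example; the helper itself is hypothetical.
function averagePhaseScoreByTemplate(snapshots, phase) {
  const buckets = new Map();
  for (const snapshot of snapshots) {
    const output = snapshot.phase_outputs[phase];
    if (!output) continue; // this run never recorded the phase
    const bucket = buckets.get(snapshot.template_version) || { sum: 0, count: 0 };
    bucket.sum += output.score;
    bucket.count += 1;
    buckets.set(snapshot.template_version, bucket);
  }
  const averages = {};
  for (const [version, { sum, count }] of buckets) {
    averages[version] = sum / count;
  }
  return averages;
}

// Mean sandbox FPS across all runs that recorded one; null if none did.
function averageFps(snapshots) {
  const values = snapshots
    .map((snapshot) => snapshot.phase_outputs["sandbox-verify"])
    .filter((output) => output && typeof output.fps_avg === "number")
    .map((output) => output.fps_avg);
  if (values.length === 0) return null;
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Invented sample data for illustration only.
const sampleSnapshots = [
  {
    template_version: "path-routing@1.3.0",
    phase_outputs: { gameplay: { score: 94 }, "sandbox-verify": { score: 95, fps_avg: 58 } },
  },
  {
    template_version: "path-routing@1.3.0",
    phase_outputs: { gameplay: { score: 90 }, "sandbox-verify": { score: 80, fps_avg: 42 } },
  },
  {
    template_version: "path-routing@1.2.0",
    phase_outputs: { gameplay: { score: 81 } },
  },
];

console.log(averagePhaseScoreByTemplate(sampleSnapshots, "gameplay"));
console.log(averageFps(sampleSnapshots)); // 50
```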

Quality Gates as Funnel Analysis

| Gate | What It Checks | Marketing Equivalent |
|---|---|---|
| Input Guard | Valid prompt, correct config | Campaign brief quality check |
| Static Contract | JSON shape, JS parse, CTA | Creative format compliance |
| Template Contract | Plan matches template mechanics | Brand guideline enforcement |
| Runtime Verification | Sandbox signals, FPS, black-screen | User experience QA |
| Bounded Repair | Targeted LLM fixes, scoring limit | Creative iteration with budget |
| Shipping Compliance | Meets MRAID/TikTok requirements | Ad network approval gate |

Each gate produces a pass/fail/review verdict. Tracked over time, the per-gate pass rate shows where in the creative process runs are being lost.
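A minimal sketch of that funnel view, assuming each run stores one verdict per gate it reached. The gate identifiers are slugs derived from the table above, and the run shape and sample data are invented for illustration.

```javascript
// Gate identifiers are hypothetical slugs of the gate names in the table.
const GATES = [
  "input-guard", "static-contract", "template-contract",
  "runtime-verification", "bounded-repair", "shipping-compliance",
];

// For each gate, computes passed / reached. A run that failed an earlier
// gate has no verdict for later gates and is excluded from their rate.
function gatePassRates(runs) {
  const rates = {};
  for (const gate of GATES) {
    const verdicts = runs
      .map((run) => run.verdicts[gate])
      .filter((verdict) => verdict !== undefined);
    const passed = verdicts.filter((verdict) => verdict === "pass").length;
    rates[gate] = verdicts.length === 0 ? null : passed / verdicts.length;
  }
  return rates;
}

// Returns the gate with the lowest pass rate, ignoring gates no run reached.
function weakestGate(rates) {
  let weakest = null;
  for (const [gate, rate] of Object.entries(rates)) {
    if (rate === null) continue;
    if (weakest === null || rate < rates[weakest]) weakest = gate;
  }
  return weakest;
}

// Invented sample runs.
const sampleRuns = [
  { verdicts: {
    "input-guard": "pass", "static-contract": "pass", "template-contract": "pass",
    "runtime-verification": "pass", "bounded-repair": "pass", "shipping-compliance": "pass",
  } },
  { verdicts: {
    "input-guard": "pass", "static-contract": "pass", "template-contract": "pass",
    "runtime-verification": "fail",
  } },
  { verdicts: { "input-guard": "pass", "static-contract": "fail" } },
  { verdicts: {
    "input-guard": "pass", "static-contract": "pass", "template-contract": "pass",
    "runtime-verification": "review",
  } },
];

const sampleRates = gatePassRates(sampleRuns);
console.log(weakestGate(sampleRates)); // runtime-verification
```

Here a `review` verdict counts as not passed, which is one possible convention; a real dashboard might track it as a third rate.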

The SkeletonComposer: Determinism as Baseline

One of the most surprising analytics features is the SkeletonComposer. Unlike every other phase, the skeleton phase is **deterministic**: it interpolates the plan into a reference skeleton using slot replacement and consumes zero LLM tokens:

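```javascript
// Fill the reference skeleton's slots with values derived from the plan.
// No LLM call is involved, so identical plans yield identical code.
const code = this.interpolator.interpolate(
  referenceSkeleton,
  this._planToSkeletonSlots(plan)
);
```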

This creates a **perfect baseline** for analytics. Since the skeleton produces identical output for identical input, you can isolate variance from every other phase. If gameplay degrades, you know it's LLM drift, not the skeleton. This is quality control most ad generation tools lack.
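The determinism property is easy to demonstrate with a standalone sketch. The `{{slot}}` syntax, the slot names, and both helper functions below are assumptions; the article does not show the real interpolator or skeleton format.

```javascript
// Hypothetical slot syntax: {{slotName}}. The real skeleton format is not
// shown in the article, so this is only one plausible shape.
function interpolate(skeleton, slots) {
  return skeleton.replace(/\{\{(\w+)\}\}/g, (match, name) => {
    if (!Object.prototype.hasOwnProperty.call(slots, name)) {
      throw new Error(`unfilled slot: ${name}`);
    }
    return String(slots[name]);
  });
}

// Maps a plan object to slot values (field names are illustrative).
function planToSkeletonSlots(plan) {
  return {
    title: plan.title,
    levelCount: plan.levels.length,
    ctaText: plan.cta,
  };
}

const demoSkeleton =
  'const GAME = { title: "{{title}}", levels: {{levelCount}}, cta: "{{ctaText}}" };';
const demoPlan = { title: "Space Blocks", levels: ["a", "b", "c"], cta: "Play Now" };

// Same input twice: a pure function must return byte-identical output.
const firstPass = interpolate(demoSkeleton, planToSkeletonSlots(demoPlan));
const secondPass = interpolate(demoSkeleton, planToSkeletonSlots(demoPlan));
console.log(firstPass);
console.log(firstPass === secondPass); // true
```

Failing loudly on an unfilled slot keeps the baseline trustworthy: a skeleton with leftover placeholders would otherwise surface later as a misattributed gameplay or sandbox failure.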

R2 and KV as Analytics Storage

- **R2:** Stores every generated artifact: HTML, sandbox reports, and the audit trail. Over time this becomes a historical creative library for trend analysis, answering questions such as which genres had the highest sandbox pass rates.

- **KV:** Holds fast-lookup run statuses, rollout snapshots, and template version pointers. It powers real-time dashboards of quality gate pass rates.

Neither was designed for marketing analytics. Both are excellent for it.
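A rough sketch of how a run might be persisted to both stores. The binding names `ARTIFACTS` and `RUN_STATUS`, the key layout, and the run fields are all assumptions. The `put` calls follow the shape of the Workers R2 and KV binding APIs, and the in-memory stand-in exists only so the sketch can run outside Cloudflare.

```javascript
// Binding names (ARTIFACTS for R2, RUN_STATUS for KV), key layout, and run
// fields are hypothetical. In a Worker or Pages Function, `env` is provided
// by the runtime.
async function persistRun(env, run) {
  const artifactKey = `runs/${run.id}/ad.html`;

  // R2: durable storage for the generated artifact itself.
  await env.ARTIFACTS.put(artifactKey, run.html, {
    httpMetadata: { contentType: "text/html" },
    customMetadata: { templateVersion: run.templateVersion },
  });

  // KV: small status record for fast dashboard lookups.
  await env.RUN_STATUS.put(
    `run:${run.id}`,
    JSON.stringify({ status: run.status, score: run.score, artifactKey })
  );

  return artifactKey;
}

// Simplified in-memory stand-in for local experiments. It ignores put
// options, and its get() returns the raw value, unlike a real R2 get(),
// which returns an object body.
function createMemoryEnv() {
  const makeStore = () => {
    const data = new Map();
    return {
      put: async (key, value) => { data.set(key, value); },
      get: async (key) => (data.has(key) ? data.get(key) : null),
    };
  };
  return { ARTIFACTS: makeStore(), RUN_STATUS: makeStore() };
}
```

Storing the artifact key inside the KV record lets a dashboard jump from a status row to the stored HTML without listing the bucket.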

Results & Metrics

Since deployment:

- **Regression detection:** The Evaluation Layer caught prompt degradation in the gameplay phase within 2 hours - before ads reached ad networks. Champion/challenger comparison showed a 12-point score drop.

- **Bottleneck identification:** Funnel analysis revealed the polish phase as the weakest link, with a pass rate 22% lower than the gameplay phase's. This directly informed prompt engineering investment.

- **Sandbox-to-performance correlation:** Ads scoring 90+ on sandbox runtime showed 35% lower early-exit rates in ad network trials vs. ads scoring 75 or below.

- **Token savings:** The deterministic skeleton cut token cost by ~15% while also providing the analytical baseline.
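The champion/challenger comparison behind the regression result can be sketched as a simple mean-difference check. The 5-point threshold, the function names, and the sample scores are assumptions; the article does not describe the actual comparison method.

```javascript
// Arithmetic mean of a non-empty list of scores.
function meanScore(scores) {
  if (scores.length === 0) throw new Error("no scores to average");
  return scores.reduce((sum, value) => sum + value, 0) / scores.length;
}

// Flags a regression when the challenger's mean score for a phase falls more
// than `threshold` points below the champion's. The threshold default is a
// hypothetical choice, not a value from the article.
function detectRegression(championScores, challengerScores, threshold = 5) {
  const drop = meanScore(championScores) - meanScore(challengerScores);
  return { drop, regressed: drop > threshold };
}

// Invented gameplay-phase scores for two prompt versions.
const gameplayCheck = detectRegression([92, 94, 93], [80, 82, 81]);
console.log(gameplayCheck); // { drop: 12, regressed: true }
```

A production version would likely want a minimum sample size and a variance-aware test before blocking a rollout, since a handful of runs can swing a mean by several points.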

The system runs 95+ tests across 15 phases, validating both generation and evaluation simultaneously.

Key Takeaways

1. Instrumentation is a first-class architectural concern

Design phase contracts, snapshot schemas, and evaluation layers from day one to produce structured, versioned, queryable output. The architecture itself becomes the analytics platform.

2. Bounded contexts prevent analytical drift

Keeping the Evaluation Layer separate from the Execution Layer ensures scoring and rollout logic stays stable as prompts and templates change. If your measurement framework drifts with every prompt update, your trends are meaningless.

3. Compliance gates are valuable marketing metrics

The MRAID validator, sandbox checker, and bundle size gate are developer tools, yet the data they produce correlates directly with ad network performance. A passing bundle tends to run on more devices, load faster, and convert better. The marketing value is inherent, not retrofitted.

4. BYOK is a trust layer for analytics

With Bring Your Own Key, every user's generation data stays within their API key scope. Segmenting analytics by client, campaign, or team is trivial - no multi-tenant infrastructure needed. The analytics loop is naturally per-account.

The Loop Closes

PlayableAd Studio's multi-phase pipeline does not just generate playable ads. It generates evidence, scores, baselines, and trends - the raw material of a marketing analytics engine. The architecture note says it best: **Template defines what good looks like; orchestrator moves artifacts through stations; validator decides if the artifact can proceed; sandbox is the crash test; repair is the limited workshop; evaluation decides which version lives.**

That last part - evaluation decides which version lives - is where developer infrastructure and marketing analytics converge. Both the engineer and the marketer want the same thing: to know which creative version is worth investing in. The architecture gives them a shared language to figure it out.