For most of the last decade, "add AI to your app" meant signing up for an API, storing an API key as a secret, and paying per-token. Workers AI tilts that calculus by running a curated catalogue of open-source models directly on Cloudflare's GPUs, exposed through a single binding in your Worker, with a meaningful free-tier neuron budget. No API key. No SDK. One model call is one method call.

This is the working guide: what env.AI actually is, which models are worth using today, the pricing model (neurons, not tokens), real text and image generation code from this site's backend, and an honest take on when Workers AI is the right answer vs. when you should reach for OpenRouter, OpenAI, or Anthropic instead.

The binding

[ai]
binding = "AI"

That's the whole wrangler.toml declaration. The runtime then makes env.AI available in every handler — typed, no SDK, no auth header to set. Everything is env.AI.run(modelId, input):

const out = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: 'You are a concise assistant.' },
    { role: 'user',   content: 'Give me three names for a coffee subscription.' },
  ],
});

Model IDs are namespaced by host and publisher: @cf/meta/..., @cf/black-forest-labs/..., @cf/openai/whisper, and so on, with a handful of community models under @hf/.... The full catalogue lives in the Cloudflare AI Models dashboard.
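As a rough sketch of how that call might sit inside a complete handler: the `worker` object name, the `?q=` query parameter, and the JSON response shape below are illustrative assumptions, not taken from this site's backend.

```javascript
// Small JSON response helper (illustrative).
function json(data, status = 200) {
  return new Response(JSON.stringify(data), {
    status,
    headers: { 'content-type': 'application/json' },
  });
}

// Minimal Worker sketch. `env.AI` comes from the [ai] binding in wrangler.toml.
const worker = {
  async fetch(request, env) {
    const question = new URL(request.url).searchParams.get('q');
    if (!question) {
      return json({ error: 'Missing ?q= parameter' }, 400);
    }

    const out = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [
        { role: 'system', content: 'You are a concise assistant.' },
        { role: 'user', content: question },
      ],
    });

    // Text models return the generated text under `response`.
    return json({ answer: out?.response ?? null });
  },
};

// In a real Worker module you would finish with: export default worker;
```

Because the binding is just an object on `env`, the handler can be exercised locally by passing a stub whose `AI.run` returns a canned `{ response }`.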

The catalogue, by use case

Workers AI doesn't try to compete with the closed frontier on raw capability. It curates open-source models that fit on the platform's GPUs and run cheaply per invocation. The ones that matter for a typical product:

Task	Model	Notes
Chat / instruction-following	`@cf/meta/llama-3.1-8b-instruct`, `@cf/meta/llama-3.3-70b-instruct-fp8-fast`	Good for short summaries, classification, structured extraction
Vision (image-in)	`@cf/meta/llama-3.2-11b-vision-instruct`	Image + text in, text out — SAS uses it for App Store screenshot critique
Image generation	`@cf/black-forest-labs/flux-1-schnell`, `@cf/stabilityai/stable-diffusion-xl-base-1.0`	Flux Schnell is fast and Apache-licensed; SDXL when you need fine control
Speech-to-text	`@cf/openai/whisper`, `@cf/openai/whisper-large-v3-turbo`	The classic Whisper, deployed at the edge
Text embeddings	`@cf/baai/bge-base-en-v1.5`, `@cf/baai/bge-large-en-v1.5`	Pair with Vectorize for cheap RAG
Reranking	`@cf/baai/bge-reranker-base`	Re-scores retrieved chunks against the query; the precision half of a RAG pipeline
Code	`@hf/thebloke/deepseek-coder-6.7b-instruct-awq`	Smaller code-tuned model

There are dozens more, and the list moves quickly. Treat the catalogue page as the source of truth; the families above are the stable, production-grade subset.

Pricing: neurons, not tokens

Workers AI bills in neurons, Cloudflare's normalised unit that covers tokens, image steps, audio seconds, and embedding calls under one ledger. Every model has a published neuron rate for its unit of work (per token, per image step, per audio minute). Both the Workers Free and Workers Paid plans include 10,000 neurons/day at no charge; beyond that, the Paid plan bills for the neurons consumed, listed at $0.011 per 1,000 neurons at the time of writing.

In practice, neurons translate roughly like this for everyday usage:

A small Llama chat completion (~500 tokens out): a handful of neurons.
A Flux Schnell 1024×1024 image at 4 steps: a few dozen neurons.
A 1-minute Whisper transcription: low double digits.

For a product that calls the AI binding occasionally (a few hundred invocations per day), Workers AI is usually free. For a product that runs AI in the hot path of every page view, you'll spend — but still less than what the same workload costs against a closed-source API.
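To make the arithmetic concrete, here is a small estimator. The two constants are assumptions (a 10,000-neuron daily allowance and a rate of $0.011 per 1,000 neurons); check Cloudflare's pricing page before relying on them, and treat the per-day neuron figure as something you measure from your own usage.

```javascript
// Assumed rates; verify against the current Workers AI pricing page.
const FREE_NEURONS_PER_DAY = 10_000;
const USD_PER_1000_NEURONS = 0.011;

// Estimate the monthly bill for a steady daily neuron spend.
// The free allowance is applied per day, so quiet days do not offset busy ones.
function estimateMonthlyUsd(neuronsPerDay, daysInMonth = 30) {
  const billablePerDay = Math.max(0, neuronsPerDay - FREE_NEURONS_PER_DAY);
  return (billablePerDay / 1000) * USD_PER_1000_NEURONS * daysInMonth;
}

console.log(estimateMonthlyUsd(8_000).toFixed(2));  // "0.00"  (inside the allowance)
console.log(estimateMonthlyUsd(60_000).toFixed(2)); // "16.50" (50,000 billable neurons/day)
```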

Image generation — the SAS pattern

The Mac app's icon-maker and design-asset features both route through /api/ai/generate-image, which is a Worker that calls Workers AI Flux Schnell and stores the result in R2. Here's the actual function from saas/src/index.js:

async function generateImageViaWorkersAI(env, { prompt, width = 1024, height = 1024 }) {
  const result = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
    prompt: String(prompt).slice(0, 2048),
    num_steps: 4,
    width: Math.max(256, Math.min(parseInt(width) || 1024, 2048)),
    height: Math.max(256, Math.min(parseInt(height) || 1024, 2048)),
  });
 
  // Workers AI Flux returns { image: <base64-encoded PNG bytes> }.
  const b64 = result?.image;
  if (!b64 || typeof b64 !== 'string') {
    throw new Error('Workers AI returned no image');
  }
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

A few practical notes that come from running this in production:

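Flux Schnell is fast: a typical 1024×1024 generation takes 1–3 seconds. That's the whole reason it replaced the previous external image provider (Krea, since deprecated). A real-time UI can call it and show the result without a loading-screen apology.
Four steps is the Flux Schnell sweet spot. Anything higher buys little quality; anything lower starts shedding detail. Check the parameter name against the model's schema, though: Cloudflare's model page documents it as steps (default 4, maximum 8), while num_steps is the name the Stable Diffusion models use.
Output is a base64-encoded image, not raw bytes. Decode before piping to R2, and verify the format before hard-coding a content type: Cloudflare's own Flux Schnell example serves the result as image/jpeg.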
Prompts beyond ~2048 chars are accepted but truncated under the hood — clip them yourself so behaviour is predictable.

Once you have the bytes, the R2 store is a one-liner:

await env.SCREENS.put(filename, imgBytes, {
  httpMetadata: { contentType: 'image/png' },
});

The whole Worker — prompt in, public R2 URL out — fits in 40 lines.
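If you would rather not hard-code the content type, one option is to sniff the decoded bytes before the put. This is a sketch, not the site's code: `sniffImageType` and `storeImage` are hypothetical helper names, and only the PNG and JPEG signatures are checked.

```javascript
// Guess an image content type from its leading "magic" bytes.
// Only PNG and JPEG are recognised; anything else returns null.
function sniffImageType(bytes) {
  const isPng =
    bytes.length >= 4 &&
    bytes[0] === 0x89 && bytes[1] === 0x50 && bytes[2] === 0x4e && bytes[3] === 0x47;
  if (isPng) return 'image/png';

  const isJpeg =
    bytes.length >= 3 &&
    bytes[0] === 0xff && bytes[1] === 0xd8 && bytes[2] === 0xff;
  if (isJpeg) return 'image/jpeg';

  return null;
}

// Hypothetical helper: store bytes in the SCREENS bucket with a sniffed type.
async function storeImage(env, filename, imgBytes) {
  const contentType = sniffImageType(imgBytes) ?? 'application/octet-stream';
  await env.SCREENS.put(filename, imgBytes, { httpMetadata: { contentType } });
  return contentType;
}
```

The same check helps pick a file extension, so the object key and its served content type stay in agreement.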

Vision — image-in, text-out

SAS uses the Llama 3.2 11B vision model in the "AI App Store critique" endpoint: feed it a screenshot of an App Store listing, get back a critique focused on subtitle, screenshot order, and conversion concerns. The runtime call:

const aiResp = await env.AI.run('@cf/meta/llama-3.2-11b-vision-instruct', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: prompt },
        { type: 'image_url', image_url: { url: dataUrl } },
      ],
    },
  ],
  max_tokens: Math.min(parseInt(body.max_tokens) || 1024, 2048),
});
const content = aiResp?.response || aiResp?.result?.response || null;

The image can be a data: URL or an https:// URL the model can fetch. For a Worker that already has the bytes in memory, embedding them as data:image/png;base64,... is the simplest path and avoids any second round-trip.
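A minimal sketch of building that data: URL from bytes already in memory. `toDataUrl` is a hypothetical helper name; the chunking is there because spreading a very large array into `String.fromCharCode` can exceed the engine's argument limit.

```javascript
// Encode raw image bytes as a data: URL for the image_url field.
// The binary string is built in chunks so large images stay within
// the argument limit of String.fromCharCode.
function toDataUrl(bytes, mimeType = 'image/png') {
  const CHUNK = 0x8000;
  let binary = '';
  for (let i = 0; i < bytes.length; i += CHUNK) {
    binary += String.fromCharCode(...bytes.subarray(i, i + CHUNK));
  }
  return `data:${mimeType};base64,${btoa(binary)}`;
}

console.log(toDataUrl(new Uint8Array([72, 105]))); // data:image/png;base64,SGk=
```

Base64 inflates the payload by roughly a third, so it is worth downscaling very large screenshots before embedding them.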

Vision quality on the open-source side is roughly "good enough for descriptive critique, classification, and OCR-style extraction" and explicitly not good enough for tasks that require frontier-level reasoning over an image. Calibrate expectations — and have a fallback path for when the model returns nothing usable.

The graceful-fallback pattern

A real production endpoint should treat Workers AI as the cheap first try and fall back to a stronger external model on failure or low-confidence output. SAS uses this exact shape for vision:

let content = null;
let provider = null;
try {
  const aiResp = await env.AI.run('@cf/meta/llama-3.2-11b-vision-instruct', { /* ... */ });
  content = aiResp?.response || null;
  if (typeof content === 'string' && content.trim()) {
    provider = 'workers-ai:llama-3.2-11b-vision';
  } else {
    content = null;
  }
} catch (e) {
  // Workers AI unavailable or over quota — fall through to OpenRouter
}
 
if (!content) {
  const orResp = await fetch('https://openrouter.ai/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${env.OPENROUTER_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ /* OpenRouter payload */ }),
  });
  // ...
}

The first call costs nothing (within free-tier neurons) and succeeds 90%+ of the time for the typical input. The fallback covers the 5–10% where Workers AI either errors, returns an empty response, or trips on something outside the open model's competence — at the price of an external API call, but only for those cases.

This is the single pattern that has saved us the most on AI spend, with no impact on UX quality. Pair it with a per-route cache (KV or D1) on the output and the bill shrinks again.
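The per-route cache can be sketched as a read-through wrapper around whichever call produces the text. This assumes a Workers KV namespace binding; the helper name `cachedText`, the key format, and the TTL are illustrative choices rather than this site's actual code.

```javascript
// Read-through cache sketch. `kv` is a Workers KV namespace binding;
// `produce` is the expensive call (Workers AI, or the external fallback).
async function cachedText(kv, key, ttlSeconds, produce) {
  const hit = await kv.get(key);
  if (hit !== null) return { text: hit, cached: true };

  const text = await produce();
  // Only cache usable output so a bad generation is retried next time.
  if (typeof text === 'string' && text.trim()) {
    // KV requires expirationTtl to be at least 60 seconds.
    await kv.put(key, text, { expirationTtl: ttlSeconds });
  }
  return { text, cached: false };
}
```

A call site might look like `cachedText(env.AI_CACHE, 'critique:' + imageHash, 86400, runCritique)`, where `AI_CACHE`, `imageHash`, and `runCritique` are all hypothetical names. Deriving the key from a hash of the input keeps identical requests from paying twice.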

Streaming responses

Long generations should stream. Workers AI supports streaming via { stream: true }:

const stream = await env.AI.run(
  '@cf/meta/llama-3.1-8b-instruct',
  {
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  },
);
return new Response(stream, {
  headers: { 'content-type': 'text/event-stream' },
});

The returned object is a ReadableStream<Uint8Array> of SSE-formatted bytes, the same wire format the browser EventSource API parses. EventSource only issues GET requests, so a chat UI that POSTs its prompt reads the same stream through fetch instead. For chat-shaped UIs, streaming is what turns "the page is frozen" into "the model is typing back," and it's free.
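On the client, a fetch-based reader needs to split the byte stream back into events. The sketch below assumes each event is a single `data: {...}` line whose JSON carries the next piece of text under `response`, and that a literal `data: [DONE]` marks the end; confirm both against the output of the model you stream from. `createSseTextParser` and `streamText` are hypothetical names.

```javascript
// Incremental parser for SSE text. Chunks can split an event in half,
// so incomplete data is buffered until its blank-line terminator arrives.
function createSseTextParser() {
  let buffer = '';
  return function push(chunk) {
    buffer += chunk;
    const events = buffer.split('\n\n');
    buffer = events.pop(); // last piece may be an incomplete event
    let text = '';
    let done = false;
    for (const event of events) {
      const line = event.trim();
      if (!line.startsWith('data:')) continue;
      const payload = line.slice(5).trim();
      if (payload === '[DONE]') { done = true; break; }
      try {
        text += JSON.parse(payload).response ?? '';
      } catch {
        // Ignore events whose payload is not JSON.
      }
    }
    return { text, done };
  };
}

// Read a streaming fetch Response and hand each text fragment to onText.
async function streamText(response, onText) {
  const push = createSseTextParser();
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) return;
    const result = push(decoder.decode(value, { stream: true }));
    if (result.text) onText(result.text);
    if (result.done) return;
  }
}
```

In a chat UI, `onText` would typically append the fragment to the message element currently being rendered.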

Embeddings + Vectorize = RAG without a separate database

The "right" Cloudflare RAG stack is:

Embed documents with @cf/baai/bge-base-en-v1.5 (768-dim vector per chunk).
Store the vector in Vectorize, a managed vector database that ships in the same Workers binding model as R2/D1.
Query by embedding the user's question with the same model and asking Vectorize for the top K matches.
Optionally rerank the top K with @cf/baai/bge-reranker-base before feeding the winners to the chat model.

All four steps run inside one Worker handler, with no external HTTP. For most "let me ask questions over my docs" use cases that's the entire stack — no Pinecone account, no Weaviate VM, no LangChain abstraction.
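A compressed sketch of the query side of that stack, with the optional rerank step left out. The assumptions: `env.DOCS_INDEX` is a Vectorize binding (hypothetical name), each vector was inserted with its source text under `metadata.text`, and the embedding model returns its vectors under `data`.

```javascript
// Retrieval sketch: embed the question, query Vectorize, answer from the matches.
async function answerFromDocs(env, question, topK = 5) {
  // Embed the question with the same model used for the documents.
  const embedded = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [question] });
  const queryVector = embedded.data[0];

  // Nearest-neighbour search in Vectorize.
  const { matches } = await env.DOCS_INDEX.query(queryVector, {
    topK,
    returnMetadata: 'all',
  });
  const context = matches
    .map((m) => m.metadata?.text ?? '')
    .filter(Boolean)
    .join('\n---\n');

  // Ask the chat model to answer from the retrieved context only.
  const out = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    messages: [
      { role: 'system', content: `Answer using only this context:\n${context}` },
      { role: 'user', content: question },
    ],
  });
  return { answer: out?.response ?? null, sources: matches.map((m) => m.id) };
}
```

Returning the matched IDs alongside the answer makes it easy to show citations, or to log which chunks were retrieved when an answer looks wrong.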

Where Workers AI honestly loses

Be straightforward about the gaps:

Frontier-grade reasoning. Llama 3.1 8B and 70B are competent generalists; they are not Claude Opus 4.7 or GPT-5. For anything that needs deep chain-of-thought, tool use, or long-context reasoning, route through OpenRouter / Anthropic / OpenAI.
Function calling. Some Workers AI models support a JSON-mode-ish constraint, but the closed APIs have richer, more reliable tool-calling.
Audio out / TTS. The catalogue is light on TTS. SAS uses ElevenLabs for paid TTS and a local Kokoro Python subprocess in the Mac app for the free path.
Image quality at the high end. Flux Schnell and SDXL are great. Midjourney v6 / Stable Diffusion 3 / Imagen 3 are still better for the demanding cases.
Cold-start latency on niche models. The popular models (Llama 3.x, Flux Schnell) are always warm. A rarely-used model can take a noticeable first-token delay.
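For the cold-start case in particular, one common mitigation is to put a deadline on the Workers AI call so a slow first token drops into the fallback path instead of stalling the request. A minimal sketch follows; `withTimeout` is a hypothetical helper and any deadline value is a guess to tune against your own latency data.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
// Note: this stops waiting; it does not cancel the underlying model call.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Clear the timer either way so it cannot fire after the race is decided.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

Wrapping the call as `await withTimeout(env.AI.run(model, input), 8000, model)` inside an existing try block means a timeout is handled by the same catch as any other Workers AI failure.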

Privacy and data residency

Workers AI invocations are not logged or used for model training by Cloudflare. The prompts and outputs flow through the AI binding and disappear; there's no cross-customer leakage and no "your data improves the model" small print. This matters for products handling user content — it's the same default you'd get from running on your own GPU, without the GPU.

Pricing scenarios at a glance

Scenario	Daily calls	Mostly	Likely cost
Hobby project	100	Free tier neurons	$0
Indie SaaS, light AI	10k	A few hundred image gens + chat	$0–5/mo
Indie SaaS, AI in hot path	200k	Chat + vision per page view	$20–80/mo
Production app, heavy use	5M	Streaming chat + RAG + image	Several hundred $/mo

Compare to the same workloads on closed APIs (OpenAI / Anthropic) and Workers AI is usually 3–10× cheaper. The trade-off is the model ceiling — if you genuinely need Claude Opus or GPT-5, no amount of neuron pricing fixes that.

The pros and cons cheat sheet

Pros

No external API key. The binding is the integration.
Edge inference. Same latency profile as your Worker — no transcontinental call to an inference provider.
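Open-source catalogue. Llama, Flux, Whisper, BGE, Stable Diffusion: all reputable open-weight models. Licences vary (Flux Schnell is Apache 2.0; Llama ships under Meta's community licence), so check the terms for your use.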
Free-tier neurons that comfortably cover a side project.
No training on your prompts. Privacy posture is "your bytes, your bytes."

Cons

Not frontier capability. Best of open-source ≠ best of closed-source.
Smaller catalogue. Anything genuinely new shows up here later than on the big labs' own APIs.
Token vs neuron pricing translation requires a calculator.
Cold-start on rarely-used models.
Audio/TTS surface is thin. Plan around that.

When to reach for Workers AI

Use Workers AI when any of the following is true:

The task is a competent-but-not-genius open-source model fit — short summaries, classification, image gen, vision OCR-ish tasks, embeddings, transcription.
You want zero key-management surface — no rotating an external secret, no Vault.
You want the AI call to live in the same trust/latency boundary as the rest of the Worker.
Cost matters and the workload is high-volume but tolerant of "good enough" model quality.

Reach for OpenRouter / Anthropic / OpenAI when any of the following is true:

You need frontier reasoning, deep tool use, or very long context.
The customer expects "the best model" as a brand promise.
You're already paying for it for other reasons and the routing complexity isn't worth saving a few cents per call.

Most production stacks end up using both — Workers AI for the 90% of calls where it's enough, an external API for the 10% where it isn't, with a per-route cache in front of both. That's what this site's backend does, and it's the cheapest, fastest, and most reliable AI architecture we've shipped.

One piece left: wiring it all together

You've now seen the full Cloudflare storage-and-compute toolkit through working production code: Workers as the runtime, R2 for bytes, D1 for relational state, KV for read-cached edge state, Durable Objects for strong consistency and real-time, and Workers AI for inference. Picking the right tool for the right slot of your architecture is most of what "Cloudflare expertise" actually means — and you now have a working mental map for every one of them.

But notice how every one of those tools reached your code the same way: a binding in wrangler.toml, surfaced as env.SOMETHING. The final chapter is about that wiring itself — Ch 7: wrangler.toml & .env.local maps the four places your config can live, why secrets never touch your repo, the NEXT_PUBLIC_ trap that ships keys to every browser, and the build-time-vs-runtime gotcha that makes a value "undefined only in production." If you'd rather see all six tools work together first, the rest of simpleappshipper.com's source code is open: the backend Worker handles auth, billing, video gating, AI generation, and CI webhooks in one large route surface, and the patterns from these chapters are the only ones it uses.


Workers AI: Free Inference at the Edge