· 7 min read

Building a CLI with RAG on Cloudflare AI

A walkthrough of the AI layer behind the terminal on this site

This site has a command line interface. Type anything and it answers questions about my work in first person, routes navigation commands, and links to relevant pages. The AI runs entirely on Cloudflare’s free tier. No external APIs, no bill.

This is a walkthrough of how the AI layer works — and enough code to build your own.

What we’re building

A Cloudflare Pages Function that receives a text input, decides what to do with it, and returns a structured response. The AI handles two jobs: classifying navigation intent, and answering freeform questions grounded in your own data.

The approach is called RAG — Retrieval Augmented Generation. Instead of relying on the model’s training data, you retrieve your own content and inject it into the prompt. The model answers from what you gave it, not from what it guesses.

Setting up Cloudflare AI

Cloudflare Workers AI gives you access to open-source models through a simple binding. Add this to your wrangler.toml:

[[ai]]
binding = "AI"

That’s the entire setup. context.env.AI is now available inside any Pages Function as a typed object with a .run(modelId, { messages }) method. Supported models include Llama, Gemma, and Mistral. The free tier covers 10,000 inference requests per day — plenty for a portfolio.

If you’re deploying via the Cloudflare dashboard instead, add the AI binding under Settings → Functions → AI Bindings.
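To sanity-check the binding before building anything real, a throwaway Pages Function can call a model directly. This is a minimal sketch — the file path and prompt are placeholders, and the model ID is one of the Llama variants used later in this post:

// functions/api/hello.ts — verify the AI binding responds
export async function onRequestGet(context: any) {
  const result = await context.env.AI.run("@cf/meta/llama-3.2-3b-instruct", {
    messages: [{ role: "user", content: "Say hello in five words." }],
  });
  return new Response(JSON.stringify({ text: result.response }), {
    headers: { "Content-Type": "application/json" },
  });
}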

Generating the context file at build

The AI needs to know about your content. Rather than querying content collections or reading the entire site at runtime, we generate a static JSON file at build time. This keeps the Pages Function fast and simple — it just fetches one file.

Add this integration to astro.config.mjs:

import { defineConfig } from 'astro/config';
import fs from 'node:fs';
import path from 'node:path';

function exportTerminalData() {
  function writeTerminalData(outputPath) {
    // Read your projects
    const projectsDir = path.join(process.cwd(), 'src/content/projects');
    const projects = fs.readdirSync(projectsDir)
      .filter(f => f.endsWith('.json'))
      .map(f => {
        const data = JSON.parse(fs.readFileSync(path.join(projectsDir, f), 'utf-8'));
        return {
          slug:        f.replace('.json', ''),
          title:       data.title,
          description: data.description ?? null,
          summary:     data.summary ?? null,
        };
      });

    // Read your writing
    const writingDir = path.join(process.cwd(), 'src/content/writing');
    const posts = fs.readdirSync(writingDir)
      .filter(f => f.endsWith('.mdx'))
      .map(f => {
        const raw  = fs.readFileSync(path.join(writingDir, f), 'utf-8');
        const match = raw.match(/title:\s*["']?(.+?)["']?\n/);
        return { slug: f.replace('.mdx', ''), title: match?.[1] ?? f };
      });

    fs.writeFileSync(outputPath, JSON.stringify({ projects, posts }, null, 2));
  }

  return {
    name: 'export-terminal-data',
    hooks: {
      // Write to public/ during dev so the dev server can serve it
      'astro:server:start'() {
        writeTerminalData(path.join(process.cwd(), 'public/terminal-data.json'));
      },
      // Write to dist/ during build so Cloudflare Pages serves it
      'astro:build:done'({ dir }) {
        writeTerminalData(path.join(dir.pathname, 'terminal-data.json'));
      }
    }
  };
}

export default defineConfig({
  integrations: [exportTerminalData(), /* your other integrations */],
});

The integration hooks into two Astro lifecycle events so the file exists in both dev and production. Add public/terminal-data.json to .gitignore — it’s generated, not source.
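With a couple of entries, the generated file looks roughly like this (the slugs and titles are illustrative):

{
  "projects": [
    {
      "slug": "terminal",
      "title": "Terminal CLI",
      "description": "A command line interface for this site",
      "summary": null
    }
  ],
  "posts": [
    { "slug": "cli-rag", "title": "Building a CLI with RAG on Cloudflare AI" }
  ]
}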

The Pages Function

Create functions/api/terminal.ts. Cloudflare maps this file to /api/terminal automatically — no routing config needed.

export async function onRequestPost(context: any) {
  const { input } = await context.request.json();
  const cmd = (input ?? "").trim().toLowerCase();
  if (!cmd) return json({ lines: [] });

  // Load the generated context file
  const url  = new URL('/terminal-data.json', context.request.url);
  const data = await fetch(url.toString()).then(r => r.json());

  // 1. Exact command match — no AI needed
  const handler = commands(data)[cmd];
  if (handler) return json({ lines: handler() });

  // 2. Intent classification — small fast model
  const intent = await detectIntent(context.env.AI, cmd);
  const aiHandler = commands(data)[intent];
  if (aiHandler) return json({ lines: aiHandler() });

  // 3. Grounded answer — large model with full context
  const lines = await generateAnswer(context.env.AI, cmd, data);
  return json({ lines });
}

The waterfall keeps things efficient: known commands are instant, ambiguous navigation goes through a cheap classifier, and only genuine questions hit the larger model.
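The handler leans on two helpers that aren't shown above: json() wraps a payload as a JSON Response, and commands() maps command names to functions that return line objects. A minimal sketch of both; the exact lines and link paths are up to you:

// Wrap a payload as a JSON Response
function json(payload: unknown) {
  return new Response(JSON.stringify(payload), {
    headers: { "Content-Type": "application/json" },
  });
}

// Map known command names to functions that return line objects.
// The data argument is the parsed terminal-data.json.
function commands(data: any): Record<string, () => any[]> {
  return {
    help: () => [{ kind: "text", text: "Commands: about, work, writing, contact, help" }],
    work: () => data.projects.map((p: any) => ({
      kind: "link",
      text: p.title,
      href: `/work/${p.slug}`, // illustrative path
    })),
    // about, writing, contact follow the same shape
  };
}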

Intent routing

Not every input needs a large model. The classifier’s job is to map natural language to known commands — or return unknown if it’s a real question.

async function detectIntent(ai: any, input: string): Promise<string> {
  const response = await ai.run("@cf/meta/llama-3.2-3b-instruct", {
    messages: [
      {
        role: "system",
        content: `You are a command router for a portfolio CLI.
Available commands: about, work, writing, contact, help.

Return a command name only for clear navigation requests:
"show me your projects" → work
"tell me about yourself" → about
"how do I contact you" → contact

For any factual question, reply with exactly: unknown
One word only.`,
      },
      { role: "user", content: input },
    ],
  });

  return (response.response ?? "unknown").trim().toLowerCase();
}

Use a small model here — llama-3.2-3b-instruct is fast and accurate enough for classification. The explicit examples in the prompt prevent it from confusing similar categories. The key rule: anything that needs a real answer should return unknown, not a command.

Generating the answer — RAG in practice

For freeform questions, the model needs to answer from facts, not from training data. The pattern is RAG — Retrieval Augmented Generation: retrieve your own content, inject it into the prompt, let the model answer from what you gave it.

Start by formatting your data into a text briefing:

function formatContext(data: any): string {
  const projects = data.projects
    .map((p: any) => `Project: ${p.title}\n${p.summary ?? p.description ?? ""}`)
    .join("\n\n");

  const posts = data.posts
    .map((p: any) => `Post: ${p.title}`)
    .join("\n");

  return `=== PROJECTS ===\n${projects}\n\n=== WRITING ===\n${posts}`;
}

Then pass it as context to the answer model:

async function generateAnswer(ai: any, input: string, data: any) {
  const context = formatContext(data);

  const response = await ai.run("@cf/meta/llama-3.3-70b-instruct-fp8-fast", {
    messages: [
      {
        role: "system",
        content: `You are a CLI assistant on a portfolio website.
Answer questions in first person. Be concise. Plain text only. Under 5 lines.
Use only the information provided below.
Do not invent or infer beyond what is explicitly stated.
If unanswerable, reply: I don't have that information.

${context}`,
      },
      { role: "user", content: input },
    ],
  });

  const text = (response.response ?? "").trim();
  return text.split("\n").filter(Boolean).map(t => ({ kind: "text", text: t }));
}

Classic RAG uses vector databases and embedding similarity to retrieve relevant chunks from a large corpus. When your dataset fits in the context window — a few kilobytes — you can skip all of that and inject everything. The model uses what’s relevant and ignores the rest.

The constraint "Use only the information provided below" is the most important line in the prompt. Without it, the model fills gaps with plausible fiction. With it, the worst case is an honest "I don't know."

Two models, two jobs: the 3B classifies (fast and cheap), the 70B answers (accurate and grounded). llama-3.3-70b-instruct-fp8-fast gives near-full-precision quality at lower memory cost.

Rendering the response

The function returns typed line objects. On the frontend, each kind renders differently:

type Line =
  | { kind: "text"; text: string }
  | { kind: "link"; text: string; href: string }
  | { kind: "error"; text: string }
  | { kind: "blank" };

This keeps the AI layer decoupled from how responses are displayed — the function doesn’t know or care whether it’s a terminal UI, a chat window, or anything else.
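On the rendering side, a switch over kind is all that's needed. This sketch assumes a plain DOM terminal (the class name is illustrative), but the same mapping works in any framework:

function renderLine(line: Line): HTMLElement {
  switch (line.kind) {
    case "link": {
      const a = document.createElement("a");
      a.href = line.href;
      a.textContent = line.text;
      return a;
    }
    case "error": {
      const span = document.createElement("span");
      span.className = "error"; // styling hook, name it whatever you like
      span.textContent = line.text;
      return span;
    }
    case "blank":
      return document.createElement("br");
    default: {
      const span = document.createElement("span");
      span.textContent = line.text;
      return span;
    }
  }
}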


The full pattern scales beyond portfolios. Any time you want a conversational interface over structured data — docs, a product catalog, a knowledge base — the same three pieces apply: a classifier to route intent, a formatter to build context, and a grounded prompt to generate answers. The free tier covers a surprising amount before you need to think about cost.