
Stop Double-Paying: Caching Strategies for AI APIs

A practical caching playbook for devs: request deduping, semantic caching, TTL tuning, and lightweight benchmarks to keep AI API spend under control.

By Rev.AISomething


Caching keeps latency stable and spend predictable. This post focuses on request deduping, semantic caching, TTL tuning, and a quick benchmark you can run to prove savings.

Scope:

  • Request deduping to avoid double-charging for identical prompts
  • Semantic caching when inputs are similar but not identical
  • TTL tuning that balances freshness, accuracy, and spend
  • A quick benchmarking loop to prove the savings

Request Deduping: Stop Paying Twice

Retries and double-submits often mean paying twice for the same prompt. Hash the prompt, cache the response under a short TTL, and serve repeats from the cache.

// app/api/complete/route.ts
import crypto from "node:crypto";
import { NextResponse } from "next/server";
// Replace with your KV of choice (Redis/Upstash). Using a Map here for clarity.
const inMemory = new Map<string, { expiresAt: number; value: Record<string, unknown> }>();

const ttlMs = 60_000; // Cache identical prompts for 1 minute to absorb retries.

async function getFromKV(key: string) {
  const cached = inMemory.get(key);
  if (!cached) return null;
  const isExpired = cached.expiresAt < Date.now();
  if (isExpired) {
    inMemory.delete(key);
    return null;
  }
  return cached.value;
}

async function setInKV(key: string, value: Record<string, unknown>) {
  inMemory.set(key, { value, expiresAt: Date.now() + ttlMs });
}

export async function POST(req: Request) {
  const body = await req.json();
  const prompt = body?.prompt ?? "";
  const key = crypto.createHash("sha256").update(prompt).digest("hex");

  // Fast path: serve the cached response if it exists.
  const cached = await getFromKV(key);
  if (cached) {
    return NextResponse.json({ source: "cache", ...cached });
  }

  // Call your AI provider once; retries will reuse the cached result.
  const completion = await callModel(prompt); // Replace with your provider client.

  await setInKV(key, { completion });
  return NextResponse.json({ source: "live", completion });
}

async function callModel(prompt: string) {
  // Minimal placeholder to show shape; plug in OpenAI/Anthropic/etc.
  return { text: `echo: ${prompt}` };
}

Implementation notes for devs:

  • Use a shared KV (Redis/Upstash/Supabase) instead of in-process Maps once you scale past one instance.
  • Include the model name, temperature, and system prompt in the hash to avoid serving mismatched outputs (a composite-key sketch follows this list).
  • For streaming completions, cache the final text plus token usage so your FinOps view stays accurate.
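One way to build that composite key, shown as a sketch; the field names are illustrative, so adjust them to whatever your provider client actually takes:

// Hash every input that changes the output, not just the user prompt.
import crypto from "node:crypto";

type CacheKeyInput = {
  model: string;
  temperature: number;
  systemPrompt: string;
  prompt: string;
};

function buildCacheKey(input: CacheKeyInput): string {
  // Re-serialize in a fixed key order so the same inputs always hash identically.
  const canonical = JSON.stringify({
    model: input.model,
    temperature: input.temperature,
    systemPrompt: input.systemPrompt,
    prompt: input.prompt,
  });
  return crypto.createHash("sha256").update(canonical).digest("hex");
}

Swap it into the route handler above in place of the bare prompt hash, and a model or temperature change can never serve a mismatched cached response.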

Semantic Caching: Near-Match Wins

When prompts differ slightly (“write a summary about X” vs “summarize X quickly”), exact hashes miss. A semantic cache stores embeddings of requests and reuses the closest hit when it is both similar and fresh enough.

// Lightweight in-memory store; swap for pgvector/SQLite vec in production.
type SemanticEntry = { key: string; embedding: number[]; response: string; ts: number };
const semanticStore: SemanticEntry[] = [];

// Basic cosine similarity helper for small vectors.
function cosineSimilarity(a: number[], b: number[]) {
  const dot = a.reduce((sum, value, i) => sum + value * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, value) => sum + value * value, 0));
  const normB = Math.sqrt(b.reduce((sum, value) => sum + value * value, 0));
  return dot / (normA * normB);
}

async function semanticLookup(query: string, embeddingFn: (t: string) => Promise<number[]>) {
  const queryEmbedding = await embeddingFn(query);
  let best: SemanticEntry | null = null;
  let bestScore = 0;

  for (const entry of semanticStore) {
    // Skip expired entries, then keep the most similar fresh hit.
    const isFresh = Date.now() - entry.ts < 5 * 60_000; // 5-minute semantic TTL.
    if (!isFresh) continue;
    const score = cosineSimilarity(queryEmbedding, entry.embedding);
    if (!best || score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }

  if (best && bestScore > 0.92) {
    return { hit: true, response: best.response };
  }
  return { hit: false };
}

Practical guardrails:

  • Keep semantic TTLs short; model knowledge changes fast.
  • Use higher similarity thresholds (0.9+). Drop to 0.85 only when latency is critical.
  • Cap store size and evict LRU to avoid memory blowups on small servers (a write-with-eviction sketch follows this list).
  • If you have pgvector or SQLite vec, store embeddings there to survive restarts.
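The lookup above only reads; on a miss you also need to write the new entry, and capping the array keeps memory bounded. A minimal sketch, using FIFO eviction as a simple stand-in for true LRU (the 500-entry cap is arbitrary):

const MAX_SEMANTIC_ENTRIES = 500; // Arbitrary cap; size it to your memory budget.

function semanticWrite(key: string, embedding: number[], response: string) {
  // Entries arrive in time order, so dropping the head evicts the oldest (FIFO, not true LRU).
  if (semanticStore.length >= MAX_SEMANTIC_ENTRIES) {
    semanticStore.shift();
  }
  semanticStore.push({ key, embedding, response, ts: Date.now() });
}

Call it right after a live model response, alongside the exact-hash write, so the next near-duplicate prompt has something to hit.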

TTL Tuning That Balances Freshness and Spend

  • Live user actions (chat, form fill): 30–120 seconds. Enough to absorb retries and double-submits.
  • Shared knowledge bases that change slowly: 15–60 minutes for exact hashes; 5–10 minutes for semantic.
  • Model selection responses (tool choice, routing): 5–15 minutes so routing stabilizes under bursty traffic.
  • Embedding caches: 7–30 days if the source text is immutable; invalidate on document updates.
  • Always log cache age on hits so you can spot stale responses before users do.

When in doubt, start with conservative TTLs and lengthen them only after monitoring hit ratios and correctness.
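If it helps to keep those defaults in one place, a small config table works; the values below sit at the conservative end of the ranges above, and the names are illustrative:

// Central TTL table in milliseconds; lengthen per workload after watching hit ratios.
const CACHE_TTLS_MS = {
  liveUserActions: 60_000,          // 30–120s band: absorbs retries and double-submits.
  knowledgeBaseExact: 15 * 60_000,  // Exact-hash hits on slow-changing shared content.
  knowledgeBaseSemantic: 5 * 60_000,
  routingDecisions: 5 * 60_000,     // Keeps model routing stable under bursty traffic.
  embeddings: 7 * 24 * 60 * 60_000, // Immutable source text; invalidate on document updates.
} as const;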


Benchmarking: Prove the Cache Pays for Itself

Baseline before and after caching:

  1. Choose 50–100 representative prompts; include near-duplicates to test semantic reuse.
  2. Call the AI provider with caching disabled; record p50/p95 latency and total tokens.
  3. Enable dedupe + semantic cache; repeat the run.
  4. Compare: cache hit rate, token reduction, latency savings, and any incorrect reuse.
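A minimal harness for steps 2–4 could look like the sketch below; the call argument stands in for your cached or uncached path, and the token accounting is deliberately simplified:

// Run the same prompt set twice (cache disabled, then enabled) and compare the numbers.
type RunResult = { latenciesMs: number[]; tokensUsed: number; hits: number };

async function runBenchmark(
  prompts: string[],
  call: (prompt: string) => Promise<{ tokens: number; fromCache: boolean }>
): Promise<RunResult> {
  const result: RunResult = { latenciesMs: [], tokensUsed: 0, hits: 0 };
  for (const prompt of prompts) {
    const start = Date.now();
    const res = await call(prompt);
    result.latenciesMs.push(Date.now() - start);
    result.tokensUsed += res.tokens;
    if (res.fromCache) result.hits += 1;
  }
  return result;
}

// Naive percentile helper; fine for 50–100 samples.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank))];
}

Run it once against the direct provider call and once against the cached route, then diff hit rate, percentile latencies, and token totals to produce a summary line like the one below.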

Example metrics to print per run:

requests=100 hits=43 semanticHits=12 miss=57
latency_ms: p50=420 p95=880
tokens_total=182k tokens_saved=61k est_savings_usd=$1.83
incorrect_reuse=0

If incorrect_reuse is non-zero, tighten similarity or shorten semantic TTLs.


Observability: Keep the Feedback Loop Tight

  • Log hit/miss/semantic-hit with request identifiers (no PII) and cache age.
  • Emit token counts for live vs cached responses to show real savings.
  • Alert on rising miss rates; it often signals TTLs are too short or storage is failing.
  • Track eviction counts to spot when capacity is too small for your workload.
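A minimal shape for that per-request record might look like this; the field names are illustrative, so plug it into whatever logger you already use:

// One structured record per request so dashboards can chart hit rate, cache age, and savings.
type CacheLogRecord = {
  requestId: string;         // Correlate with traces; never log the prompt itself.
  outcome: "hit" | "semantic-hit" | "miss";
  cacheAgeMs: number | null; // null on a miss.
  tokensCharged: number;     // 0 when the response was served from cache.
};

function logCacheEvent(record: CacheLogRecord) {
  console.log(JSON.stringify(record));
}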

Ship-It Checklist for Devs

  • Deduplicate retries with a prompt hash and 60–120s TTL in a shared KV.
  • Add semantic caching for high-churn prompts with a 0.9+ similarity floor and short TTL.
  • Log cache age, hit type, and token savings; review weekly.
  • Benchmark before/after to prove the cache pays for itself and adjust thresholds with data.
  • Keep the implementation boring: standard KV + optional vector store, no exotic dependencies needed.
Tags: Caching · Rate Limits · AI Engineering · Cost Control
