Engineering Playbooks

Building a Lightweight RAG Stack on a Shoestring

A practical RAG build for devs on a budget: ingest once, store embeddings in SQLite/pg, add a rerank pass, and deploy on Fly or Render with sensible perf defaults.

By Rev.AISomething


Retrieval-augmented generation can be lean. This guide ships a small RAG stack with SQLite or Postgres, open embeddings, a rerank pass, and a low-cost deploy on Fly or Render.

Goal: predictable cost, predictable latency, clear fallbacks.

Scope:

  • Ingestion that chunks text and writes embeddings to SQLite/pg
  • Retrieval that runs a vector search, reranks, and returns top contexts
  • A minimal Next.js handler to serve answers
  • Deployment tips for Fly/Render and perf guardrails
  • Observability, evals, and a runbook so you can keep it healthy

Architecture at a Glance

  • Store: SQLite with sqlite-vec for local/single-instance; Postgres with pgvector for shared instances.
  • Embeddings: Small models (OpenAI text-embedding-3-small or open embeddings API) to keep latency and cost down.
  • Rerank: A light reranker (Cohere rerank, Voyage, or open-source cross-encoder) to improve precision after broad recall.
  • API surface: /api/ask that accepts a question, returns an answer plus cited chunks, and reports token use.
  • Deploy: Fly (shared CPU, small volume) or Render (starter instance). One region, one DB. Scale when hit rates or uptime demand it.
  • Observability: Latency per stage, token counts, cache hit rate, and top queries logged to keep drift visible.

Key Concepts

  • Chunking: Splitting long docs into smaller passages so search can find the right paragraph instead of the whole PDF.
  • Embeddings: Turning text into vectors so similar passages sit near each other in vector space.
  • Reranking: A second pass that reorders the top search hits by semantic fit to the question.
  • Context window budgeting: Sending only the best few chunks to the model to avoid cost and distraction.
  • Feedback loop: Log what was retrieved and how often it helped; adjust chunking, k, and rerank thresholds with data.
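The ingest script below assumes a chunkText helper. Here is a minimal sketch of one, with token counts approximated by word counts (a real tokenizer is a drop-in upgrade); the overlap keeps sentences that straddle a boundary retrievable from both sides:

```typescript
// lib/chunk.ts (illustrative): split text into overlapping word-based chunks.
// "Tokens" are approximated by words here; swap in a real tokenizer if needed.
export function chunkText(text: string, size = 500, overlap = 60): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = Math.max(1, size - overlap); // Guard against overlap >= size.
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // Last window reached the end.
  }
  return chunks;
}
```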

Ingest: Chunk, Embed, Persist

// scripts/ingest.ts
// Chunk text, embed, and store in SQLite with sqlite-vec.
import fs from "node:fs";
import Database from "better-sqlite3";
import * as sqliteVec from "sqlite-vec";
import { embed } from "@/lib/embed"; // Wraps your embedding provider client.
import { chunkText } from "@/lib/chunk"; // Simple splitter (e.g., 500 tokens with overlap).

const db = new Database("rag.db");
sqliteVec.load(db); // Load the sqlite-vec extension before touching the virtual table.
db.exec(`
  CREATE TABLE IF NOT EXISTS documents (
    id TEXT PRIMARY KEY,
    content TEXT NOT NULL,
    source TEXT NOT NULL
  );
  CREATE VIRTUAL TABLE IF NOT EXISTS doc_index USING vec0(
    id TEXT PRIMARY KEY,
    embedding FLOAT[1536]
  );
`);

async function ingest(source: string, text: string) {
  const chunks = chunkText(text, 500, 60); // Overlap to keep context coherent.
  const insertDoc = db.prepare("INSERT OR REPLACE INTO documents (id, content, source) VALUES (?, ?, ?)");
  const insertVec = db.prepare("INSERT OR REPLACE INTO doc_index (id, embedding) VALUES (?, ?)");

  for (const [i, chunk] of chunks.entries()) {
    const id = `${source}#${i}`;
    const embedding = await embed(chunk); // Returns number[] length 1536.
    insertDoc.run(id, chunk, source);
    insertVec.run(id, JSON.stringify(embedding));
  }
}

// Example usage; replace with your corpus loader.
await ingest("handbook", await fs.promises.readFile("handbook.md", "utf8"));

Notes:

  • For Postgres, swap sqlite-vec for pgvector (CREATE EXTENSION vector;) and store embeddings in a vector(1536) column.
  • Keep chunk sizes modest; smaller chunks improve recall but may need a rerank to maintain precision.
  • Batch embeddings in small groups to avoid rate limits and to parallelize safely.
  • Clean input first: strip boilerplate (nav, footers), dedupe near-identical sections, and normalize whitespace. Bad input yields bad retrieval.
  • Keep a source field that links each chunk back to the original doc for citations and debugging.

Postgres schema sketch

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id text PRIMARY KEY,
  content text NOT NULL,
  source text NOT NULL
);

CREATE TABLE doc_index (
  id text PRIMARY KEY REFERENCES documents(id) ON DELETE CASCADE,
  embedding vector(1536) NOT NULL
);

-- Optional: narrow index for faster ANN
CREATE INDEX ON doc_index USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

Query: Retrieve, Rerank, Answer

// app/api/ask/route.ts
import { NextResponse } from "next/server";
import Database from "better-sqlite3";
import * as sqliteVec from "sqlite-vec";
import { embed } from "@/lib/embed";
import { rerank } from "@/lib/rerank"; // Thin wrapper over your rerank provider.
import { answerWithContext } from "@/lib/llm"; // Calls your chat model with context.

const db = new Database("rag.db", { readonly: true });
sqliteVec.load(db); // The vec0 virtual table needs the extension on every connection.

export async function POST(req: Request) {
  const { question } = await req.json();
  if (!question) {
    return NextResponse.json({ error: "Question is required" }, { status: 400 });
  }

  const queryVec = await embed(question);
  // KNN search via sqlite-vec: MATCH takes the query vector (a JSON array string
  // works) and rows come back ordered by distance. Adjust LIMIT to recall needs.
  const rows = db
    .prepare("SELECT id, distance FROM doc_index WHERE embedding MATCH ? ORDER BY distance LIMIT 15")
    .all(JSON.stringify(queryVec));

  // Fetch chunk text and rerank for semantic fit.
  const candidates = rows.map((row) => {
    const doc = db.prepare("SELECT content, source FROM documents WHERE id = ?").get(row.id);
    return { id: row.id, content: doc.content, source: doc.source };
  });

  const reranked = await rerank(question, candidates, { topK: 6 });
  const top = reranked.slice(0, 6);

  const answer = await answerWithContext({
    question,
    contexts: top.map((c) => c.content),
  });

  return NextResponse.json({ answer, contexts: top });
}

Implementation tips:

  • For Postgres, join doc_index back to documents and order by distance: SELECT d.id, d.content, d.source FROM doc_index i JOIN documents d ON d.id = i.id ORDER BY i.embedding <-> $1::vector LIMIT 15. Bind the query vector as a '[v1,v2,...]' string literal; pgvector does not parse JSON input.
  • Keep rerank optional. If you skip it, raise LIMIT and tighten chunk sizes to avoid irrelevant context.
  • Log latency per stage (embed, query, rerank, generate) to spot bottlenecks.
  • Cache the final answer for identical questions for 60–120 seconds to absorb retries.
  • Return contexts with sources in the response so you can render citations and debug misses.
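For the Postgres path, the only fiddly part is the vector parameter format. A sketch, assuming node-postgres (SEARCH_SQL and toVectorLiteral are illustrative names):

```typescript
// pgvector expects a '[v1,v2,...]' text literal when a vector is bound as a
// query parameter; this helper formats one.
function toVectorLiteral(vec: number[]): string {
  return `[${vec.join(",")}]`;
}

// Postgres counterpart of the SQLite retrieval query: join the vector index
// back to documents so content and source come back in one round trip.
const SEARCH_SQL = `
  SELECT d.id, d.content, d.source
    FROM doc_index i
    JOIN documents d ON d.id = i.id
   ORDER BY i.embedding <-> $1::vector
   LIMIT $2
`;

// Usage with node-postgres (assumed setup):
//   const { rows } = await pool.query(SEARCH_SQL, [toVectorLiteral(queryVec), 15]);
```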

Minimal error and input guards

if (question.length > 500) {
  return NextResponse.json({ error: "Question too long" }, { status: 400 });
}

Add limits for topK and model parameters so callers cannot drive runaway compute.
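One way to enforce those limits is to clamp every caller-supplied knob before use. A minimal sketch (clampInt and the bounds are illustrative; tune via env):

```typescript
// Coerce an untrusted value to an integer within [min, max], falling back to
// a default when the input is missing or not a number.
function clampInt(value: unknown, fallback: number, min: number, max: number): number {
  const n = Number(value);
  if (!Number.isFinite(n)) return fallback;
  return Math.min(max, Math.max(min, Math.trunc(n)));
}

// Example: const topK = clampInt(body.topK, 6, 1, 10);
```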


Deployment: Fly or Render on a Budget

  • Fly: Use fly volume for SQLite persistence. For Postgres, use Fly Postgres Lite and colocate app + DB in the same region. Set PRIMARY_REGION and pin your app to reduce cross-region latency.
  • Render: Starter instance for the app, free Postgres for tests; upgrade to Standard when write volume grows. Mount persistent disk for SQLite only if you accept single-instance constraints.
  • Connections: Enable connection pooling for pgvector. For SQLite, ensure WAL mode is on for concurrent reads.
  • Cold starts: Preload the DB connection and warmed embedding client at module scope to avoid startup spikes.

Performance and Cost Guardrails

  • Cap k for search (e.g., 15) and topK for rerank (e.g., 6) to keep latency predictable.
  • Cache embeddings for repeated questions during load tests to avoid accidental cost spikes.
  • Prefer smaller embedding models for recall; upgrade only if your domain accuracy requires it.
  • Stream answers to keep perceived latency low, and cap max output tokens so you only pay for what you need.
  • Add a simple budget alarm: if token spend per hour exceeds a threshold, tighten topK and context length.
  • Keep k and rerank thresholds configurable via env so you can tune without redeploys.
  • Add dead-letter logging for queries that return empty results; these feed back into ingestion or chunk-size tweaks.
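The budget alarm can be as simple as a sliding-window spend counter checked per request. A sketch (TokenBudget and its thresholds are assumptions; in multi-instance deploys, back it with shared storage):

```typescript
// Track token spend over a rolling window; when over budget, callers can
// tighten topK and context length until spend falls back under the limit.
class TokenBudget {
  private events: { at: number; tokens: number }[] = [];
  constructor(private limit: number, private windowMs = 3_600_000) {}

  record(tokens: number, now = Date.now()): void {
    this.events.push({ at: now, tokens });
  }

  overBudget(now = Date.now()): boolean {
    // Drop events older than the window, then sum what's left.
    this.events = this.events.filter((e) => now - e.at < this.windowMs);
    const used = this.events.reduce((sum, e) => sum + e.tokens, 0);
    return used > this.limit;
  }
}
```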

Quality and Evaluation Loop

  • Build a small eval set (50–100 Q&A pairs). Include edge cases: multi-part questions, entity disambiguation, and dates.
  • Run retrieval-only evals first: log hit rate at k=5/10 without generation to isolate recall issues.
  • Then run full RAG evals: measure exact match or semantic similarity of answers plus citation accuracy.
  • Track hallucination risk by counting answers that omit any context reference; lower topK and improve rerank if hallucinations rise.
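The retrieval-only eval above reduces to a hit-rate calculation: for each question, did the gold chunk land in the top k? A minimal sketch (EvalCase and hitRateAtK are illustrative names; gold ids come from hand-labeling your eval set):

```typescript
interface EvalCase {
  question: string;
  goldId: string; // Chunk id a correct retrieval must include.
}

// Fraction of cases whose gold chunk appears in the top-k retrieved ids.
function hitRateAtK(
  cases: EvalCase[],
  retrieved: Map<string, string[]>, // question -> ranked chunk ids
  k: number,
): number {
  let hits = 0;
  for (const c of cases) {
    const ids = (retrieved.get(c.question) ?? []).slice(0, k);
    if (ids.includes(c.goldId)) hits++;
  }
  return cases.length ? hits / cases.length : 0;
}
```

Run it at k=5 and k=10 before touching generation: if hit rate is low, no amount of prompt work will fix the answers.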

Operational Runbook

  • Health: p50/p95 latency per stage; errors grouped by stage (embed, search, rerank, generate).
  • Backpressure: If queue depth or latency spikes, temporarily drop rerank (cheaper) or lower topK.
  • Data changes: When you add or edit documents, re-embed only touched documents; avoid full reindex unless necessary.
  • Secrets: Keep provider keys server-side; rotate keys on schedule and on incident.
  • Backups: Daily SQLite/pg backups; test restore monthly.

Checklist to Ship

  • Embedding and rerank providers wired with timeouts and retries.
  • SQLite/pg schema created with vector support and seeded content.
  • /api/ask returns answers plus cited contexts; latency logged per stage.
  • Deploy manifest for Fly or Render with secrets/envs documented.
  • Budget guardrails: token alarms and capped topK.
  • Eval set runs in CI or a cron to catch regressions when you change chunking or k.
