Engineering Playbooks
Building a Lightweight RAG Stack on a Shoestring
A practical RAG build for devs on a budget: ingest once, store embeddings in SQLite/pg, add a rerank pass, and deploy on Fly or Render with sensible perf defaults.
By Rev.AISomething
Retrieval-augmented generation can be lean. This guide ships a small RAG stack with SQLite or Postgres, open embeddings, a rerank pass, and a low-cost deploy on Fly or Render.
Goal: predictable cost, predictable latency, clear fallbacks.
Scope:
- Ingestion that chunks text and writes embeddings to SQLite/pg
- Retrieval that runs a vector search, reranks, and returns top contexts
- A minimal Next.js handler to serve answers
- Deployment tips for Fly/Render and perf guardrails
- Observability, evals, and a runbook so you can keep it healthy
Architecture at a Glance
- Store: SQLite with sqlite-vec for local/single-instance; Postgres with pgvector for shared instances.
- Embeddings: Small models (OpenAI text-embedding-3-small or an open embeddings API) to keep latency and cost down.
- Rerank: A light reranker (Cohere rerank, Voyage, or an open-source cross-encoder) to improve precision after broad recall.
- API surface: /api/ask accepts a question, returns an answer plus cited chunks, and reports token use.
- Deploy: Fly (shared CPU, small volume) or Render (starter instance). One region, one DB. Scale when hit rates or uptime demand it.
- Observability: Latency per stage, token counts, cache hit rate, and top queries logged to keep drift visible.
Key Concepts
- Chunking: Splitting long docs into smaller passages so search can find the right paragraph instead of the whole PDF.
- Embeddings: Turning text into vectors so similar passages sit near each other in vector space.
- Reranking: A second pass that reorders the top search hits by semantic fit to the question.
- Context window budgeting: Sending only the best few chunks to the model to avoid cost and distraction.
- Feedback loop: Log what was retrieved and how often it helped; adjust chunking, k, and rerank thresholds with data.
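Concretely, the splitter behind the chunking concept can be tiny. Here is a minimal sketch of the chunkText helper the ingest script imports below, approximating a token as ~4 characters (swap in a real tokenizer if you need exact budgets):
// lib/chunk.ts
// Fixed-size windows with overlap so neighboring chunks share context.
// Token counts are approximated as ~4 characters per token.
export function chunkText(text: string, maxTokens = 500, overlapTokens = 60): string[] {
  const maxChars = maxTokens * 4;
  const stride = maxChars - overlapTokens * 4;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += stride) {
    chunks.push(text.slice(start, start + maxChars));
  }
  return chunks;
}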
Ingest: Chunk, Embed, Persist
// scripts/ingest.ts
// Chunk text, embed, and store in SQLite with sqlite-vec.
import Database from "better-sqlite3";
import * as sqliteVec from "sqlite-vec"; // Loads the vec0 extension into a connection.
import fs from "node:fs";
import { embed } from "@/lib/embed"; // Wraps your embedding provider client.
import { chunkText } from "@/lib/chunk"; // Simple splitter (e.g., 500 tokens with overlap).
const db = new Database("rag.db");
sqliteVec.load(db); // Must run before creating or querying vec0 tables.
db.exec(`
CREATE TABLE IF NOT EXISTS documents (
id TEXT PRIMARY KEY,
content TEXT NOT NULL,
source TEXT NOT NULL
);
CREATE VIRTUAL TABLE IF NOT EXISTS doc_index USING vec0(
id TEXT PRIMARY KEY,
embedding FLOAT[1536]
);
`);
async function ingest(source: string, text: string) {
const chunks = chunkText(text, 500, 60); // Overlap to keep context coherent.
const insertDoc = db.prepare("INSERT OR REPLACE INTO documents (id, content, source) VALUES (?, ?, ?)");
const insertVec = db.prepare("INSERT OR REPLACE INTO doc_index (id, embedding) VALUES (?, ?)");
for (const [i, chunk] of chunks.entries()) {
const id = `${source}#${i}`;
const embedding = await embed(chunk); // Returns number[] length 1536.
insertDoc.run(id, chunk, source);
insertVec.run(id, JSON.stringify(embedding));
}
}
// Example usage; replace with your corpus loader.
await ingest("handbook", await fs.promises.readFile("handbook.md", "utf8"));
Notes:
- For Postgres, swap sqlite-vec for pgvector (CREATE EXTENSION vector;) and store embeddings in a vector(1536) column.
- Keep chunk sizes modest; smaller chunks improve recall but may need a rerank to maintain precision.
- Batch embeddings in small groups to avoid rate limits and to parallelize safely (see the sketch after this list).
- Clean input first: strip boilerplate (nav, footers), dedupe near-identical sections, and normalize whitespace. Bad input yields bad retrieval.
- Keep a source that lets you link back to the original doc for citations and debugging.
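Batching deserves a concrete shape. A minimal sketch, assuming a hypothetical embedBatch wrapper that takes an array of strings and returns vectors in the same order:
// lib/embed-all.ts
// Embed chunks in small sequential batches to stay under provider rate limits.
import { embedBatch } from "@/lib/embed"; // Hypothetical batch variant of embed().
export async function embedAll(chunks: string[], batchSize = 32): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    vectors.push(...(await embedBatch(batch))); // One request per batch; order preserved.
  }
  return vectors;
}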
Postgres schema sketch
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
id text PRIMARY KEY,
content text NOT NULL,
source text NOT NULL
);
CREATE TABLE doc_index (
id text PRIMARY KEY REFERENCES documents(id) ON DELETE CASCADE,
embedding vector(1536) NOT NULL
);
-- Optional: approximate (IVFFlat) index for faster ANN search; build after bulk-loading rows
CREATE INDEX ON doc_index USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
Query: Retrieve, Rerank, Answer
// app/api/ask/route.ts
import { NextResponse } from "next/server";
import Database from "better-sqlite3";
import * as sqliteVec from "sqlite-vec"; // Same vec0 extension load as in the ingest script.
import { embed } from "@/lib/embed";
import { rerank } from "@/lib/rerank"; // Thin wrapper over your rerank provider.
import { answerWithContext } from "@/lib/llm"; // Calls your chat model with context.
const db = new Database("rag.db", { readonly: true });
sqliteVec.load(db);
export async function POST(req: Request) {
const { question } = await req.json();
if (!question) {
return NextResponse.json({ error: "Question is required" }, { status: 400 });
}
const queryVec = await embed(question);
// Nearest neighbors from the vec0 index; adjust k to match chunk size/recall needs.
// sqlite-vec KNN uses MATCH plus ORDER BY distance (the pgvector <-> operator is Postgres-only).
const rows = db
.prepare("SELECT id, distance FROM doc_index WHERE embedding MATCH ? ORDER BY distance LIMIT 15")
.all(JSON.stringify(queryVec));
// Fetch chunk text and rerank for semantic fit.
const candidates = rows.map((row) => {
const doc = db.prepare("SELECT content, source FROM documents WHERE id = ?").get(row.id);
return { id: row.id, content: doc.content, source: doc.source };
});
const reranked = await rerank(question, candidates, { topK: 6 });
const top = reranked.slice(0, 6);
const answer = await answerWithContext({
question,
contexts: top.map((c) => c.content),
});
return NextResponse.json({ answer, contexts: top });
}
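A quick smoke test against the handler, assuming the dev server runs on port 3000:
// scripts/smoke.ts
const res = await fetch("http://localhost:3000/api/ask", {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({ question: "What does the handbook say about expenses?" }),
});
console.log(await res.json()); // { answer, contexts: [{ id, content, source }, ...] }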
Implementation tips:
- For Postgres, join the tables and order by cosine distance so the IVFFlat index is used: SELECT d.id, d.content, d.source FROM doc_index i JOIN documents d USING (id) ORDER BY i.embedding <=> $1::vector LIMIT 15, passing the query vector as a string like '[0.1, 0.2, ...]'. (The embedding lives in doc_index, not documents, and <=> matches the vector_cosine_ops opclass above.)
- Keep rerank optional. If you skip it, raise LIMIT and tighten chunk sizes to avoid irrelevant context.
- Log latency per stage (embed, query, rerank, generate) to spot bottlenecks.
- Cache the final answer for identical questions for 60–120 seconds to absorb retries (a minimal cache sketch follows this list).
- Return contexts with sources in the response so you can render citations and debug misses.
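The caching tip is small enough to inline. A per-instance sketch (single process only; all names here are hypothetical):
// lib/answer-cache.ts
// Tiny in-memory TTL cache keyed by the normalized question.
// Per-process only: it absorbs retries on one instance, not across a fleet.
type Entry = { answer: string; expires: number };
const cache = new Map<string, Entry>();
export function getCachedAnswer(question: string): string | undefined {
  const key = question.trim().toLowerCase();
  const hit = cache.get(key);
  if (hit && hit.expires > Date.now()) return hit.answer;
  cache.delete(key); // Drop stale entries lazily.
  return undefined;
}
export function setCachedAnswer(question: string, answer: string, ttlMs = 90_000): void {
  cache.set(question.trim().toLowerCase(), { answer, expires: Date.now() + ttlMs });
}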
Minimal error and input guards
if (question.length > 500) {
return NextResponse.json({ error: "Question too long" }, { status: 400 });
}
Add limits for topK and model parameters so callers cannot drive runaway compute.
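For example, clamping hypothetical topK and maxTokens fields from the request body:
// Clamp caller-supplied knobs before they reach search or the model.
const topK = Math.min(Math.max(Number(body.topK ?? 6), 1), 10); // Allow 1..10 only.
const maxTokens = Math.min(Number(body.maxTokens ?? 512), 1024); // Hard output ceiling.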
Deployment: Fly or Render on a Budget
- Fly: Use a fly volume for SQLite persistence. For Postgres, use Fly Postgres Lite and colocate app + DB in the same region. Set PRIMARY_REGION and pin your app to reduce cross-region latency.
- Render: Starter instance for the app, free Postgres for tests; upgrade to Standard when write volume grows. Mount a persistent disk for SQLite only if you accept single-instance constraints.
- Connections: Enable connection pooling for pgvector. For SQLite, ensure WAL mode is on for concurrent reads (see the pragma sketch after this list).
- Cold starts: Preload the DB connection and a warmed embedding client at module scope to avoid startup spikes.
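With better-sqlite3, WAL is a one-line pragma; once set, the journal mode persists in the database file:
// Enable write-ahead logging so concurrent readers don't block the writer.
// Run once (e.g., at ingest time); subsequent connections inherit it.
db.pragma("journal_mode = WAL");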
Performance and Cost Guardrails
- Cap k for search (e.g., 15) and topK for rerank (e.g., 6) to keep latency predictable.
- Cache embeddings for repeated questions during load tests to avoid accidental cost spikes.
- Prefer smaller embedding models for recall; upgrade only if your domain accuracy requires it.
- Stream answers to keep perceived latency low; you pay only for the tokens you send.
- Add a simple budget alarm: if token spend per hour exceeds a threshold, tighten topK and context length.
- Keep k and rerank thresholds configurable via env so you can tune without redeploys (a sketch follows this list).
- Add dead-letter logging for queries that return empty results; these feed back into ingestion or chunk-size tweaks.
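Reading the knobs from env keeps tuning redeploy-free. A sketch with hypothetical variable names:
// lib/config.ts
// Tunables with safe defaults; override via env to tune without a redeploy.
export const SEARCH_K = Number(process.env.SEARCH_K ?? 15); // Broad-recall candidate count.
export const RERANK_TOP_K = Number(process.env.RERANK_TOP_K ?? 6); // Contexts sent to the model.
export const RERANK_MIN_SCORE = Number(process.env.RERANK_MIN_SCORE ?? 0.3); // Drop weak hits below this.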
Quality and Evaluation Loop
- Build a small eval set (50–100 Q&A pairs). Include edge cases: multi-part questions, entity disambiguation, and dates.
- Run retrieval-only evals first: log hit rate at k=5/10 without generation to isolate recall issues (a sketch follows this list).
- Then run full RAG evals: measure exact match or semantic similarity of answers plus citation accuracy.
- Track hallucination risk by counting answers that omit any context reference; lower topK and improve rerank if hallucinations rise.
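A retrieval-only eval fits in a short script. A sketch, assuming a hypothetical search(question, k) helper that returns ranked chunk ids and an eval file of { question, expectedId } pairs:
// scripts/eval-retrieval.ts
// Hit rate @ k: fraction of questions whose expected chunk id appears in the top k.
import fs from "node:fs";
import { search } from "@/lib/search"; // Hypothetical: embeds the question and queries the index.
const evalSet: { question: string; expectedId: string }[] =
  JSON.parse(fs.readFileSync("eval-set.json", "utf8"));
for (const k of [5, 10]) {
  let hits = 0;
  for (const { question, expectedId } of evalSet) {
    const ids = (await search(question, k)).map((r: { id: string }) => r.id);
    if (ids.includes(expectedId)) hits += 1;
  }
  console.log(`hit@${k}: ${(hits / evalSet.length).toFixed(2)}`);
}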
Operational Runbook
- Health: p50/p95 latency per stage; errors grouped by stage (embed, search, rerank, generate).
- Backpressure: If queue depth or latency spikes, temporarily drop rerank (cheaper) or lower topK.
- Data changes: When you add or edit documents, re-embed only touched documents; avoid a full reindex unless necessary.
- Secrets: Keep provider keys server-side; rotate keys on schedule and on incident.
- Backups: Daily SQLite/pg backups; test restore monthly.
Checklist to Ship
- Embedding and rerank providers wired with timeouts and retries.
- SQLite/pg schema created with vector support and seeded content.
- /api/ask returns answers plus cited contexts; latency logged per stage.
- Deploy manifest for Fly or Render with secrets/envs documented.
- Budget guardrails: token alarms and capped topK.
- Eval set runs in CI or a cron to catch regressions when you change chunking or k.
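The first checklist item (timeouts and retries) can be one shared helper. A sketch:
// lib/http.ts
// Shared timeout + retry wrapper for embedding, rerank, and chat calls.
export function fetchWithTimeout(url: string, init: RequestInit = {}, ms = 10_000): Promise<Response> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), ms);
  return fetch(url, { ...init, signal: ctrl.signal }).finally(() => clearTimeout(timer));
}
export async function withRetry<T>(fn: () => Promise<T>, retries = 2): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      await new Promise((resolve) => setTimeout(resolve, 250 * 2 ** attempt)); // Backoff: 250ms, 500ms, ...
    }
  }
  throw lastErr;
}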