Engineering Playbooks

Building a Lightweight RAG Stack on a Shoestring

A practical RAG build for devs on a budget: ingest once, store embeddings in SQLite/pg, add a rerank pass, and deploy on Fly or Render with sensible perf defaults.

By Rev.AISomething


Retrieval-augmented generation can be lean. This guide ships a small RAG stack with SQLite or Postgres, open embeddings, a rerank pass, and a low-cost deploy on Fly or Render.

Goal: predictable cost, predictable latency, clear fallbacks.

Scope:

  • Ingestion that chunks text and writes embeddings to SQLite/pg
  • Retrieval that runs a vector search, reranks, and returns top contexts
  • A minimal Next.js handler to serve answers
  • Deployment tips for Fly/Render and perf guardrails
  • Observability, evals, and a runbook so you can keep it healthy

Architecture at a Glance

  • Store: SQLite with sqlite-vec for local/single-instance; Postgres with pgvector for shared instances.
  • Embeddings: Small models (OpenAI text-embedding-3-small or open embeddings API) to keep latency and cost down.
  • Rerank: A light reranker (Cohere rerank, Voyage, or open-source cross-encoder) to improve precision after broad recall.
  • API surface: /api/ask that accepts a question, returns an answer plus cited chunks, and reports token use.
  • Deploy: Fly (shared CPU, small volume) or Render (starter instance). One region, one DB. Scale when hit rates or uptime demand it.
  • Observability: Latency per stage, token counts, cache hit rate, and top queries logged to keep drift visible.

Key Concepts

  • Chunking: Splitting long docs into smaller passages so search can find the right paragraph instead of the whole PDF.
  • Embeddings: Turning text into vectors so similar passages sit near each other in vector space.
  • Reranking: A second pass that reorders the top search hits by semantic fit to the question.
  • Context window budgeting: Sending only the best few chunks to the model to avoid cost and distraction.
  • Feedback loop: Log what was retrieved and how often it helped; adjust chunking, k, and rerank thresholds with data.
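The ingest script below assumes a chunkText helper. Here is a minimal sketch of one, with token counts approximated by word counts (a real tokenizer is a drop-in upgrade); the overlap keeps sentences that straddle a boundary retrievable from both sides:

```typescript
// lib/chunk.ts (illustrative): split text into overlapping word-based chunks.
// "Tokens" are approximated by words here; swap in a real tokenizer if needed.
export function chunkText(text: string, size = 500, overlap = 60): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = Math.max(1, size - overlap); // Guard against overlap >= size.
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // Last window reached the end.
  }
  return chunks;
}
```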

Ingest: Chunk, Embed, Persist

// scripts/ingest.ts
// Chunk text, embed, and store in SQLite with sqlite-vec.
import fs from "node:fs";
import Database from "better-sqlite3";
import * as sqliteVec from "sqlite-vec";
import { embed } from "@/lib/embed"; // Wraps your embedding provider client.
import { chunkText } from "@/lib/chunk"; // Simple splitter (e.g., 500 tokens with overlap).

const db = new Database("rag.db");
sqliteVec.load(db); // Load the sqlite-vec extension before touching the virtual table.
db.exec(`
  CREATE TABLE IF NOT EXISTS documents (
    id TEXT PRIMARY KEY,
    content TEXT NOT NULL,
    source TEXT NOT NULL
  );
  CREATE VIRTUAL TABLE IF NOT EXISTS doc_index USING vec0(
    id TEXT PRIMARY KEY,
    embedding FLOAT[1536]
  );
`);

async function ingest(source: string, text: string) {
  const chunks = chunkText(text, 500, 60); // Overlap to keep context coherent.
  const insertDoc = db.prepare("INSERT OR REPLACE INTO documents (id, content, source) VALUES (?, ?, ?)");
  const insertVec = db.prepare("INSERT OR REPLACE INTO doc_index (id, embedding) VALUES (?, ?)");

  for (const [i, chunk] of chunks.entries()) {
    const id = `${source}#${i}`;
    const embedding = await embed(chunk); // Returns number[] length 1536.
    insertDoc.run(id, chunk, source);
    insertVec.run(id, JSON.stringify(embedding));
  }
}

// Example usage; replace with your corpus loader.
await ingest("handbook", await fs.promises.readFile("handbook.md", "utf8"));

Notes:

  • For Postgres, swap sqlite-vec for pgvector (CREATE EXTENSION vector;) and store embeddings in a vector(1536) column.
  • Keep chunk sizes modest; smaller chunks improve recall but may need a rerank to maintain precision.
  • Batch embeddings in small groups to avoid rate limits and to parallelize safely.
  • Clean input first: strip boilerplate (nav, footers), dedupe near-identical sections, and normalize whitespace. Bad input yields bad retrieval.
  • Keep a source field that links each chunk back to the original doc for citations and debugging.

Postgres schema sketch

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id text PRIMARY KEY,
  content text NOT NULL,
  source text NOT NULL
);

CREATE TABLE doc_index (
  id text PRIMARY KEY REFERENCES documents(id) ON DELETE CASCADE,
  embedding vector(1536) NOT NULL
);

-- Optional: narrow index for faster ANN
CREATE INDEX ON doc_index USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

Query: Retrieve, Rerank, Answer

// app/api/ask/route.ts
import { NextResponse } from "next/server";
import Database from "better-sqlite3";
import * as sqliteVec from "sqlite-vec";
import { embed } from "@/lib/embed";
import { rerank } from "@/lib/rerank"; // Thin wrapper over your rerank provider.
import { answerWithContext } from "@/lib/llm"; // Calls your chat model with context.

const db = new Database("rag.db", { readonly: true });
sqliteVec.load(db); // The vec0 virtual table needs the extension on every connection.

export async function POST(req: Request) {
  const { question } = await req.json();
  if (!question) {
    return NextResponse.json({ error: "Question is required" }, { status: 400 });
  }

  const queryVec = await embed(question);
  // KNN search via sqlite-vec: MATCH takes the query vector (a JSON array string
  // works) and rows come back ordered by distance. Adjust LIMIT to recall needs.
  const rows = db
    .prepare("SELECT id, distance FROM doc_index WHERE embedding MATCH ? ORDER BY distance LIMIT 15")
    .all(JSON.stringify(queryVec));

  // Fetch chunk text and rerank for semantic fit.
  const candidates = rows.map((row) => {
    const doc = db.prepare("SELECT content, source FROM documents WHERE id = ?").get(row.id);
    return { id: row.id, content: doc.content, source: doc.source };
  });

  const reranked = await rerank(question, candidates, { topK: 6 });
  const top = reranked.slice(0, 6);

  const answer = await answerWithContext({
    question,
    contexts: top.map((c) => c.content),
  });

  return NextResponse.json({ answer, contexts: top });
}

Implementation tips:

  • For Postgres, join doc_index back to documents and order by distance: SELECT d.id, d.content, d.source FROM doc_index i JOIN documents d ON d.id = i.id ORDER BY i.embedding <-> $1::vector LIMIT 15. Bind the query vector as a '[v1,v2,...]' string literal; pgvector does not parse JSON input.
  • Keep rerank optional. If you skip it, raise LIMIT and tighten chunk sizes to avoid irrelevant context.
  • Log latency per stage (embed, query, rerank, generate) to spot bottlenecks.
  • Cache the final answer for identical questions for 60–120 seconds to absorb retries.
  • Return contexts with sources in the response so you can render citations and debug misses.
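For the Postgres path, the only fiddly part is the vector parameter format. A sketch, assuming node-postgres (SEARCH_SQL and toVectorLiteral are illustrative names):

```typescript
// pgvector expects a '[v1,v2,...]' text literal when a vector is bound as a
// query parameter; this helper formats one.
function toVectorLiteral(vec: number[]): string {
  return `[${vec.join(",")}]`;
}

// Postgres counterpart of the SQLite retrieval query: join the vector index
// back to documents so content and source come back in one round trip.
const SEARCH_SQL = `
  SELECT d.id, d.content, d.source
    FROM doc_index i
    JOIN documents d ON d.id = i.id
   ORDER BY i.embedding <-> $1::vector
   LIMIT $2
`;

// Usage with node-postgres (assumed setup):
//   const { rows } = await pool.query(SEARCH_SQL, [toVectorLiteral(queryVec), 15]);
```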

Minimal error and input guards

if (question.length > 500) {
  return NextResponse.json({ error: "Question too long" }, { status: 400 });
}

Add limits for topK and model parameters so callers cannot drive runaway compute.
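One way to enforce those limits is to clamp every caller-supplied knob before use. A minimal sketch (clampInt and the bounds are illustrative; tune via env):

```typescript
// Coerce an untrusted value to an integer within [min, max], falling back to
// a default when the input is missing or not a number.
function clampInt(value: unknown, fallback: number, min: number, max: number): number {
  const n = Number(value);
  if (!Number.isFinite(n)) return fallback;
  return Math.min(max, Math.max(min, Math.trunc(n)));
}

// Example: const topK = clampInt(body.topK, 6, 1, 10);
```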


Deployment: Fly or Render on a Budget

  • Fly: Use fly volume for SQLite persistence. For Postgres, use Fly Postgres Lite and colocate app + DB in the same region. Set PRIMARY_REGION and pin your app to reduce cross-region latency.
  • Render: Starter instance for the app, free Postgres for tests; upgrade to Standard when write volume grows. Mount persistent disk for SQLite only if you accept single-instance constraints.
  • Connections: Enable connection pooling for pgvector. For SQLite, ensure WAL mode is on for concurrent reads.
  • Cold starts: Preload the DB connection and warmed embedding client at module scope to avoid startup spikes.

Performance and Cost Guardrails

  • Cap k for search (e.g., 15) and topK for rerank (e.g., 6) to keep latency predictable.
  • Cache embeddings for repeated questions during load tests to avoid accidental cost spikes.
  • Prefer smaller embedding models for recall; upgrade only if your domain accuracy requires it.
  • Stream answers to keep perceived latency low, and cap max output tokens so you only pay for what you need.
  • Add a simple budget alarm: if token spend per hour exceeds a threshold, tighten topK and context length.
  • Keep k and rerank thresholds configurable via env so you can tune without redeploys.
  • Add dead-letter logging for queries that return empty results; these feed back into ingestion or chunk-size tweaks.
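The budget alarm can be as simple as a sliding-window spend counter checked per request. A sketch (TokenBudget and its thresholds are assumptions; in multi-instance deploys, back it with shared storage):

```typescript
// Track token spend over a rolling window; when over budget, callers can
// tighten topK and context length until spend falls back under the limit.
class TokenBudget {
  private events: { at: number; tokens: number }[] = [];
  constructor(private limit: number, private windowMs = 3_600_000) {}

  record(tokens: number, now = Date.now()): void {
    this.events.push({ at: now, tokens });
  }

  overBudget(now = Date.now()): boolean {
    // Drop events older than the window, then sum what's left.
    this.events = this.events.filter((e) => now - e.at < this.windowMs);
    const used = this.events.reduce((sum, e) => sum + e.tokens, 0);
    return used > this.limit;
  }
}
```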

Quality and Evaluation Loop

  • Build a small eval set (50–100 Q&A pairs). Include edge cases: multi-part questions, entity disambiguation, and dates.
  • Run retrieval-only evals first: log hit rate at k=5/10 without generation to isolate recall issues.
  • Then run full RAG evals: measure exact match or semantic similarity of answers plus citation accuracy.
  • Track hallucination risk by counting answers that omit any context reference; lower topK and improve rerank if hallucinations rise.
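The retrieval-only eval above reduces to a hit-rate calculation: for each question, did the gold chunk land in the top k? A minimal sketch (EvalCase and hitRateAtK are illustrative names; gold ids come from hand-labeling your eval set):

```typescript
interface EvalCase {
  question: string;
  goldId: string; // Chunk id a correct retrieval must include.
}

// Fraction of cases whose gold chunk appears in the top-k retrieved ids.
function hitRateAtK(
  cases: EvalCase[],
  retrieved: Map<string, string[]>, // question -> ranked chunk ids
  k: number,
): number {
  let hits = 0;
  for (const c of cases) {
    const ids = (retrieved.get(c.question) ?? []).slice(0, k);
    if (ids.includes(c.goldId)) hits++;
  }
  return cases.length ? hits / cases.length : 0;
}
```

Run it at k=5 and k=10 before touching generation: if hit rate is low, no amount of prompt work will fix the answers.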

Operational Runbook

  • Health: p50/p95 latency per stage; errors grouped by stage (embed, search, rerank, generate).
  • Backpressure: If queue depth or latency spikes, temporarily drop rerank (cheaper) or lower topK.
  • Data changes: When you add or edit documents, re-embed only touched documents; avoid full reindex unless necessary.
  • Secrets: Keep provider keys server-side; rotate keys on schedule and on incident.
  • Backups: Daily SQLite/pg backups; test restore monthly.

Checklist to Ship

  • Embedding and rerank providers wired with timeouts and retries.
  • SQLite/pg schema created with vector support and seeded content.
  • /api/ask returns answers plus cited contexts; latency logged per stage.
  • Deploy manifest for Fly or Render with secrets/envs documented.
  • Budget guardrails: token alarms and capped topK.
  • Eval set runs in CI or a cron to catch regressions when you change chunking or k.
