Engineering Playbooks
Self-Hosting an LLM Inference Service with Observability
How to run your own LLM inference service with a small footprint, clear latency budgets, token-cost accounting, and end-to-end tracing.
By Rev.AISomething
Run your own inference endpoint with tight guardrails. This guide covers containerizing a model server, exposing a minimal HTTP interface, setting latency/token SLOs, and wiring logs, metrics, and traces so issues are diagnosable.
Minimal Containerized Service
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
RUN pip install --no-cache-dir fastapi "uvicorn[standard]" torch transformers einops accelerate
COPY server.py .
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
# server.py
# Simple inference server with latency and token accounting hooks.
import time
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
app = FastAPI()
model_name = "HuggingFaceH4/zephyr-7b-alpha"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
class Request(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

@app.post("/generate")
def generate(req: Request):
    start = time.time()
    if not req.prompt:
        raise HTTPException(status_code=400, detail="prompt is required")
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    input_tokens = inputs.input_ids.shape[-1]
    outputs = model.generate(
        **inputs,
        max_new_tokens=req.max_tokens,
        temperature=req.temperature,
        do_sample=req.temperature > 0,
    )
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    output_tokens = outputs[0].shape[-1] - input_tokens
    latency_ms = int((time.time() - start) * 1000)
    # In production, push these metrics to your collector.
    print(
        {
            "event": "llm.generate",
            "latency_ms": latency_ms,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "model": model_name,
        }
    )
    return {"text": text, "latency_ms": latency_ms, "input_tokens": input_tokens, "output_tokens": output_tokens}
Notes:
- Keep models small enough for your target instance (7B on 1–2 GPUs or a strong CPU). For CPUs, prefer quantized variants.
- Expose only the endpoints you need (/generate, /healthz) to reduce surface area.
- Pin versions in requirements.txt for reproducibility (see the sketch after this list).
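A pinned requirements.txt sketch; the version numbers are illustrative, so pin to whatever combination you have actually tested:

# requirements.txt (illustrative pins)
fastapi==0.110.0
uvicorn[standard]==0.29.0
torch==2.2.0
transformers==4.39.0
accelerate==0.28.0
einops==0.7.0
pydantic==2.6.0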
Latency and SLOs
- Set an explicit SLO: e.g., p95 latency ≤ 1200 ms for prompts ≤ 200 tokens and outputs ≤ 256 tokens.
- Enforce input guards: reject prompts over a limit to prevent unbounded compute (see the sketch after this list).
- Add timeouts at the reverse proxy (Fly/Render/Gateway) and client SDK; align them with your p99 target.
- Track cold starts separately from steady-state latency so you know whether to add a warm pool.
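A minimal sketch of the input guard, reusing Request, tokenizer, and HTTPException from server.py above; the limits are illustrative and should be tuned to your SLO:

# Reject oversized prompts and clamp output length before touching the model.
MAX_PROMPT_TOKENS = 200   # illustrative, matches the example SLO above
MAX_OUTPUT_TOKENS = 256

def enforce_limits(req: Request) -> int:
    prompt_tokens = len(tokenizer(req.prompt).input_ids)
    if prompt_tokens > MAX_PROMPT_TOKENS:
        raise HTTPException(status_code=413, detail=f"prompt exceeds {MAX_PROMPT_TOKENS} tokens")
    req.max_tokens = min(req.max_tokens, MAX_OUTPUT_TOKENS)
    return prompt_tokens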
Token and Cost Accounting
- Log input_tokens and output_tokens per request; store aggregates in Prometheus or a cheap time-series DB.
- If you wrap paid APIs as fallbacks, tag requests by provider and model to attribute spend correctly.
- Add a simple budget gate: if token output per minute exceeds a threshold, temporarily lower max_tokens or shed low-priority traffic (see the sketch after this list).
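A minimal in-process sketch of the budget gate; the threshold and clamp values are hypothetical, and a multi-worker deployment would need shared state (e.g., Redis) instead of module globals:

# Clamp max_tokens when recent output-token volume exceeds a per-minute budget.
import time
from collections import deque

TOKEN_BUDGET_PER_MINUTE = 50_000   # hypothetical budget
REDUCED_MAX_TOKENS = 64            # hypothetical clamp under pressure

_recent = deque()  # (timestamp, output_tokens) pairs from the last 60 seconds

def record_output_tokens(n: int) -> None:
    now = time.time()
    _recent.append((now, n))
    while _recent and now - _recent[0][0] > 60:
        _recent.popleft()

def effective_max_tokens(requested: int) -> int:
    spent = sum(n for _, n in _recent)
    return min(requested, REDUCED_MAX_TOKENS) if spent > TOKEN_BUDGET_PER_MINUTE else requested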
Tracing
- Use OpenTelemetry or a lightweight tracer to wrap the /generate handler (a sketch follows this list):
  - Span: llm.generate
  - Attributes: model, latency_ms, input_tokens, output_tokens, temperature, max_tokens
  - Events: start_load_model, end_load_model, start_generate, end_generate
- Propagate a traceparent header from callers so you can connect frontend/API to inference spans.
- Sample traces at a low rate (1–5%) in steady state; raise sampling when errors spike.
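A minimal OpenTelemetry sketch of that span; it reuses model_name from server.py and assumes an SDK tracer provider and exporter are configured elsewhere:

# Wrap generation in an llm.generate span carrying the attributes and events above.
from opentelemetry import trace

tracer = trace.get_tracer("llm-inference")

def traced_generate(req, run_generation):
    # run_generation is the callable that actually tokenizes and calls model.generate.
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("model", model_name)
        span.set_attribute("temperature", req.temperature)
        span.set_attribute("max_tokens", req.max_tokens)
        span.add_event("start_generate")
        result = run_generation(req)
        span.add_event("end_generate")
        span.set_attribute("input_tokens", result["input_tokens"])
        span.set_attribute("output_tokens", result["output_tokens"])
        span.set_attribute("latency_ms", result["latency_ms"])
        return result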
Deployment on a Budget
- Fly: Use shared-cpu-1x for small models; attach a volume only if you need local weights caching. Pin to one region to avoid weight downloads per region. Add a fly.toml health check hitting /healthz (see the sketch after this list).
- Render: Use a starter service; mount a persistent disk if you want to cache weights. Set AUTO_SCALE=off for deterministic cost, then scale up intentionally.
- Container registry: Push images to GHCR; keep them slim by pruning unused CUDA libraries when you target CPU.
- Warm starts: Load the model at process start, not per request. If memory is tight, consider a single-worker process with a small concurrency limit.
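A fly.toml sketch of the health check; the app name and region are placeholders, and the exact schema may differ across Fly platform versions:

# fly.toml (illustrative)
app = "llm-inference-demo"        # placeholder app name
primary_region = "iad"            # pin one region so weights are cached once

[http_service]
  internal_port = 8000
  force_https = true

  [[http_service.checks]]
    method = "GET"
    path = "/healthz"
    interval = "15s"
    timeout = "5s"
    grace_period = "60s"          # allow time for model load on boot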
Hardening and Safety
- Add simple content filters or a moderation hook if you expose the endpoint publicly.
- Rate-limit by API key to prevent runaway costs and to isolate abusive tenants (see the sketch after this list).
- Run with read-only filesystem where possible; keep only a cache directory writable for model weights.
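A minimal in-process sketch of per-key rate limiting as a FastAPI dependency; it assumes keys arrive in an X-API-Key header, and a multi-instance deployment would enforce this at the gateway or back it with Redis:

# Naive fixed-window limiter keyed by API key (single-process, illustrative only).
import time
from collections import defaultdict
from fastapi import Depends, Header, HTTPException

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30      # hypothetical per-key budget

_windows = defaultdict(lambda: [0.0, 0])   # api_key -> [window_start, request_count]

def enforce_rate_limit(x_api_key: str = Header(...)):
    now = time.time()
    window_start, count = _windows[x_api_key]
    if now - window_start >= WINDOW_SECONDS:
        _windows[x_api_key] = [now, 1]
        return
    if count >= MAX_REQUESTS_PER_WINDOW:
        raise HTTPException(status_code=429, detail="rate limit exceeded")
    _windows[x_api_key][1] = count + 1

# Usage: @app.post("/generate", dependencies=[Depends(enforce_rate_limit)])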
Checklist to Ship
- Docker image builds reproducibly; model pinned.
- /generate enforces prompt/output limits and returns token counts and latency.
- Logs include model, latency, input/output tokens; traces emit spans with attributes.
- SLO defined (e.g., p95 ≤ 1200 ms @ 200-in/256-out) and monitored.
- Deploy config for Fly/Render with health checks and timeouts set.