Engineering Playbooks

Self-Hosting an LLM Inference Service with Observability

How to run your own LLM inference service with a small footprint, clear latency budgets, token-cost accounting, and end-to-end tracing.

By Rev.AISomething


Run your own inference endpoint with tight guardrails. This guide covers containerizing a model server, exposing a minimal HTTP interface, setting latency/token SLOs, and wiring logs, metrics, and traces so issues are diagnosable.


Minimal Containerized Service

# Dockerfile
FROM python:3.11-slim
WORKDIR /app
RUN pip install --no-cache-dir fastapi "uvicorn[standard]" transformers einops accelerate
COPY server.py .
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]

# server.py
# Simple inference server with latency and token accounting hooks.
import time
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
model_name = "HuggingFaceH4/zephyr-7b-alpha"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

class Request(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

@app.post("/generate")
def generate(req: Request):
    start = time.time()
    if not req.prompt:
        raise HTTPException(status_code=400, detail="prompt is required")
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    input_tokens = inputs.input_ids.shape[-1]

    # Only pass temperature when sampling; greedy decoding ignores it and newer
    # transformers versions warn if it is set alongside do_sample=False.
    do_sample = req.temperature > 0
    gen_kwargs = {"max_new_tokens": req.max_tokens, "do_sample": do_sample}
    if do_sample:
        gen_kwargs["temperature"] = req.temperature
    outputs = model.generate(**inputs, **gen_kwargs)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    output_tokens = outputs[0].shape[-1] - input_tokens
    latency_ms = int((time.time() - start) * 1000)

    # In production, push these metrics to your collector.
    print(
        {
            "event": "llm.generate",
            "latency_ms": latency_ms,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "model": model_name,
        }
    )
    return {"text": text, "latency_ms": latency_ms, "input_tokens": input_tokens, "output_tokens": output_tokens}

Notes:

  • Keep models small enough for your target instance (7B on 1–2 GPUs or strong CPU). For CPUs, prefer quantized variants.
  • Expose only the endpoints you need (/generate, /healthz) to reduce surface area; a minimal /healthz sketch follows these notes.
  • Pin versions in requirements.txt for reproducibility.
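
For the health check mentioned above, here is a minimal sketch of a /healthz route added to the same server.py. It reuses the app and model_name globals from the server; the response shape is my choice, not prescribed by this post.

# Sketch: readiness endpoint for server.py. Returning the model name lets the
# deploy platform's health check double as a cheap "which model is live" probe.
@app.get("/healthz")
def healthz():
    return {"status": "ok", "model": model_name}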

Latency and SLOs

  • Set an explicit SLO: e.g., p95 latency ≤ 1200 ms for prompts ≤ 200 tokens and outputs ≤ 256 tokens.
  • Enforce input guards: reject prompts over a limit to prevent unbounded compute (see the sketch after this list).
  • Add timeouts at the reverse proxy (Fly/Render/Gateway) and client SDK; align them with your p99 target.
  • Track cold starts separately from steady-state latency so you know whether to add a warm pool.
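
One way to implement the input guard is a small helper called right after tokenizing in /generate. This is a sketch: the limits below are illustrative and should match the SLO you declared (200 input / 256 output tokens in the example above), and enforce_limits is a name introduced here for illustration, not part of the service.

# Sketch: guard helper for server.py. Limits are illustrative; align them with your SLO.
from fastapi import HTTPException

MAX_PROMPT_TOKENS = 200
MAX_OUTPUT_TOKENS = 256

def enforce_limits(input_tokens: int, requested_max_tokens: int) -> int:
    """Reject oversized prompts and clamp the output budget."""
    if input_tokens > MAX_PROMPT_TOKENS:
        # 413 tells callers to shorten the prompt rather than retry as-is.
        raise HTTPException(status_code=413,
                            detail=f"prompt exceeds {MAX_PROMPT_TOKENS} tokens")
    return min(requested_max_tokens, MAX_OUTPUT_TOKENS)

In the handler, call max_new_tokens = enforce_limits(input_tokens, req.max_tokens) after computing input_tokens and pass the result to model.generate.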

Token and Cost Accounting

  • Log input_tokens and output_tokens per request; store aggregates in Prometheus or a cheap time-series DB (a Prometheus sketch follows this list).
  • If you wrap paid APIs as fallbacks, tag requests by provider and model to attribute spend correctly.
  • Add a simple budget gate: if token output per minute exceeds a threshold, temporarily lower max_tokens or shed low-priority traffic.
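
Here is a minimal sketch of the aggregates using prometheus_client. The metric and label names are placeholders chosen for this example, and the exporter setup (how /metrics is exposed) is left to your deployment.

# Sketch: token and latency aggregates with prometheus_client.
from prometheus_client import Counter, Histogram

TOKENS = Counter("llm_tokens_total", "Tokens processed", ["model", "direction"])
LATENCY = Histogram("llm_generate_latency_seconds", "End-to-end /generate latency", ["model"])

def record_request(model: str, input_tokens: int, output_tokens: int, latency_s: float) -> None:
    # Called at the end of /generate, alongside the structured log line.
    TOKENS.labels(model=model, direction="input").inc(input_tokens)
    TOKENS.labels(model=model, direction="output").inc(output_tokens)
    LATENCY.labels(model=model).observe(latency_s)

If you stay on FastAPI, prometheus_client's make_asgi_app() can be mounted at /metrics so the scraper hits the same process.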

Tracing

  • Use OpenTelemetry or a lightweight tracer to wrap the /generate handler (a sketch follows this list):
    • Span: llm.generate
    • Attributes: model, latency_ms, input_tokens, output_tokens, temperature, max_tokens
    • Events: start_load_model, end_load_model, start_generate, end_generate
  • Propagate a traceparent header from callers so you can connect frontend/API to inference spans.
  • Sample traces at a low rate (1–5%) in steady state; raise sampling when errors spike.
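
Here is a minimal sketch using the OpenTelemetry Python API. It assumes a TracerProvider and exporter are configured at startup, reuses the model_name global from server.py, and wraps the generation call passed in as run_generation (a name introduced here for illustration).

# Sketch: wrap generation in an OpenTelemetry span with the attributes listed above.
from opentelemetry import trace

tracer = trace.get_tracer("llm-inference")

def traced_generate(req, run_generation):
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("model", model_name)
        span.set_attribute("temperature", req.temperature)
        span.set_attribute("max_tokens", req.max_tokens)
        span.add_event("start_generate")
        result = run_generation(req)  # the body of /generate from server.py
        span.add_event("end_generate")
        span.set_attribute("input_tokens", result["input_tokens"])
        span.set_attribute("output_tokens", result["output_tokens"])
        span.set_attribute("latency_ms", result["latency_ms"])
        return result

For traceparent propagation, instrumenting the app with opentelemetry-instrumentation-fastapi (FastAPIInstrumentor.instrument_app(app)) extracts the incoming context so inference spans attach to the caller's trace.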

Deployment on a Budget

  • Fly: Use shared-cpu-1x for small models; attach a volume only if you need local weights caching. Pin to one region to avoid weight downloads per region. Add a fly.toml health check hitting /healthz.
  • Render: Use a starter service; mount a persistent disk if you want to cache weights. Set AUTO_SCALE=off for deterministic cost, then scale up intentionally.
  • Container registry: Push images to GHCR; keep them slim by pruning unused CUDA libraries when you target CPU.
  • Warm starts: Load the model at process start, not per request. If memory is tight, consider a single-worker process with a small concurrency limit (a sketch of such a limit follows this list).
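
One way to cap concurrency on a single worker is a semaphore that sheds load instead of queueing. This is a sketch, not part of the original service; MAX_CONCURRENCY=2 is arbitrary and should be sized to what one model instance can serve within your latency SLO.

# Sketch: shed load when the single worker is saturated. Because the /generate
# handler is sync, FastAPI runs it in a threadpool, so a threading semaphore works.
import threading
from fastapi import HTTPException

MAX_CONCURRENCY = 2
_slots = threading.BoundedSemaphore(MAX_CONCURRENCY)

def with_concurrency_limit(run_generation, req):
    if not _slots.acquire(blocking=False):
        # 503 lets the gateway or client back off instead of piling up requests.
        raise HTTPException(status_code=503, detail="server busy, retry shortly")
    try:
        return run_generation(req)
    finally:
        _slots.release()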

Hardening and Safety

  • Add simple content filters or a moderation hook if you expose the endpoint publicly.
  • Rate-limit by API key to prevent runaway costs and to isolate abusive tenants (a sketch follows this list).
  • Run with read-only filesystem where possible; keep only a cache directory writable for model weights.
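
A naive per-key limiter is enough for a single instance; the sketch below uses an in-process sliding window, with RATE_PER_MINUTE as an arbitrary illustrative value. For multiple instances, move this state to Redis or enforce the limit at the gateway.

# Sketch: in-process sliding-window rate limit keyed by API key.
import time
from collections import defaultdict
from fastapi import HTTPException

RATE_PER_MINUTE = 30
_requests: dict[str, list[float]] = defaultdict(list)

def check_rate_limit(api_key: str) -> None:
    now = time.time()
    recent = [t for t in _requests[api_key] if now - t < 60]
    if len(recent) >= RATE_PER_MINUTE:
        raise HTTPException(status_code=429, detail="rate limit exceeded")
    recent.append(now)
    _requests[api_key] = recent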

Checklist to Ship

  • Docker image builds reproducibly; model pinned.
  • /generate enforces prompt/output limits and returns token counts and latency.
  • Logs include model, latency, input/output tokens; traces emit spans with attributes.
  • SLO defined (e.g., p95 ≤ 1200 ms @ 200-in/256-out) and monitored.
  • Deploy config for Fly/Render with health checks and timeouts set.
LLM Inference · Observability · Cost Control · SLOs
