Engineering Playbooks

Self-Hosting an LLM Inference Service with Observability

How to run your own LLM inference service with a small footprint, clear latency budgets, token-cost accounting, and end-to-end tracing.

By Rev.AISomething


Run your own inference endpoint with tight guardrails. This guide covers containerizing a model server, exposing a minimal HTTP interface, setting latency/token SLOs, and wiring logs, metrics, and traces so issues are diagnosable.


Minimal Containerized Service

# Dockerfile
FROM python:3.11-slim
WORKDIR /app
RUN pip install --no-cache-dir fastapi "uvicorn[standard]" torch transformers einops accelerate
COPY server.py .
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]

# server.py
# Simple inference server with latency and token accounting hooks.
import time
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
model_name = "HuggingFaceH4/zephyr-7b-alpha"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

class Request(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

@app.post("/generate")
def generate(req: Request):
    start = time.time()
    if not req.prompt:
        raise HTTPException(status_code=400, detail="prompt is required")
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    input_tokens = inputs.input_ids.shape[-1]

    do_sample = req.temperature > 0
    gen_kwargs = {"max_new_tokens": req.max_tokens, "do_sample": do_sample}
    if do_sample:
        gen_kwargs["temperature"] = req.temperature
    outputs = model.generate(**inputs, **gen_kwargs)
    # Decode only the newly generated tokens so the prompt is not echoed back.
    text = tokenizer.decode(outputs[0][input_tokens:], skip_special_tokens=True)
    output_tokens = outputs[0].shape[-1] - input_tokens
    latency_ms = int((time.time() - start) * 1000)

    # In production, push these metrics to your collector.
    print(
        {
            "event": "llm.generate",
            "latency_ms": latency_ms,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "model": model_name,
        }
    )
    return {"text": text, "latency_ms": latency_ms, "input_tokens": input_tokens, "output_tokens": output_tokens}

Notes:

  • Keep models small enough for your target instance (7B on 1–2 GPUs or a strong CPU); on CPU, prefer quantized variants.
  • Expose only the endpoints you need (/generate, /healthz) to reduce surface area.
  • Pin dependency versions (in requirements.txt or directly in the Dockerfile) for reproducibility; see the sketch below.
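
For example (version pins are illustrative, not prescribed by this guide; replace them with the versions you actually test against, and swap the inline pip install for COPY requirements.txt plus pip install -r requirements.txt if you go this route):

# requirements.txt -- example pins only
fastapi==0.110.0
uvicorn[standard]==0.29.0
torch==2.2.2
transformers==4.40.0
accelerate==0.29.0
einops==0.7.0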

Latency and SLOs

  • Set an explicit SLO: e.g., p95 latency ≀ 1200 ms for prompts ≀ 200 tokens and outputs ≀ 256 tokens.
  • Enforce input guards: reject prompts over a token limit to prevent unbounded compute (see the sketch after this list).
  • Add timeouts at the reverse proxy (Fly/Render/Gateway) and client SDK; align them with your p99 target.
  • Track cold starts separately from steady-state latency so you know whether to add a warm pool.
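
A minimal guard for the handler above (a sketch; MAX_PROMPT_TOKENS and MAX_OUTPUT_TOKENS are illustrative limits matching the example SLO, not part of the original server):

# Reject oversized prompts and clamp requested output length before calling generate().
from fastapi import HTTPException

MAX_PROMPT_TOKENS = 200
MAX_OUTPUT_TOKENS = 256

def enforce_limits(prompt_tokens: int, requested_max_tokens: int) -> int:
    if prompt_tokens > MAX_PROMPT_TOKENS:
        raise HTTPException(status_code=413, detail=f"prompt exceeds {MAX_PROMPT_TOKENS} tokens")
    # Clamp rather than reject so well-behaved clients still get a response.
    return min(requested_max_tokens, MAX_OUTPUT_TOKENS)

# In /generate, right after tokenizing:
#     req.max_tokens = enforce_limits(input_tokens, req.max_tokens)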

Token and Cost Accounting

  • Log input_tokens and output_tokens per request; store aggregates in Prometheus or a cheap time-series DB (see the sketch after this list).
  • If you wrap paid APIs as fallbacks, tag requests by provider and model to attribute spend correctly.
  • Add a simple budget gate: if token output per minute exceeds a threshold, temporarily lower max_tokens or shed low-priority traffic.
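
One lightweight way to record these is prometheus_client (a sketch; the metric names and the metrics port are assumptions, not a convention this guide prescribes):

# Per-request token and latency metrics, labelled by model so fallback providers can be attributed.
from prometheus_client import Counter, Histogram, start_http_server

INPUT_TOKENS = Counter("llm_input_tokens_total", "Prompt tokens processed", ["model"])
OUTPUT_TOKENS = Counter("llm_output_tokens_total", "Completion tokens generated", ["model"])
LATENCY = Histogram("llm_generate_latency_seconds", "End-to-end /generate latency", ["model"])

def record_usage(model: str, input_tokens: int, output_tokens: int, latency_s: float) -> None:
    INPUT_TOKENS.labels(model=model).inc(input_tokens)
    OUTPUT_TOKENS.labels(model=model).inc(output_tokens)
    LATENCY.labels(model=model).observe(latency_s)

# Expose /metrics on a side port for Prometheus to scrape.
start_http_server(9000)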

Tracing

  • Use OpenTelemetry or a lightweight tracer to wrap the /generate handler (see the sketch after this list):
    • Span: llm.generate
    • Attributes: model, latency_ms, input_tokens, output_tokens, temperature, max_tokens
    • Events: start_load_model, end_load_model, start_generate, end_generate
  • Propagate a traceparent header from callers so you can connect frontend/API to inference spans.
  • Sample traces at a low rate (1–5%) in steady state; raise sampling when errors spike.
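
With the OpenTelemetry Python SDK, the wrapper might look like this (a sketch; it assumes a tracer provider and exporter are configured elsewhere, and it calls the generate handler from server.py above):

# One span per request; token counts and latency become span attributes after generation.
from opentelemetry import trace

tracer = trace.get_tracer("llm-inference")

def traced_generate(req):
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("model", model_name)
        span.set_attribute("temperature", req.temperature)
        span.set_attribute("max_tokens", req.max_tokens)
        span.add_event("start_generate")
        result = generate(req)  # the /generate handler defined above
        span.add_event("end_generate")
        span.set_attribute("latency_ms", result["latency_ms"])
        span.set_attribute("input_tokens", result["input_tokens"])
        span.set_attribute("output_tokens", result["output_tokens"])
        return result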

Deployment on a Budget

  • Fly: Use shared-cpu-1x for small models; attach a volume only if you need local weights caching. Pin to one region to avoid weight downloads per region. Add a fly.toml health check hitting /healthz.
  • Render: Use a starter service; mount a persistent disk if you want to cache weights. Disable autoscaling for deterministic cost, then scale up intentionally.
  • Container registry: Push images to GHCR; keep them slim by pruning unused CUDA libraries when you target CPU.
  • Warm starts: Load the model at process start, not per request. If memory is tight, consider a single-worker process with a small concurrency limit.

Hardening and Safety

  • Add simple content filters or a moderation hook if you expose the endpoint publicly.
  • Rate-limit by API key to prevent runaway costs and to isolate abusive tenants (see the sketch after this list).
  • Run with read-only filesystem where possible; keep only a cache directory writable for model weights.
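
A per-key limiter as a FastAPI dependency (an in-memory sketch for a single process; the X-API-Key header name and the limits are assumptions, and multi-replica deployments would need a shared store such as Redis):

# Naive fixed-window rate limit keyed by API key.
import time
from collections import defaultdict
from fastapi import Depends, Header, HTTPException

RATE_LIMIT = 30      # requests allowed per window
WINDOW_SECONDS = 60  # window length

_requests: dict[str, list[float]] = defaultdict(list)

def rate_limit(x_api_key: str = Header(...)) -> str:
    now = time.time()
    recent = [t for t in _requests[x_api_key] if now - t < WINDOW_SECONDS]
    if len(recent) >= RATE_LIMIT:
        raise HTTPException(status_code=429, detail="rate limit exceeded")
    recent.append(now)
    _requests[x_api_key] = recent
    return x_api_key

# Usage: @app.post("/generate", dependencies=[Depends(rate_limit)])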

Checklist to Ship

  • Docker image builds reproducibly; model pinned.
  • /generate enforces prompt/output limits and returns token counts and latency.
  • Logs include model, latency, input/output tokens; traces emit spans with attributes.
  • SLO defined (e.g., p95 ≀ 1200 ms @ 200-in/256-out) and monitored.
  • Deploy config for Fly/Render with health checks and timeouts set.
LLM Inference · Observability · Cost Control · SLOs
