NVIDIA DGX
At-Home AI Stack — split-trust shared compute

Two clustered NVIDIA DGX Spark nodes (arm64, Ubuntu 24.04) sharing a 256 GB unified memory pool through tensor parallelism (TP=2) over Ray on a 200 Gb/s direct-attach copper interconnect — but with split ownership. spark-01 is your node (your LiteLLM, your Open WebUI, your Hermes Agent, your n8n, your Tailscale). spark-02 is the client's node (their LiteLLM, their Open WebUI, their n8n, their Tailscale). Both LiteLLM proxies talk to the shared vLLM endpoint at spark-01:8000 over the DAC; neither application stack sees the other. Read the Trust model section before deploying — this architecture has specific properties at the API layer that you should understand explicitly.

2× DGX Spark · vLLM TP=2 · Ray · Qwen3.5-122B-A10B-GPTQ-Int4 · LiteLLM × 2 · Open WebUI × 2 · n8n × 2 · Hermes Agent · Tailscale × 2 (separate tailnets) · DAC interconnect · arm64 native

Architecture

Two physically separate DGX Spark nodes share a single tensor-parallel vLLM cluster (TP=2 over Ray on a 200 Gb/s DAC link) — but each node runs its own independent application stack owned by a different party. The diagram shows three logical layers: application stacks (top, separate per owner), LiteLLM proxies (middle, one per side, separate keys and logs), and the shared compute pool (bottom, TP=2 across both nodes, served by the vLLM head on spark-01:8000). Tailscale sits as a separate overlay on each node — the DAC link is its own private hardware and does not traverse Tailscale.

Figure: split-trust architecture. Top layer — two independent application stacks. spark-01 (YOUR NODE, 192.0.2.21 mgmt / 198.51.100.1 DAC, your Tailscale tailnet, ACLs you control): Your Open WebUI :8080 (your data), Your n8n :5678 (your flows), Your Hermes (Telegram, skills, memory), Your LiteLLM :8001 (your master_key, your SQLite log corpus) with api_base = http://localhost:8000/v1. Your users reach it via browser, Telegram, and VS Code over your tailnet. spark-02 (CLIENT NODE, 192.0.2.22 mgmt / 198.51.100.2 DAC, separate client tailnet, ACLs the client controls): Client Open WebUI :8080 (client's data), Client n8n :5678 (client's flows), Client LiteLLM :8001 (client's master_key, client's SQLite log corpus) with api_base = http://198.51.100.1:8000/v1 over the DAC. Bottom layer — SHARED COMPUTE POOL, TP=2 over Ray, single vLLM endpoint at 198.51.100.1:8000 that both LiteLLM proxies call: vLLM head (Ray master, TP rank 0) on spark-01:8000 with Ray GCS on :6379 (--tensor-parallel-size 2, --distributed-executor-backend ray, VLLM_HOST_IP=198.51.100.1, NCCL_SOCKET_IFNAME=enp1s0f0np0, GPU 0), and vLLM Ray worker (TP rank 1) on spark-02 with no API listener — it processes tensor activations only, no readable text (VLLM_HOST_IP=198.51.100.2, NCCL_SOCKET_IFNAME=enp1s0f0np0, GPU 1). Model: Qwen/Qwen3.5-122B-A10B-GPTQ-Int4 in the 256 GB unified memory pool (bootstrap fallback: Qwen/Qwen3.6-35B-A3B-FP8). NCCL allreduce and Ray control flow on the DAC (200 Gb/s direct-attach copper, enp1s0f0np0, MTU 9216, 198.51.100.0/30), which is NOT routed through either tailnet and is separate from the mgmt LAN. Tailscale overlays carry application traffic only.

Trust model

Read this before deploying. The split-trust architecture has specific properties at the API layer that you should understand explicitly. None of this is a new risk introduced by the cluster — it is just the same trust profile you accept any time you use a hosted inference API, made visible.

API-layer visibility — spark-01 sees all prompts

The vLLM head process runs on spark-01 and serves the OpenAI-compatible API on port 8000. Both LiteLLM proxies — yours and the client's — call this endpoint. That means the owner of spark-01 can, in principle, observe every raw prompt and every model output that crosses the API surface. This is structurally identical to the trust profile of any commercial hosted-inference provider (OpenAI, Anthropic, Together, etc.): the entity running the API server can see traffic at the API layer.

Tensor-layer isolation — spark-02 sees only floats

The Ray worker on spark-02 processes tensor activations, not text. It receives intermediate floating-point tensors over NCCL allreduce on the DAC link and contributes its share of the matrix multiplications. The client's node never sees readable prompts or completions; it only sees the mathematical operations its TP rank is responsible for. NCCL traffic on the DAC carries floats, not strings.

Application-layer isolation — fully separate stacks

Knowledge bases, chat history, RAG pipelines, vector indexes, API keys, request logs, and OAuth tokens are completely separate on each node. Your Open WebUI's database is on spark-01; the client's is on spark-02. Your LiteLLM master key is yours; the client's is the client's. Neither party has access to the other's application stack — there is no cross-mounted volume, no shared Postgres, no shared file system. The only thing that crosses the boundary is the inference call from the client's LiteLLM into spark-01:8000.

Network isolation — separate tailnets, private DAC

Each node joins its owner's Tailscale tailnet independently. ACLs on each tailnet are controlled by that owner. The DAC link (198.51.100.0/30) is private physical hardware between the two nodes — it is not routed through either Tailscale network and is not advertised on either tailnet. Tailscale carries application traffic only (clients reaching their own UIs); compute traffic stays on the DAC.
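A quick self-check of that separation from either node — a sketch assuming standard Tailscale-on-Linux behavior (peers listed by tailscale status, overlay routes programmed into routing table 52):

bash
# Only devices on THIS node's tailnet should appear
tailscale status

# The DAC subnet must not be routed through the overlay
ip route show table 52 | grep 198.51.100 \
  && echo "WARNING: DAC subnet is routed via Tailscale" \
  || echo "OK: DAC subnet stays off the tailnet"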

When this architecture is appropriate

  • Both parties have a working relationship and have agreed to this arrangement.
  • The data being sent for inference is not regulated (HIPAA / GDPR / SOC2 / PCI / etc.).
  • Both parties accept a trust model equivalent to using any commercial hosted-inference API.

When additional agreements are required

  • Either party handles regulated data — HIPAA, GDPR, SOC2, PCI-DSS, attorney-client privileged, or similar — in which case a written data processing agreement (DPA / BAA / equivalent) and audit controls are needed before traffic flows.
  • Either party has contractual data handling requirements imposed by their own customers or regulators.
  • The relationship is not pre-existing and the trust profile of "any hosted inference API" is not acceptable.
vLLM on spark-01:8000 has no authentication. Tailscale ACLs and host firewall rules are what prevent the client from bypassing their LiteLLM and hitting the unauthenticated endpoint directly. See Step 06 (Tailscale) for the ACL configuration that enforces this.

Hardware topology

| Node | Owner | Mgmt IP | DAC IP | Services |
|---|---|---|---|---|
| spark-01 | You (private) | 192.0.2.21 | 198.51.100.1 | vLLM head (Ray master, TP rank 0), Your LiteLLM, Your Open WebUI, Your Hermes Agent, Your n8n, Your Tailscale |
| spark-02 | Client (separate ownership) | 192.0.2.22 | 198.51.100.2 | vLLM Ray worker (TP rank 1), Client LiteLLM, Client Open WebUI, Client n8n, Client Tailscale |

Interconnects

  • DAC interconnect — enp1s0f0np0, MTU 9216, point-to-point 198.51.100.0/30. Carries NCCL for tensor-parallel collectives, Ray control, and the client LiteLLM's inference calls into spark-01:8000. Not routed through either Tailscale network. A quick link check appears after this list.
  • Mgmt interconnect — 192.0.2.0/24 over RJ45, default routes. Used for SSH and node-bootstrap traffic during setup.
  • Tailscale (each owner) — each node independently joins its owner's tailnet. Application traffic (browser → Open WebUI, Telegram → Hermes, etc.) traverses Tailscale. The DAC link is never advertised onto either tailnet.
  • SSH — passwordless both directions between spark-01 and spark-02 at the mgmt IPs (required for the rsync step in Step 01). After setup, this can be locked down or removed.
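A minimal DAC link check, referenced from the DAC bullet above — it assumes the interface name and /30 addressing from the worksheet; run it on both nodes before bringing up the cluster:

bash
ip -br addr show enp1s0f0np0                      # expect 198.51.100.1/30 here, .2/30 on the peer
ip link show enp1s0f0np0 | grep -o 'mtu [0-9]*'   # expect mtu 9216
ping -c 3 -M do -s 8972 198.51.100.2              # jumbo payload, DF set — fails if MTU 9216 isn't end-to-end
                                                  # (from spark-02, ping 198.51.100.1 instead)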

Architecture principles

  1. Shared compute, split application. The vLLM cluster is the only shared resource. Application stacks above it (LiteLLM, Open WebUI, Hermes, n8n, Tailscale) are duplicated and independently owned.
  2. Both LiteLLMs hit the same vLLM endpoint. Your LiteLLM uses http://localhost:8000/v1; client's LiteLLM uses http://198.51.100.1:8000/v1 over the DAC. Neither proxy goes through the other's stack.
  3. Separate keys, separate logs, separate data. Each LiteLLM has its own master key; each Open WebUI has its own knowledge bases and chat history. Nothing is shared at the application layer.
  4. Tailscale is per-owner. Two separate tailnets, two separate ACL policies. Cross-tailnet traffic only happens if both owners explicitly configure it (which by default they do not).
  5. Single-instance per side. No clustered Open WebUI / clustered n8n on either node. HA modes are documented in the appendix only.

Network worksheet

Fill these in once. Every code block on this page that contains a matching placeholder (YOUR_NODE1_MGMT_IP, YOUR_USERNAME, etc.) will be live-substituted with the value you type — and a yellow highlight shows you what was filled in. Values are saved to your browser's localStorage so reloads keep them. Master keys, API keys, and other secrets are deliberately not in this worksheet — fill those into the relevant code blocks manually so they never touch localStorage.

Figure: worksheet IP slots. spark-01 (YOUR NODE): mgmt IP YOUR_NODE1_MGMT_IP, DAC IP YOUR_NODE1_DAC_IP, Tailscale IP YOUR_NODE1_TAILSCALE_IP. spark-02 (CLIENT NODE): mgmt IP YOUR_NODE2_MGMT_IP, DAC IP YOUR_NODE2_DAC_IP, Tailscale IP YOUR_NODE2_TAILSCALE_IP. The 200 Gb/s DAC runs between the two nodes; each node also attaches to the mgmt LAN and to its owner's tailnet. One more slot: YOUR_TAILNET_HOSTNAME — your tailnet hostname, used in the n8n WEBHOOK_URL.

Not saved — secrets stay manual. YOUR_MASTER_KEY, YOUR_CLIENT_MASTER_KEY, and YOUR_BRAVE_API_KEY are intentionally not in this worksheet — fill those into the relevant code blocks by hand, and don't paste them into a browser-stored field. The worksheet only handles network identifiers and your username.

Prerequisites

  • Two NVIDIA DGX Spark nodes — Grace CPU, GB10 GPU, arm64/aarch64, each running Ubuntu 24.04
  • Each node referred to as spark-01 (your node) and spark-02 (client node) — substitute your own hostnames
  • Both parties have read and accepted the Trust model section above
  • Docker installed and enabled on both nodes: sudo systemctl enable docker
  • Your Linux user added to the docker group on both nodes: sudo usermod -aG docker YOUR_USERNAME && newgrp docker
  • 200 Gb/s DAC link between the two nodes (interface enp1s0f0np0 on both, MTU 9216, point-to-point /30)
  • Mgmt LAN reachability between both nodes (1 GbE RJ45 with default routes)
  • Passwordless SSH both directions (spark-01 ↔ spark-02) — required for the HF cache rsync in Step 01
  • Mgmt-IP entries in /etc/hosts on both nodes so hostnames resolve to mgmt addresses, not the DAC IP (commands below)
  • Replace YOUR_USERNAME with your Linux username throughout
  • Replace YOUR_NODE1_MGMT_IP / YOUR_NODE2_MGMT_IP with each node's mgmt IP, and YOUR_NODE1_DAC_IP / YOUR_NODE2_DAC_IP with each node's DAC IP

Bootstrap on both nodes — /etc/hosts and docker group

By default, the hostname of each node resolves to its DAC IP (198.51.100.x), not the mgmt IP. SSH from one node to the other by hostname will fail until you anchor the hostnames to mgmt IPs explicitly.

#### Run on both spark-01 AND spark-02

bash
# Add mgmt-IP entries for both nodes
echo "YOUR_NODE1_MGMT_IP  spark-01" | sudo tee -a /etc/hosts
echo "YOUR_NODE2_MGMT_IP  spark-02" | sudo tee -a /etc/hosts

# Add your user to the docker group (then re-login or use newgrp)
sudo usermod -aG docker YOUR_USERNAME
newgrp docker

# Verify SSH by hostname both directions
ssh spark-01 hostname    # from spark-02
ssh spark-02 hostname    # from spark-01
If you skip the /etc/hosts step, the rsync of the Hugging Face cache between nodes (Step 01) and any later ssh spark-0X command will silently target the DAC interface — which won't have sshd bound to it unless you've changed defaults. The symptom is a "connection refused" or hang.
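A one-line confirmation that the /etc/hosts anchoring took — getent resolves through the same NSS path SSH uses:

bash
getent hosts spark-01 spark-02   # both lines must show the mgmt IPs, not 198.51.100.x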
STEP 01

vLLM clustered — TP=2 over Ray on the DAC link

vLLM is the only clustered service. The model runs with tensor-parallel size 2: spark-01 hosts the Ray master and the vLLM head process; spark-02 hosts a Ray worker. NCCL traffic for tensor-parallel collectives flows over the DAC link (enp1s0f0np0, MTU 9216).

Production model and bootstrap fallback

| Track | Model | Notes |
|---|---|---|
| Production (default) | Qwen/Qwen3.5-122B-A10B-GPTQ-Int4 | The intended daily driver. Larger MoE (122B / A10B), GPTQ-Int4 quantized — meaningfully better quality than the 35B, still fits comfortably in the clustered 256 GB pool (~34 GB weights per node + KV cache). At --gpu-memory-utilization 0.80 the RayWorkerWrapper shows roughly ~96 GB resident on each node (weights + KV cache). On a cold cache, the GPTQ-Marlin JIT compile adds ~10–20 minutes to first launch — see Cluster issues. |
| Bootstrap fallback | Qwen/Qwen3.6-35B-A3B-FP8 | The model used to bring the cluster up the first time. 35B MoE / A3B activation, FP8 quantized. Useful for fast iteration on cluster wiring (Ray, NCCL, DAC) before committing to the longer 122B load. Switching back is a flag change — see "Bootstrap fallback" below. |
| Tested but does not fit | Qwen/Qwen3-235B-A22B-FP8 | Does not fit: 235B at FP8 ≈ 235 GB total ≈ 117.5 GB per node — leaves no room for KV cache or OS in a 121 GB usable per-node pool. Ray kills the worker with an OutOfMemoryError at the 95% memory threshold. Use a GPTQ-Int4 quantization (a 235B INT4 lands at ~58 GB/node) or, for daily use, the 122B above. |
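The sizing figures in the table fall out of simple weight arithmetic — params × bits ÷ 8 gives the weight footprint in GB, halved across the two TP ranks. A rough sketch (it ignores embedding, activation, and quantization-scale overhead, which is why measured residency runs higher):

bash
awk 'BEGIN {
  # params (billions) × bits ÷ 8 = weight GB; ÷ 2 for TP=2
  printf "122B @ INT4: %6.1f GB total, %6.1f GB/node\n", 122*4/8, 122*4/8/2
  printf "235B @ FP8:  %6.1f GB total, %6.1f GB/node\n", 235*8/8, 235*8/8/2
  printf "235B @ INT4: %6.1f GB total, %6.1f GB/node\n", 235*4/8, 235*4/8/2
}'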

Step 1a — Build the custom vLLM image on both nodes

The NGC image nvcr.io/nvidia/vllm:26.04-py3 ships without Ray. vLLM 0.19.0+nv26.04 hard-requires Ray for any multi-node inference — torch.distributed.run is not a substitute, because vLLM validates Ray at engine init regardless of launch method. You must build a custom image that adds Ray, on both nodes, before any cluster launch attempt. This is step zero, not optional.

#### Run on both spark-01 AND spark-02

bash
mkdir -p ~/sparky-ai-stack
cat > ~/sparky-ai-stack/Dockerfile.vllm-spark << 'EOF'
FROM nvcr.io/nvidia/vllm:26.04-py3
RUN pip install ray --quiet
EOF

cd ~/sparky-ai-stack
docker build -f Dockerfile.vllm-spark -t vllm-spark:26.04 .
Verify on each node: docker run --rm vllm-spark:26.04 ray --version — expected ray, version 2.x.x

Step 1b — Sync the Hugging Face cache to spark-02 over DAC

Both nodes need the model weights resident locally. Pull on spark-01 first (or use an existing ~/.cache/huggingface), then rsync to spark-02 across the DAC link.

#### spark-01

bash
# Pre-fetch the production 122B GPTQ-Int4 weights on spark-01 into the default
# HF hub cache (~/.cache/huggingface/hub) — no --local-dir, so the hub layout
# that vLLM expects (models--*/snapshots/*) is preserved.
# Run as your normal user (YOUR_USERNAME) — NOT root. See callout below.
hf download Qwen/Qwen3.5-122B-A10B-GPTQ-Int4

# Rsync the cache to spark-02 over the DAC link
rsync -avh --progress \
  -e "ssh -o StrictHostKeyChecking=accept-new" \
  ~/.cache/huggingface/ \
  YOUR_NODE2_DAC_IP:/home/YOUR_USERNAME/.cache/huggingface/
Sending the cache over the DAC keeps it off your mgmt LAN — the 122B GPTQ-Int4 weights are roughly 60 GB. Verify the rsync ran across enp1s0f0np0 on spark-02 with nload enp1s0f0np0 in another shell during the transfer.
Run hf download as your normal user, not root. The legacy huggingface-cli command is deprecated; use hf download from the new huggingface_hub CLI. If you previously ran the download as root, the target directory will be root-owned and subsequent runs as a normal user fail with permission denied. Fix: sudo rm -rf ~/.cache/huggingface/hub/models--Qwen--Qwen3.5-122B-A10B-GPTQ-Int4, then recreate it and rerun the download as your user.

Step 1c — Startup scripts on each node

Place a startup script on each node so the cluster can be brought up reproducibly.

#### spark-01 — ~/sparky-ai-stack/scripts/vllm-head.sh

bash
mkdir -p ~/sparky-ai-stack/scripts
cat > ~/sparky-ai-stack/scripts/vllm-head.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail

# DAC IP of THIS node (spark-01)
HOST_IP="YOUR_NODE1_DAC_IP"
DAC_IFACE="enp1s0f0np0"

docker rm -f vllm-qwen-122b 2>/dev/null || true

docker run -d \
  --name vllm-qwen-122b \
  --network host \
  --ipc host \
  --gpus all \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e VLLM_HOST_IP="${HOST_IP}" \
  -e NCCL_SOCKET_IFNAME="${DAC_IFACE}" \
  -e GLOO_SOCKET_IFNAME="${DAC_IFACE}" \
  -e RAY_ADDRESS="${HOST_IP}:6379" \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -v "${HOME}/sparky-ai-stack/vllm-compile-cache:/root/.cache/vllm/torch_compile_cache" \
  --restart unless-stopped \
  vllm-spark:26.04 \
  bash -lc '
    set -e
    ray start --head \
      --node-ip-address="'"${HOST_IP}"'" \
      --port=6379 \
      --dashboard-host=0.0.0.0 \
      --num-gpus=1 \
      --block &
    # Was 20s previously — bumped to 60s so a simultaneous power-loss reboot of
    # both nodes still gives the worker container time to come up and register
    # before vLLM begins engine init. The until-loop below is the real guard,
    # but the longer initial sleep avoids racing the loop on a cold boot.
    sleep 60
    until ray status >/dev/null 2>&1; do sleep 1; done
    echo "[head] ray up, waiting for worker to join..."
    until [ "$(ray status 2>/dev/null | grep -c '"'"'1.0/1.0 GPU'"'"')" -gt 1 ] || \
          [ "$(ray status 2>/dev/null | grep -c '"'"'GPU'"'"')" -ge 2 ]; do sleep 2; done
    echo "[head] worker joined, starting vllm serve"

    exec vllm serve Qwen/Qwen3.5-122B-A10B-GPTQ-Int4 \
      --served-model-name qwen3.5-122b \
      --dtype auto \
      --gpu-memory-utilization 0.80 \
      --max-model-len 65536 \
      --max-num-batched-tokens 4096 \
      --enable-auto-tool-choice \
      --tool-call-parser hermes \
      --enable-chunked-prefill \
      --enable-prefix-caching \
      --max-num-seqs 32 \
      --host 0.0.0.0 \
      --port 8000 \
      --reasoning-parser qwen3 \
      --default-chat-template-kwargs '"'"'{"enable_thinking": false}'"'"' \
      --tensor-parallel-size 2 \
      --pipeline-parallel-size 1 \
      --distributed-executor-backend ray
  '
EOF
chmod +x ~/sparky-ai-stack/scripts/vllm-head.sh

#### spark-02 — ~/sparky-ai-stack/scripts/vllm-worker.sh

bash
mkdir -p ~/sparky-ai-stack/scripts
cat > ~/sparky-ai-stack/scripts/vllm-worker.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail

# DAC IPs
WORKER_IP="YOUR_NODE2_DAC_IP"   # this node (spark-02)
HEAD_IP="YOUR_NODE1_DAC_IP"     # ray head on spark-01
DAC_IFACE="enp1s0f0np0"

docker rm -f vllm-ray-worker 2>/dev/null || true

docker run -d \
  --name vllm-ray-worker \
  --network host \
  --ipc host \
  --gpus all \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e VLLM_HOST_IP="${WORKER_IP}" \
  -e NCCL_SOCKET_IFNAME="${DAC_IFACE}" \
  -e GLOO_SOCKET_IFNAME="${DAC_IFACE}" \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -v "${HOME}/sparky-ai-stack/vllm-compile-cache:/root/.cache/vllm/torch_compile_cache" \
  --restart unless-stopped \
  vllm-spark:26.04 \
  bash -lc '
    # Retry forever until the head is reachable; this is fine and expected
    until ray start \
            --address="'"${HEAD_IP}"':6379" \
            --node-ip-address="'"${WORKER_IP}"'" \
            --num-gpus=1 \
            --block; do
      echo "[worker] head not yet reachable, retrying in 3s..."
      sleep 3
    done
  '
EOF
chmod +x ~/sparky-ai-stack/scripts/vllm-worker.sh
Both containers run with --network host and --ipc host so NCCL and Ray see the real DAC interface and shared memory. VLLM_HOST_IP is set to each node's DAC IP — see the warning under Step 1d for why this is required.

Step 1c-bis — Persist the vLLM compile cache (both nodes)

On first launch, GPTQ-Marlin and torch.compile run JIT compilation that takes 10–20 minutes silently per node — the log will appear to stall after the engine reports weights loaded. The compiled artifacts land at /root/.cache/vllm/torch_compile_cache inside the container. Because the container is ephemeral (we recreate it on every vllm-head.sh / vllm-worker.sh run), that cache is lost every time and the cluster re-pays the full compile cost on every restart.

Fix: mount a host directory into the container so the cache survives container recreation. The two docker run commands above already include the mount:

bash
-v "${HOME}/sparky-ai-stack/vllm-compile-cache:/root/.cache/vllm/torch_compile_cache"

Create the host directory on each node before the first launch, otherwise Docker will create it root-owned and subsequent unprivileged access will fail:

bash
mkdir -p ~/sparky-ai-stack/vllm-compile-cache   # run on BOTH spark-01 and spark-02
After the first successful load the cache is populated. All subsequent starts skip compilation entirely and load in ~5 minutes instead of 20–30.
The cache is model and quantization specific. If you switch models (for example, falling back from the 122B GPTQ-Int4 to the 35B FP8), clear the cache directory first on both nodes — rm -rf ~/sparky-ai-stack/vllm-compile-cache/* — or you'll see kernel-shape mismatches at engine init.
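A populated cache directory on both nodes is the cheap signal that the next restart will skip compilation (exact sizes vary by model and vLLM version):

bash
du -sh ~/sparky-ai-stack/vllm-compile-cache   # should be non-trivial on BOTH nodes after the first load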

Step 1d — Launch order: worker first, then head

Start the worker container first, then the head. This ordering is intentional:

  • The worker's ray start --address …:6379 will retry forever until the head's GCS comes up — this is the expected path and is harmless.
  • If the head starts first and vLLM begins engine init before the worker has joined Ray, the placement group fires with only one GPU visible and the run hangs in an indefinite allocation failure with no clean error.
  • The head script above blocks vllm serve behind a Ray-status check that waits for two GPUs to be registered, which makes the launch idempotent.

#### spark-02 — start the worker

bash
~/sparky-ai-stack/scripts/vllm-worker.sh
docker logs -f vllm-ray-worker   # leave open in another shell

The worker will loop on "head not yet reachable" retries — expected, since the head isn't up yet. Once you start the head in the next step, the worker logs print Ray runtime started and the retries stop.

#### spark-01 — start the head

bash
~/sparky-ai-stack/scripts/vllm-head.sh
docker logs -f vllm-qwen-122b

The head will: (1) start Ray as the master, (2) wait for the worker to register a second GPU into the cluster, then (3) start vllm serve. Allow ~5–8 minutes for the GPTQ-Int4 weights to load on both nodes before the engine is ready. On a cold cache the GPTQ-Marlin JIT compile silently adds another 10–20 minutes after weight loading completes — the log will appear to stall after ray_env.py:111 while the kernels compile inside the RayWorkerWrapper. This is expected; see Cluster issues. Compiled artifacts are cached so subsequent startups skip this step.

Step 1e — Verification

#### spark-01 — Ray cluster status

bash
docker exec vllm-qwen-122b ray status
Expected: 2 nodes total, 2.0/2.0 GPU, both DAC IPs listed (YOUR_NODE1_DAC_IP and YOUR_NODE2_DAC_IP)

#### spark-01 — model endpoint

bash
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen3.5-122b","messages":[{"role":"user","content":"hi"}],"max_tokens":16}'
Expected: {"data":[{"id":"qwen3.5-122b",...}]} on the first call, a streamed completion on the second

#### Both nodes — GPU residency (GB10 quirk)

GB10 (Grace Blackwell) uses unified memory. The standard --query-gpu=memory.used,memory.total fields return [N/A] on this hardware — that is expected, not a bug. Use plain nvidia-smi and read the Processes section instead:

bash
nvidia-smi   # run on each node

You should see a RayWorkerWrapper (or vllm) process on each node with roughly ~96 GB resident at --gpu-memory-utilization 0.80 with the 122B GPTQ-Int4 model loaded (≈34 GB of weights per node + KV cache).

#### spark-02 — NCCL traffic on the DAC during inference

bash
nload enp1s0f0np0   # watch this while the curl chat-completion above runs

You should see Gb/s-class spikes on the DAC during decoding. If you see traffic on a different interface, NCCL_SOCKET_IFNAME didn't propagate — see the cluster troubleshooting section.
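Before digging deeper, confirm the env vars actually made it into both containers — if either line comes back empty, the startup scripts didn't apply:

bash
# spark-01
docker exec vllm-qwen-122b env | grep -E 'VLLM_HOST_IP|NCCL_SOCKET_IFNAME|GLOO_SOCKET_IFNAME'
# spark-02
docker exec vllm-ray-worker env | grep -E 'VLLM_HOST_IP|NCCL_SOCKET_IFNAME|GLOO_SOCKET_IFNAME'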

Bootstrap fallback — the 35B FP8 model

The Qwen/Qwen3.6-35B-A3B-FP8 model is the original bootstrap model and remains useful for fast iteration on cluster wiring (Ray, NCCL, DAC) before committing to the longer 122B load. If the cache is already populated with the 35B FP8 weights, swap the vllm serve line in ~/sparky-ai-stack/scripts/vllm-head.sh and the container name (vllm-qwen-122b → vllm-qwen-35b):

bash
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --served-model-name qwen3.6-35b \
  --dtype auto \
  --gpu-memory-utilization 0.80 \
  --max-model-len 131072 \
  --max-num-batched-tokens 4096 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --max-num-seqs 32 \
  --host 0.0.0.0 \
  --port 8000 \
  --reasoning-parser qwen3 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 1 \
  --distributed-executor-backend ray

Update --served-model-name in the LiteLLM config in Step 02 (and the client's LiteLLM in Step 03) if you fall back. Expect ~97 GB resident per node at FP8 instead of ~96 GB at GPTQ-Int4.

STEP 02

Your LiteLLM proxy on spark-01

This is your LiteLLM proxy — your master key, your SQLite log corpus, your routing rules. It serves only your application stack on spark-01 (your Open WebUI, your Hermes, your n8n). The client gets their own separate LiteLLM in Step 03.

Your LiteLLM lives on the same node as the vLLM head and points at localhost:8000. The clustered vLLM presents one logical OpenAI-compatible endpoint — LiteLLM doesn't need to know there are two physical nodes behind it.

LiteLLM has no arm64 Docker image — install via pip directly on the host. This is unchanged from a single-node setup.

#### spark-01 — directories and install

bash
mkdir -p ~/sparky-ai-stack/logs
cd ~/sparky-ai-stack

pip3 install litellm[proxy] --break-system-packages
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
Verify: ~/.local/bin/litellm --version

#### spark-01 — config (single backend pointing at the local clustered vLLM)

bash
cat > ~/sparky-ai-stack/litellm-config.yaml << 'EOF'
model_list:
  - model_name: qwen3.5-122b
    litellm_params:
      model: openai/qwen3.5-122b
      api_base: http://localhost:8000/v1
      api_key: "not-needed"

litellm_settings:
  verbose: true
  database:
    type: sqlite
    path: /home/YOUR_USERNAME/sparky-ai-stack/logs/litellm.db
  log_config:
    level: INFO
    format: json
    filepath: /home/YOUR_USERNAME/sparky-ai-stack/logs/litellm.log

router_settings:
  num_retries: 0
  timeout: 600
EOF
Remove old dual-model entries. If this LiteLLM was previously configured with a second backend (e.g. localhost:8002 for an "expert" model), delete every model and fallback entry that points at a port no longer running. LiteLLM will throw httpx.ConnectError on startup if any configured backend is unreachable.
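A quick pre-flight for that callout — list every configured backend and confirm the one that should remain is actually listening (a sketch; any other port it prints needs a live listener too):

bash
grep -n 'api_base' ~/sparky-ai-stack/litellm-config.yaml   # expect only localhost:8000
ss -ltn | grep ':8000 '                                    # the vLLM head must be listening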

#### spark-01 — systemd service

bash
sudo tee /etc/systemd/system/litellm.service << 'EOF'
[Unit]
Description=LiteLLM Proxy
After=network.target docker.service
Wants=docker.service

[Service]
Type=simple
User=YOUR_USERNAME
WorkingDirectory=/home/YOUR_USERNAME/sparky-ai-stack
ExecStart=/home/YOUR_USERNAME/.local/bin/litellm --config litellm-config.yaml --port 8001 --host 0.0.0.0
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable litellm
sudo systemctl start litellm

#### spark-01 — clean stale override.conf (only if upgrading from a previous setup)

If LiteLLM was previously run with STORE_MODEL_IN_DB=True and a DATABASE_URL in a systemd drop-in, those env vars persist even after you remove them from litellm-config.yaml. LiteLLM will fail to start with httpx.ConnectError against a Postgres that may no longer exist. Reset the drop-in:

bash
sudo mkdir -p /etc/systemd/system/litellm.service.d
sudo bash -c 'cat > /etc/systemd/system/litellm.service.d/override.conf << EOF
[Service]
Environment=PYTHONPATH=/home/YOUR_USERNAME/.local/lib/python3.12/site-packages
EOF'

sudo systemctl daemon-reload
sudo systemctl restart litellm
sudo systemctl status litellm --no-pager
Verify from spark-01: curl http://localhost:8001/v1/models (no key needed if you didn't set master_key yet, otherwise use -H "Authorization: Bearer YOUR_MASTER_KEY")
Your LiteLLM should NOT be reachable from spark-02. If you are using Tailscale ACLs (Step 06), only your tailnet should reach spark-01:8001. The client uses their own LiteLLM (Step 03) — they never call yours.
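To exercise the full path end to end (your LiteLLM → clustered vLLM), mirror the client-side round-trip from Step 03 — this assumes you have set master_key in general_settings:

bash
curl http://localhost:8001/v1/chat/completions \
  -H "Authorization: Bearer YOUR_MASTER_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen3.5-122b","messages":[{"role":"user","content":"ping"}],"max_tokens":8}'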

Step 2d — PostgreSQL for the Admin UI and virtual keys (required)

The LiteLLM Admin UI and virtual-key generation both require PostgreSQL. SQLite is not supported — LiteLLM's Prisma schema is hardcoded for PostgreSQL, and the UI route returns table public.LiteLLM_UserTable does not exist if you try to point it at SQLite. The SQLite block in the config above is fine for request logging; it is not a substitute for the metadata DB.

#### spark-01 — bring up the litellm-db postgres container

On spark-01 the postgres container lives in ~/sparky-ai-stack/docker-compose.yml alongside n8n and hermes-webui:

yaml
services:
  litellm-db:
    image: postgres:16
    container_name: litellm-db
    restart: unless-stopped
    environment:
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=litellm
      - POSTGRES_DB=litellm
    volumes:
      - litellm_db:/var/lib/postgresql/data
    ports:
      - "5432:5432"

volumes:
  litellm_db:
bash
cd ~/sparky-ai-stack
docker compose up -d litellm-db
docker compose ps litellm-db

#### spark-01 — apply the Prisma schema

Install the Prisma CLI if missing, then push the LiteLLM Prisma schema into the new database. This must run once on spark-01 before LiteLLM starts, otherwise the UI will return table public.LiteLLM_UserTable does not exist.

bash
pip install prisma --break-system-packages

DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
  prisma db push \
  --schema /home/YOUR_USERNAME/.local/lib/python3.12/site-packages/litellm/proxy/schema.prisma
Expected: Your database is now in sync with your Prisma schema. Done in <Ns>

#### spark-01 — wire the database into litellm-config.yaml

Add the database_url to general_settings (not at the top level — see troubleshooting) and enable model-in-DB storage so the UI can edit the model list:

yaml
general_settings:
  master_key: YOUR_MASTER_KEY
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"

litellm_settings:
  store_model_in_db: true

Restart and verify:

bash
sudo systemctl restart litellm
sudo systemctl status litellm --no-pager
curl -s http://localhost:8001/health/readiness | head
UI is at http://spark-01:8001/ui — log in with admin and your master key. Generate per-service virtual keys from the Virtual Keys tab (one for Open WebUI, one for n8n, one for Hermes — never paste the master key into a downstream service).
STEP 03

Client LiteLLM proxy on spark-02

The client gets their own LiteLLM proxy on spark-02, with their own master key, their own log corpus, and their own routing rules. It points at the shared vLLM endpoint over the DAC link. This is not a copy of spark-01's LiteLLM — it has no shared config, no shared key, no shared logs. The client controls their own master key and never shares it with you.

If you are setting up spark-02 on behalf of the client, hand off the master-key generation step (or have them rotate the key the moment they take over). The point of split trust is that you do not hold the client's API credentials.

#### spark-02 — install

bash
mkdir -p ~/sparky-ai-stack/logs
cd ~/sparky-ai-stack

pip3 install litellm[proxy] --break-system-packages
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc

#### spark-02 — generate the client master key

Run on the client's terminal — store this key only on spark-02:

bash
echo "sk-client-$(openssl rand -hex 16)"

#### spark-02 — config (points at vLLM over the DAC)

bash
cat > ~/sparky-ai-stack/litellm-config.yaml << 'EOF'
model_list:
  - model_name: qwen3.5-122b
    litellm_params:
      model: openai/qwen3.5-122b
      api_base: http://YOUR_NODE1_DAC_IP:8000/v1   # shared vLLM, over DAC
      api_key: "not-needed"

litellm_settings:
  verbose: true
  database:
    type: sqlite
    path: /home/YOUR_USERNAME/sparky-ai-stack/logs/litellm.db
  log_config:
    level: INFO
    format: json
    filepath: /home/YOUR_USERNAME/sparky-ai-stack/logs/litellm.log

general_settings:
  master_key: YOUR_CLIENT_MASTER_KEY    # set to the sk-client-... value above

router_settings:
  num_retries: 0
  timeout: 600
EOF
The DAC is the fastest path between the two nodes — substantially lower latency than going over the mgmt LAN. Do not point the client LiteLLM at YOUR_NODE1_MGMT_IP:8000 unless the DAC is down.
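A rough way to see the path difference from spark-02 (absolute numbers vary with your hardware; the DAC RTT should be clearly lower):

bash
ping -c 3 YOUR_NODE1_DAC_IP    # DAC path — what the client LiteLLM uses
ping -c 3 YOUR_NODE1_MGMT_IP   # mgmt LAN path — fallback only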

#### spark-02 — systemd service (independent of spark-01)

bash
sudo tee /etc/systemd/system/litellm.service << 'EOF'
[Unit]
Description=LiteLLM Proxy (client)
After=network.target docker.service
Wants=docker.service

[Service]
Type=simple
User=YOUR_USERNAME
WorkingDirectory=/home/YOUR_USERNAME/sparky-ai-stack
ExecStart=/home/YOUR_USERNAME/.local/bin/litellm --config litellm-config.yaml --port 8001 --host 0.0.0.0
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable litellm
sudo systemctl start litellm
sudo systemctl status litellm --no-pager

Verification — confirm the request hits spark-01:8000

#### spark-02 — local LiteLLM responds with the client key

bash
curl http://localhost:8001/v1/models \
  -H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY"

#### spark-02 — inference round-trip through the shared backend

bash
curl http://localhost:8001/v1/chat/completions \
  -H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen3.5-122b","messages":[{"role":"user","content":"hi from client"}],"max_tokens":16}'
On spark-01, tail -f ~/sparky-ai-stack/logs/litellm.log shows nothing — your LiteLLM is not in the path. Instead, run docker logs --tail 20 vllm-qwen-122b on spark-01 — you should see the new request reach the vLLM head.

Step 3d — PostgreSQL for the client Admin UI and virtual keys (required)

Same constraint as Step 02: the LiteLLM Admin UI and virtual-key generation require PostgreSQL — SQLite is not supported because LiteLLM's Prisma schema is hardcoded for PostgreSQL. spark-02 doesn't run a docker-compose stack, so we use a standalone postgres container.

#### spark-02 — standalone postgres container

bash
docker run -d --name litellm-db --restart unless-stopped \
  -e POSTGRES_USER=litellm \
  -e POSTGRES_PASSWORD=litellm \
  -e POSTGRES_DB=litellm \
  -p 5432:5432 \
  -v litellm_db:/var/lib/postgresql/data \
  postgres:16

#### spark-02 — apply the Prisma schema

bash
pip install prisma --break-system-packages

DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
  prisma db push \
  --schema /home/YOUR_USERNAME/.local/lib/python3.12/site-packages/litellm/proxy/schema.prisma
Expected: Your database is now in sync with your Prisma schema.

#### spark-02 — wire the database into the client litellm-config.yaml

database_url goes under general_settings, alongside the existing master_key. Add store_model_in_db: true under litellm_settings:

yaml
general_settings:
  master_key: YOUR_CLIENT_MASTER_KEY
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"

litellm_settings:
  store_model_in_db: true
bash
sudo systemctl restart litellm
sudo systemctl status litellm --no-pager
Client UI is at http://spark-02:8001/ui — log in with admin and the client master key. The client generates their own per-app virtual keys from the Virtual Keys tab; you never see them.
If a virtual key was generated before the schema was fully applied (e.g. you generated a key, then re-ran prisma db push), the old key will appear in the UI but lookups will fail with Virtual key not found in LiteLLM_VerificationTokenTable. Delete it in the UI, restart LiteLLM, then generate a new one.
STEP 04

Your Open WebUI on spark-01

Your daily-driver chat interface, owned by you, on your node. It points at your LiteLLM at http://localhost:8001/v1. The client gets their own Open WebUI on spark-02 in Step 05 — neither side can see the other's chat history, knowledge bases, RAG documents, or API keys.

#### spark-01 — directory and run

bash
mkdir -p ~/sparky-ai-stack
cd ~/sparky-ai-stack

docker run -d \
  --name open-webui \
  --restart unless-stopped \
  -p 8080:8080 \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URL="http://host.docker.internal:8001/v1" \
  -e OPENAI_API_KEY="YOUR_MASTER_KEY" \
  -e WEBUI_AUTH=True \
  -e ENABLE_OLLAMA_API=False \
  --add-host=host.docker.internal:host-gateway \
  ghcr.io/open-webui/open-webui:main

Visit http://localhost:8080 from spark-01 (or via your tailnet — see Step 06), create your admin account, and confirm in Settings → Connections → OpenAI API:

| Setting | Value |
|---|---|
| API Base URL | http://host.docker.internal:8001/v1 |
| API Key | your master_key |
| Default model | qwen3.5-122b |
| Memory | Toggle ON (Settings → Personalization) |
Send a chat message — it should stream back via your LiteLLM (Step 02) → vLLM head (Step 01).
STEP 05

Client Open WebUI on spark-02

The client's daily-driver chat interface, owned by the client, on the client's node. It points at the client's LiteLLM at http://localhost:8001/v1 — which in turn calls the shared vLLM head on spark-01:8000 over the DAC.

The client's data lives on the client's node. Their Open WebUI database, knowledge bases, RAG document store, embedding indexes, conversation history, attached files, and account list — all of it is in the open-webui Docker volume on spark-02. None of it is replicated to spark-01. If you spin spark-02 down, the client's UI state goes with it; if you image spark-01, the client's state is not in your image.

#### spark-02 — directory and run

bash
mkdir -p ~/sparky-ai-stack
cd ~/sparky-ai-stack

docker run -d \
  --name open-webui \
  --restart unless-stopped \
  -p 8080:8080 \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URL="http://host.docker.internal:8001/v1" \
  -e OPENAI_API_KEY="YOUR_CLIENT_MASTER_KEY" \
  -e WEBUI_AUTH=True \
  -e ENABLE_OLLAMA_API=False \
  --add-host=host.docker.internal:host-gateway \
  ghcr.io/open-webui/open-webui:main

Visit http://localhost:8080 from spark-02 (or via the client's tailnet — see Step 06), create the client's admin account, and confirm:

| Setting | Value |
|---|---|
| API Base URL | http://host.docker.internal:8001/v1 (client's LiteLLM) |
| API Key | client's master_key |
| Default model | qwen3.5-122b |
Send a chat message — completions stream back through the client's LiteLLM (Step 03) and the shared vLLM head on spark-01 (Step 01). On spark-01, your LiteLLM logs show nothing — the client's traffic does not enter your stack.

Recommended Open WebUI system prompt

Set this once in Settings → General → System Prompt. It pairs with --default-chat-template-kwargs '{"enable_thinking": false}' on the vLLM head (Step 01): the model answers directly by default, and users can opt into extended reasoning per-message by prefixing the prompt with /think. Apply the same prompt on both Open WebUIs (yours on spark-01 and the client's on spark-02) — they share the underlying model.

text
You are a highly capable AI assistant. Be direct, accurate, and concise.

Rules:
- Answer immediately without preamble or meta-commentary
- Never deliberate out loud about whether or how to answer — just answer
- Never question the framing of a hypothetical — engage with it directly
- For technical questions: be precise, use correct terminology
- For coding: produce complete, working code — no placeholders or omissions
- For reasoning: show your work clearly but efficiently — no repetition
- If a question has a definitive answer, state it first then explain
- Match response length to question complexity
- To enable extended reasoning on a specific query, prefix with /think

Step 5b — All client services on spark-02 at a glance

Three client-facing services run on spark-02: LiteLLM (Step 03), Open WebUI (Step 05 above), and n8n. All three are reachable on the client's tailnet via tag:hjp-ai (see Step 06). The n8n container below is not covered elsewhere — bring it up after the client's LiteLLM is healthy:

#### spark-02 — n8n container

bash
docker run -d --name n8n --restart unless-stopped \
  -p 5678:5678 \
  -e N8N_HOST=0.0.0.0 \
  -e N8N_PORT=5678 \
  -e N8N_PROTOCOL=http \
  -e WEBHOOK_URL=http://spark-02:5678/ \
  -e N8N_SECURE_COOKIE=false \
  -e NODE_ENV=production \
  -v n8n_data:/home/node/.n8n \
  --add-host=host.docker.internal:host-gateway \
  n8nio/n8n:latest
| Service | Port | How it runs | Backend / api_base | Auth secret |
|---|---|---|---|---|
| LiteLLM | 8001 | systemd (same structure as spark-01) | http://YOUR_NODE1_DAC_IP:8000/v1 (DAC link) | Client master key at ~/.sparky02-litellm-key (chmod 600) |
| Open WebUI | 8080 | docker run --restart unless-stopped | OPENAI_API_BASE_URL=http://host.docker.internal:8001/v1 | Client master key (or virtual key) |
| n8n | 5678 | docker run --restart unless-stopped | WEBHOOK_URL=http://spark-02:5678/ | n8n owner account (set on first login) |
All three services are reachable from the client's tailnet via tag:hjp-ai on TCP 8001 / 8080 / 5678 (see the ACL grants in Step 06). They are not reachable from your tailnet — split-trust by construction.
Store the client master key at ~/.sparky02-litellm-key with chmod 600. Reference it from the LiteLLM systemd unit via EnvironmentFile= rather than embedding it in litellm-config.yaml, so the config file on disk never contains the secret.
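A minimal sketch of that wiring — it relies on LiteLLM's documented os.environ/ reference syntax for config values; the drop-in filename envfile.conf is arbitrary:

bash
# Key file, readable only by the service user
echo "LITELLM_MASTER_KEY=sk-client-REPLACE_ME" > ~/.sparky02-litellm-key
chmod 600 ~/.sparky02-litellm-key

# systemd drop-in that loads it into the service environment
sudo mkdir -p /etc/systemd/system/litellm.service.d
sudo tee /etc/systemd/system/litellm.service.d/envfile.conf << 'EOF'
[Service]
EnvironmentFile=/home/YOUR_USERNAME/.sparky02-litellm-key
EOF

# litellm-config.yaml then references the variable instead of the literal key:
#   general_settings:
#     master_key: os.environ/LITELLM_MASTER_KEY
sudo systemctl daemon-reload && sudo systemctl restart litellm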
STEP 06

Tailscale (both nodes, separate tailnets)

Each node joins its owner's tailnet independently. Two separate tailnets, two separate ACL policies, two separate sets of users. The DAC link (198.51.100.0/30) is private physical hardware between the two nodes — it is not advertised onto either tailnet, and it is not used for any cross-tailnet routing.

Security-critical: vLLM on spark-01:8000 has no authentication. Once Tailscale is configured, ensure your Tailscale ACLs do not expose port 8000 to the client's tailnet (and the client's ACLs do not expose your node's 8000 port to anyone either). The client must only reach spark-02:8001 (their own LiteLLM). If they can reach spark-01:8000 directly, they bypass their LiteLLM entirely and have unauthenticated inference access — which also means no key-scoped logging, no rate limit, and no audit trail.

spark-01 — your tailnet

#### spark-01 — install + join

bash
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --hostname=spark-01 --advertise-tags=tag:owner
tailscale ip -4   # note this address — your apps will be reachable here

In your tailnet's ACL policy (Tailscale admin console), expose your Open WebUI, your LiteLLM, and your other apps only to your own users. Example ACL fragment:

json
{
  "acls": [
    { "action": "accept",
      "src":    ["group:owner-users"],
      "dst":    ["tag:owner:8080", "tag:owner:8001", "tag:owner:5678", "tag:owner:8787"]
    },
    { "action": "accept",
      "src":    ["group:owner-users"],
      "dst":    ["tag:owner:22"]
    }
  ],
  "tagOwners": {
    "tag:owner": ["YOU@example.com"]
  },
  "groups": {
    "group:owner-users": ["YOU@example.com"]
  }
}

Do NOT add tag:owner:8000 to any allow rule. Port 8000 (vLLM) is unauthenticated and must remain reachable only from localhost (your LiteLLM in Step 02) and the DAC IP 198.51.100.1 (the client's LiteLLM in Step 03).

spark-02 — client's tailnet

#### spark-02 — install + join (client's auth key)

bash
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up \
  --authkey=<client-auth-key> \
  --hostname=spark-02 \
  --advertise-tags=tag:hjp-ai

tailscale ip -4   # client's apps will be reachable here, only on their tailnet

The client's tailnet ACLs are theirs to author. Mirror the structure above with their own users and tags. The client should expose :8080 (their Open WebUI), :8001 (their LiteLLM, only if they want programmatic access from elsewhere), and :5678 (their n8n).

Disable key expiry for spark-02 in the client's Tailscale admin (Machines → spark-02 → ⋯ → Disable key expiry). Server nodes shouldn't drop off the tailnet on a 90-day timer; auth key rotation should be a deliberate action.

#### Client tailnet ACL — replace the default allow-all

The default Tailscale policy allows everything between everyone. Replace it with this grants-based policy. It leaves all non-server devices unrestricted (so the client's existing fleet is untouched), keeps tag:hjp-dell-server open (their existing Dell server stays as-is), and restricts tag:hjp-ai (this node) to only the three service ports — 8001 (LiteLLM), 8080 (Open WebUI), 5678 (n8n).

json
{
  "grants": [
    {
      "src": ["*"],
      "dst": ["autogroup:member"],
      "ip": ["*"]
    },
    {
      "src": ["*"],
      "dst": ["tag:hjp-dell-server"],
      "ip": ["*"]
    },
    {
      "src": ["*"],
      "dst": ["tag:hjp-ai"],
      "ip": ["tcp:8001", "tcp:8080", "tcp:5678"]
    }
  ],
  "tagOwners": {
    "tag:hjp-ai":         ["autogroup:admin"],
    "tag:hjp-dell-server":["autogroup:admin"]
  }
}
After applying, from a client device on the same tailnet: nc -zv spark-02 8001, nc -zv spark-02 8080, and nc -zv spark-02 5678 all succeed. nc -zv spark-02 22 (SSH) and nc -zv spark-02 8000 (raw vLLM) both fail — proving the ACL is in effect.
tag:hjp-ai is intentionally not given tcp:8000. Port 8000 is the raw, unauthenticated vLLM head on spark-01 reached over the DAC; it must never be exposed onto the client's tailnet. The client only ever calls their own LiteLLM on spark-02:8001.

Lock down host firewalls

Tailscale ACLs are policy; the host firewall is enforcement. Apply ufw rules on both nodes so that even if Tailscale is misconfigured, ports 8000 and 8001 cannot leak to the wrong network.

#### spark-01 — host firewall

bash
# Allow from your tailnet (interface tailscale0) and DAC peer only
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow in on tailscale0 to any port 8001 proto tcp           # your LiteLLM
sudo ufw allow in on tailscale0 to any port 8080 proto tcp           # your Open WebUI
sudo ufw allow in on tailscale0 to any port 5678 proto tcp           # your n8n
sudo ufw allow in on tailscale0 to any port 8787 proto tcp           # your Hermes WebUI
sudo ufw allow in on enp1s0f0np0 from YOUR_NODE2_DAC_IP to any port 8000 proto tcp   # client LiteLLM → vLLM only
sudo ufw allow in on enp1s0f0np0 from YOUR_NODE2_DAC_IP to any port 6379 proto tcp   # Ray GCS over DAC
sudo ufw allow ssh                                                   # mgmt LAN ssh
sudo ufw enable

#### spark-02 — host firewall

bash
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow in on tailscale0 to any port 8001 proto tcp           # client LiteLLM
sudo ufw allow in on tailscale0 to any port 8080 proto tcp           # client Open WebUI
sudo ufw allow in on tailscale0 to any port 5678 proto tcp           # client n8n
sudo ufw allow in on enp1s0f0np0 from YOUR_NODE1_DAC_IP                  # NCCL/Ray over DAC
sudo ufw allow ssh
sudo ufw enable
From the client's tailnet, nc -zv spark-01-tailscale-ip 8000 should fail with "connection refused" or "filtered". From the client's tailnet, curl http://spark-02-tailscale-ip:8001/v1/models -H "Authorization: Bearer CLIENT_KEY" should succeed.
STEP 07

Your Hermes Agent on spark-01

Hermes is your autonomous agent layer (skills, memory, cron, gateways). It runs on your node and talks to your LiteLLM. The client does not get a Hermes — they have their own Open WebUI and n8n on spark-02; if they want an agent they install their own.

#### spark-01 — install

bash
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
hermes --version

#### spark-01 — setup wizard

bash
hermes setup
| Prompt | Answer |
|---|---|
| Setup type | Full setup |
| Provider | Custom endpoint |
| API base URL | http://localhost:8001/v1 |
| API key | your master_key |
| Model | qwen3.5-122b |
| Terminal backend | Local |
| Session reset mode | Inactivity + daily reset |
| Search provider | Firecrawl Self-Hosted (or skip) |
| Launch chat now? | n |

The wizard writes ~/.hermes/config.yaml. base_url is local — Hermes and your LiteLLM both live on spark-01.

Telegram gateway

Create a bot via @BotFather (/newbot, copy the token) and get your user ID from @userinfobot. Then:

bash
hermes setup gateway   # select Telegram, paste token, paste user ID, choose System service

sudo /home/YOUR_USERNAME/.local/bin/hermes gateway install --system
sudo systemctl start hermes-gateway
sudo systemctl status hermes-gateway --no-pager

Hermes WebUI (Docker, spark-01)

#### spark-01 — docker-compose for hermes-webui

bash
cat > ~/sparky-ai-stack/hermes-webui.yml << 'EOF'
services:
  hermes-webui:
    image: ghcr.io/nesquena/hermes-webui:latest
    container_name: hermes-webui
    restart: unless-stopped
    extra_hosts:
      - "host.docker.internal:host-gateway"
    environment:
      - WANTED_UID=1000
      - WANTED_GID=1000
      - HERMES_WEBUI_STATE_DIR=/home/hermeswebui/.hermes/webui-mvp
    volumes:
      - /home/YOUR_USERNAME/.hermes:/home/hermeswebui/.hermes
      - /home/YOUR_USERNAME/workspace:/workspace
    ports:
      - "8787:8787"
EOF

mkdir -p ~/workspace
docker compose -f ~/sparky-ai-stack/hermes-webui.yml up -d
Visit http://localhost:8787 on spark-01 (or your tailnet hostname).
STEP 08

Your n8n on spark-01

Your single-instance n8n on spark-01. The client runs their own n8n on spark-02 independently — different workflows, different credentials, different persisted state. Neither side can read the other's flows.

#### spark-01 — docker-compose for n8n

bash
cat > ~/sparky-ai-stack/n8n.yml << 'EOF'
services:
  n8n:
    image: n8nio/n8n:latest
    container_name: n8n
    restart: unless-stopped
    ports:
      - "5678:5678"
    environment:
      - N8N_HOST=0.0.0.0
      - N8N_PORT=5678
      - N8N_PROTOCOL=http
      - WEBHOOK_URL=http://YOUR_TAILNET_HOSTNAME:5678/
      - N8N_SECURE_COOKIE=false
      - NODE_ENV=production
      - GENERIC_TIMEZONE=America/Los_Angeles
    volumes:
      - n8n_data:/home/node/.n8n
    extra_hosts:
      - "host.docker.internal:host-gateway"

volumes:
  n8n_data:
EOF

docker compose -f ~/sparky-ai-stack/n8n.yml up -d
Visit http://localhost:5678 on spark-01 (or your tailnet hostname) and create the owner account.

Wire your n8n to your LiteLLM

In n8n, add an OpenAI credential pointing at your LiteLLM (local on spark-01):

| Field | Value |
|---|---|
| API URL | http://host.docker.internal:8001/v1 |
| API Key | your master_key |
| Default model | qwen3.5-122b |
The client's n8n on spark-02 follows the same pattern but points at their LiteLLM (http://host.docker.internal:8001/v1 from inside their n8n container) with the client's master key. Step 03 covers the client's LiteLLM; the client deploys their n8n the same way you deploy yours.
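To confirm the credential path before building flows, a quick check from inside the n8n container — a sketch that assumes the image's BusyBox wget and the host.docker.internal mapping added via extra_hosts above:

bash
docker exec n8n wget -qO- \
  --header "Authorization: Bearer YOUR_MASTER_KEY" \
  http://host.docker.internal:8001/v1/models   # expect the qwen3.5-122b model entry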
CHECKLIST

Stack Validation Checklist

Both sides have to pass independently, and each side has to fail in the right places (you should not be able to reach the client's stack, and vice versa). Every item is labeled with the node it should be run from.

Shared compute pool

  • spark-01: docker exec vllm-qwen-122b ray status shows 2 nodes and 2.0/2.0 GPU, with both DAC IPs (YOUR_NODE1_DAC_IP and YOUR_NODE2_DAC_IP) listed.
  • spark-01: nvidia-smi Processes section shows a RayWorkerWrapper at ~96 GB (122B GPTQ-Int4 at --gpu-memory-utilization 0.80). The --query-gpu=memory.used / memory.total fields return [N/A] on GB10 — expected, use the Processes section instead.
  • spark-02: nvidia-smi Processes section shows a RayWorkerWrapper at ~96 GB.
  • spark-01: curl http://localhost:8000/v1/models returns qwen3.5-122b (vLLM head).

Your side (spark-01)

  • spark-01: curl http://localhost:8001/v1/models -H "Authorization: Bearer YOUR_MASTER_KEY" returns qwen3.5-122b through your LiteLLM.
  • spark-01: sudo systemctl status litellm hermes-gateway — both active (running).
  • spark-01: docker ps shows vllm-qwen-122b, open-webui, hermes-webui, and n8n all Up.
  • spark-01: Open http://localhost:8080 (or your tailnet hostname), send a chat message — completion streams back. Your LiteLLM log records the request.
  • spark-01: Send a Telegram message — Hermes responds. End-to-end your-side path: Telegram → Hermes → your LiteLLM → vLLM TP=2.

Client side (spark-02)

  • spark-02: curl http://localhost:8001/v1/models -H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY" returns qwen3.5-122b through the client's LiteLLM.
  • spark-02: curl http://localhost:8001/v1/chat/completions -H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY" ... returns a completion.
  • spark-02: sudo systemctl status litellm shows active (running). Logs are written to ~/sparky-ai-stack/logs/litellm.log on spark-02, separate from your logs.
  • spark-02: docker ps shows vllm-ray-worker, open-webui, and n8n all Up. (No Hermes container.)
  • spark-02: Open http://localhost:8080 (or the client's tailnet hostname), send a chat message — completion streams back. Client's LiteLLM log records the request; your LiteLLM log on spark-01 does not.

Cross-stack isolation (the negative tests)

  • From client's tailnet: curl http://YOUR_NODE1_TAILSCALE_IP:8001/v1/models should fail (timeout or "no route to host"). Your LiteLLM is not exposed to the client's tailnet.
  • From client's tailnet: curl http://YOUR_NODE1_TAILSCALE_IP:8000/v1/models should fail. The unauthenticated vLLM endpoint is not reachable from the client's tailnet.
  • From client's tailnet: curl http://YOUR_NODE1_TAILSCALE_IP:8080 should fail. Your Open WebUI is not reachable from the client's tailnet.
  • From your tailnet: curl http://YOUR_NODE2_TAILSCALE_IP:8080 should fail. The client's Open WebUI is not reachable from your tailnet.
  • spark-01: tail -f ~/sparky-ai-stack/logs/litellm.log while the client sends a chat message → your log is silent. Client's traffic does not enter your stack.

DAC traffic during inference

  • spark-02: nload enp1s0f0np0 spikes into Gb/s during decoding from either side — both your and the client's inference requests traverse the DAC (yours for NCCL collectives, theirs for both the API call to 198.51.100.1:8000 and NCCL).
  • Both sides simultaneously: have you and the client send a long prompt at the same time. Both completions should stream concurrently — vLLM's continuous batching handles the overlap.

Config validation

  • spark-01 (your Hermes config): YAML validation passes — a parse error will silently break Hermes by falling back to .env:
    bash
    python3 -c "import yaml; yaml.safe_load(open('/home/YOUR_USERNAME/.hermes/config.yaml')); print('YAML valid')"
    Expected output: YAML valid
REF

Port reference

spark-01 (your node)

| Port | Service | Exposure |
|---|---|---|
| 8000 | vLLM head | Shared compute — localhost + DAC peer only · NO TAILSCALE |
| 6379 | Ray GCS | DAC peer only |
| 8001 | Your LiteLLM | Your tailnet only · your master_key |
| 8080 | Your Open WebUI | Your tailnet only |
| 5678 | Your n8n | Your tailnet only |
| 8787 | Your Hermes WebUI | Your tailnet only |
| 8265 | Ray dashboard | Optional · localhost only |

spark-02 (client node)

| Port | Service | Exposure |
|---|---|---|
| — | vLLM worker | No API listener — Ray join over DAC only |
| 8001 | Client LiteLLM | Client's tailnet only · client's master_key |
| 8080 | Client Open WebUI | Client's tailnet only |
| 5678 | Client n8n | Client's tailnet only |
Port 8000 on spark-01 must NEVER be advertised to either tailnet. The client's only path to it is through their own LiteLLM, which calls it over the DAC. If port 8000 leaks onto a tailnet, the client (or anyone else on that tailnet) gets unauthenticated inference access. The host firewall rules in Step 06 enforce this.
REF

File locations

spark-01 — your node

~/sparky-ai-stack/
├── Dockerfile.vllm-spark # custom image: NGC vllm + ray
├── litellm-config.yaml # YOUR LiteLLM — your master_key, localhost:8000
├── hermes-webui.yml # compose for your hermes-webui
├── n8n.yml # compose for your n8n
├── scripts/
│   └── vllm-head.sh # Ray head + vllm serve TP=2
└── logs/
    ├── litellm.db # YOUR SQLite request log
    └── litellm.log # YOUR text log

~/.hermes/ # YOUR Hermes config, memory, skills
~/workspace/ # YOUR Hermes file workspace

Docker volumes:
├── open-webui # YOUR Open WebUI database, knowledge bases, RAG
└── n8n_data # YOUR n8n flows + credentials

/etc/systemd/system/
├── litellm.service # YOUR LiteLLM auto-start
├── litellm.service.d/override.conf # PYTHONPATH only
└── hermes-gateway.service # YOUR Telegram gateway auto-start

spark-02 — client node (separate ownership)

~/sparky-ai-stack/
├── Dockerfile.vllm-spark # identical custom image as spark-01
├── litellm-config.yaml # CLIENT LiteLLM — client's master_key, points DAC → vLLM
├── n8n.yml # compose for client's n8n
├── scripts/
│   └── vllm-worker.sh # Ray worker join
└── logs/
    ├── litellm.db # CLIENT SQLite log — separate corpus
    └── litellm.log # CLIENT text log

Docker volumes:
├── open-webui # CLIENT Open WebUI database, knowledge bases, RAG
└── n8n_data # CLIENT n8n flows + credentials

/etc/systemd/system/
└── litellm.service # CLIENT LiteLLM auto-start

Backup targets

Node / Owner · Path · Contents
spark-01 — you · ~/sparky-ai-stack/logs/ · Your LiteLLM corpus
spark-01 — you · ~/.hermes/ · Your Hermes memory, sessions, skills
spark-01 — you · Docker volumes open-webui, n8n_data · Your UI state — chat history, knowledge bases, flows
spark-02 — client · ~/sparky-ai-stack/logs/ · Client's LiteLLM corpus (their backup, not yours)
spark-02 — client · Docker volumes open-webui, n8n_data · Client's UI state (their backup, not yours)
Backups are owner-specific. Don't pull the client's volumes into your backup pipeline — that defeats the application-layer isolation. Each owner backs up their own node's state.
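A minimal owner-side backup sketch for spark-01, assuming a destination path of your choosing and the volume names listed above; the client would run the equivalent on spark-02 for their own state:

bash
#!/bin/bash
set -e
DEST=/mnt/backup/spark-01/$(date +%F)   # assumption: adjust to your backup target
mkdir -p "$DEST"

# Plain directories: LiteLLM corpus and Hermes state
rsync -a ~/sparky-ai-stack/logs/ "$DEST/litellm-logs/"
rsync -a ~/.hermes/              "$DEST/hermes/"

# Docker volumes: archive through a throwaway container so ownership survives
for vol in open-webui n8n_data; do
  docker run --rm -v "${vol}:/data:ro" -v "$DEST:/backup" alpine \
    tar czf "/backup/${vol}.tar.gz" -C /data .
done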
REF

Cluster issues

Two-node-specific failures and their resolutions. Most of these were discovered during live deployment.

ray: not found inside the vLLM container
Symptom: Container starts but the entrypoint shell can't find ray. ray start errors with "command not found".
Cause: The NGC base image nvcr.io/nvidia/vllm:26.04-py3 ships without Ray. vLLM 0.19.0+nv26.04 hard-requires Ray for any multi-node inference — torch.distributed.run is not a substitute, because vLLM validates Ray at engine init regardless of launch method.
Fix: Build the custom vllm-spark:26.04 image on both nodes (Step 1a) before any cluster launch attempt: FROM nvcr.io/nvidia/vllm:26.04-py3 + RUN pip install ray --quiet.
Placement group allocation failed / node:192.0.2.x not found
Symptom: vLLM logs show an indefinite placement-group allocation failure, or a Ray placement spec referencing a node IP Ray doesn't recognize. The cluster has 2 GPUs registered, but vLLM only sees one.
Cause: VLLM_HOST_IP is unset, so vLLM resolves the node's hostname to the mgmt IP (192.0.2.x) — but Ray registered each node on its DAC IP (198.51.100.x). The placement group spec then targets a node Ray has never seen.
Fix: Set VLLM_HOST_IP to each container's own DAC IP: on spark-01's head container, VLLM_HOST_IP=YOUR_NODE1_DAC_IP; on spark-02's worker container, VLLM_HOST_IP=YOUR_NODE2_DAC_IP. The Step 01 startup scripts set this for you.
"Tensor parallel size exceeds available GPUs (1)" warning
Symptom: vLLM head logs print a warning that TP=2 exceeds the locally visible GPU count (1) before Ray finishes registering the second node.
Cause: Expected and harmless on a 2-node, 1-GPU-per-node setup. vLLM spreads TP workers across nodes via Ray. The warning fires before Ray confirms the second node's GPU is part of the cluster — not an error if the worker is up and joining.
Fix: Ignore. Confirm ray status reports 2.0/2.0 GPU after the worker container's join completes. The Step 01 head script blocks vllm serve behind this check, so under normal operation you won't see it.
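The shape of that gate, for reference — a sketch of the idea rather than the exact Step 01 script:
bash
# Poll ray status from the host until both GPUs are registered, then start serving.
until docker exec vllm-qwen-122b ray status 2>/dev/null | grep -q "2.0/2.0 GPU"; do
  echo "waiting for the second node's GPU to register..."
  sleep 5
done
echo "cluster complete — safe to launch vllm serve"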
Worker container starts before head — retry behavior is fine
Symptom: docker logs vllm-ray-worker on spark-02 prints "head not yet reachable, retrying in 3s..." in a loop while the head is still loading.
Cause: The worker connects to the head's GCS at YOUR_NODE1_DAC_IP:6379. If the head isn't up yet, the worker retries forever. This is the expected path — the Step 01 launch order is "worker first, then head" precisely so that the worker is already retrying when the head comes up.
Fix: Leave the worker running. As soon as the head's Ray master finishes initializing, the worker will join the cluster and the retries will stop.
Worker can't reach head — Ray GCS connection refused
Symptom: Worker logs print "ConnectionError: ... 198.51.100.1:6379" continuously and never advance. The retry loop above never resolves.
Cause: Either (a) the head container hasn't started, (b) VLLM_HOST_IP on the head was set to a non-DAC IP so Ray bound to the wrong interface, or (c) a firewall blocks port 6379 over the DAC.
Fix: On spark-01: docker exec vllm-qwen-122b ray status — if Ray isn't running, restart the container. Confirm VLLM_HOST_IP=YOUR_NODE1_DAC_IP. From spark-02: nc -zv YOUR_NODE1_DAC_IP 6379 should connect; if not, check firewall rules on the DAC interface.
NCCL traffic on the wrong interface
Symptom: Tensor-parallel inference works but is slow. nload enp1s0f0np0 stays idle; mgmt LAN sees Gb/s spikes instead.
Cause: NCCL_SOCKET_IFNAME didn't propagate into the container, so NCCL fell back to interface auto-detect and chose mgmt over DAC.
Fix: Set both NCCL_SOCKET_IFNAME=enp1s0f0np0 and GLOO_SOCKET_IFNAME=enp1s0f0np0 as -e flags on both containers (the Step 01 scripts do this). Also confirm the iface name is identical on both nodes — if one node has it as enp1s0f1np0, NCCL will fail.
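A quick way to confirm the pins actually landed inside both containers (container names as used elsewhere in this guide):
bash
# Expect each node's own DAC IP plus enp1s0f0np0 on both containers.
docker exec vllm-qwen-122b  printenv VLLM_HOST_IP NCCL_SOCKET_IFNAME GLOO_SOCKET_IFNAME   # spark-01
docker exec vllm-ray-worker printenv VLLM_HOST_IP NCCL_SOCKET_IFNAME GLOO_SOCKET_IFNAME   # spark-02
# A missing value means the -e flag never made it onto that docker run — fix and recreate.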
nvidia-smi memory.used returns [N/A] on GB10
Symptom: nvidia-smi --query-gpu=memory.used,memory.total --format=csv returns [N/A] in both fields. Memory monitoring scripts that depend on these fields silently report nothing.
Cause: GB10 (Grace Blackwell) uses unified memory. The classic memory.used / memory.total NVML fields are not populated on this hardware.
Fix: Use plain nvidia-smi and read the Processes section. With the 122B GPTQ-Int4 model loaded at --gpu-memory-utilization 0.80, you should see a RayWorkerWrapper process at roughly 96 GB on each node (≈34 GB weights + KV cache). The 35B FP8 bootstrap model lands closer to ~97 GB.
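A one-liner that monitoring scripts can use instead of the broken query fields — a sketch that just greps the process table:
bash
# GB10 workaround: read per-process memory from the plain process table.
nvidia-smi | grep RayWorkerWrapper || echo "no RayWorkerWrapper visible yet"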
SSH by hostname hits the DAC instead of mgmt
Symptom: ssh spark-02 from spark-01 hangs or returns "connection refused", even though both nodes are reachable.
Cause: By default, each node's hostname resolves to its DAC IP (198.51.100.x). Unless you've explicitly bound sshd to the DAC interface, SSH only listens on the mgmt LAN — but your client has resolved the hostname to the DAC IP.
Fix: Add mgmt-IP entries to /etc/hosts on both nodes (this is in Prerequisites): echo "YOUR_NODE1_MGMT_IP spark-01" | sudo tee -a /etc/hosts and likewise for spark-02. The HF cache rsync in Step 01b uses the DAC IP explicitly, but every other ssh spark-0X command relies on hostname resolution.
docker: permission denied on spark-02
Symptom: Any docker command on spark-02 fails with permission denied while trying to connect to the Docker daemon socket. Common on a freshly-imaged worker node where docker installed cleanly but the operator account isn't in the docker group yet.
Fix: Add the operator account to the docker group. On a fresh spark-02: sudo usermod -aG docker cameron && newgrp docker (substitute your actual username). The new group only takes effect after re-logging or running newgrp docker in the current shell. The Step 01 worker script will not run until this is done.
LiteLLM httpx.ConnectError on startup
Symptom: LiteLLM systemd service won't start. Logs show httpx.ConnectError against either Postgres or a model endpoint.
Cause: Two common causes: (1) a stale STORE_MODEL_IN_DB=True + DATABASE_URL in /etc/systemd/system/litellm.service.d/override.conf still pointing at a Postgres that no longer runs, or (2) litellm-config.yaml includes a model with api_base pointing at a dead port (e.g. localhost:8002 from a previous dual-model setup).
Fix: Reset the override.conf to contain only PYTHONPATH (see Step 02). Audit litellm-config.yaml for any localhost:8002 or other dead endpoints and remove them — every model in model_list must have a live api_base.
vLLM CUDA compile cache error on reboot
Symptom: RuntimeError: CUDA driver error: operation not permitted
Fix: docker exec vllm-qwen-122b rm -rf /tmp/torchinductor_root /root/.cache/vllm/torch_compile_cache/torch_aot_compile && docker restart vllm-qwen-122b
Client LiteLLM can't reach vLLM
Symptom: Client LiteLLM on spark-02 returns connection refused or connect timeout on every request. ~/sparky-ai-stack/logs/litellm.log on spark-02 shows httpx.ConnectError against 198.51.100.1:8000.
Cause: Either (a) the DAC link is down, (b) the vLLM head container on spark-01 is not running, or (c) the host firewall on spark-01 is blocking the DAC peer from reaching port 8000.
Fix: From spark-02: ping -c 3 YOUR_NODE1_DAC_IP (DAC reachable?), then nc -zv YOUR_NODE1_DAC_IP 8000. From spark-01: docker ps | grep vllm-qwen-122b (running?) and sudo ufw status | grep 8000. The Step 06 ufw rule should explicitly allow YOUR_NODE2_DAC_IP on port 8000 over enp1s0f0np0.
Client sees spark-01:8000 directly without auth
Symptom: From the client's tailnet, curl http://YOUR_NODE1_TAILSCALE_IP:8000/v1/models succeeds — meaning the client could bypass their LiteLLM and hit the unauthenticated vLLM directly.
Cause: vLLM has no auth. Either (a) Tailscale ACLs on spark-01 inadvertently expose port 8000, or (b) the host firewall on spark-01 is allowing inbound on tailscale0 for port 8000, or (c) ufw is disabled.
Fix: Audit the ACL JSON on spark-01's tailnet — there must be no tag:owner:8000 entry anywhere. Then on spark-01: sudo ufw status verbose — no rule should allow port 8000 on tailscale0. Only the explicit DAC-peer rule (enp1s0f0np0 from YOUR_NODE2_DAC_IP to any port 8000) should appear. If you find a leak, fix the ACL and ufw, then re-run the negative tests in the validation checklist.
Both LiteLLMs exposed on port 8001 over the mgmt LAN
Symptom: curl http://YOUR_NODE2_MGMT_IP:8001/v1/models from spark-01 reaches the client's LiteLLM (or vice versa) — leaking each side's API surface onto the shared mgmt LAN.
Cause: By design, both LiteLLMs bind to 0.0.0.0:8001 so each is reachable from its own tailnet. Your firewall must restrict inbound on the mgmt-LAN interface to deny port 8001 cross-node — Tailscale ACLs alone don't help here because the mgmt LAN is not part of any tailnet.
Fix: On both nodes, deny port 8001 inbound on the mgmt-LAN interface explicitly. The Step 06 ufw rules already restrict 8001 to tailscale0 only — confirm with sudo ufw status verbose that no rule allows 8001 on the wildcard interface or on the mgmt iface.
Client can see your LiteLLM logs (or vice versa)
Symptom: You find an entry in ~/sparky-ai-stack/logs/litellm.log on spark-01 that you didn't make. Or the client reports a request in their log that they didn't send.
Cause: Almost always: someone reused a master key across the two LiteLLMs, or the Open WebUI on one side is misconfigured to point at the other side's LiteLLM.
Fix: Confirm master_key values are unique on each node — they should never match. Then check each Open WebUI's API base URL: yours should be http://host.docker.internal:8001/v1 (your local LiteLLM); the client's should be the same hostname (their local LiteLLM, not your tailnet IP). If you find cross-pointing, fix it and rotate both master keys.
235B FP8 model OOM on the 256 GB cluster
Symptom: Loading Qwen/Qwen3-235B-A22B-FP8 at TP=2 fails. Ray kills the worker with an OutOfMemoryError at the ~95% memory threshold during weight loading; the head logs report a placement-group failure shortly after.
Cause: 235B at FP8 is ~235 GB of weights total. Split across TP=2 that's ~117.5 GB per node — and only ~121 GB is usable per node, leaving no room for KV cache or OS overhead.
Fix: Use a GPTQ-Int4 quantization instead. A 235B INT4 model is ~117 GB total (~58 GB per node) and fits with comfortable headroom. Or use a smaller model — the 122B GPTQ-Int4 above is the recommended production choice.
AssertionError: block_size (2096) must be <= max_num_batched_tokens (2048)
Symptom: vLLM head exits during engine init on a Qwen MoE model with the assertion above.
Cause: vLLM's Mamba cache align mode auto-enables for MoE models when --enable-prefix-caching is on, which sets block_size=2096. The default max_num_batched_tokens=2048 is one block too small.
Fix: Add --max-num-batched-tokens 4096 to the vllm serve command (the Step 01 head script does this). 4096 is the smallest power of two that satisfies the assertion with headroom.
GPTQ-Marlin first-run takes 10–20 minutes with no log output
Symptom: On a cold cache with the 122B GPTQ-Int4 model, docker logs -f vllm-qwen-122b appears to stall after a line ending in ray_env.py:111. No new output for many minutes. The cluster looks frozen.
Cause: GPTQ-Marlin requires torch.compile JIT compilation on first launch. The compile happens silently inside the RayWorkerWrapper process after weight loading completes — vLLM doesn't surface a progress line for it.
Fix: Wait it out — first launch typically takes 10–20 minutes for the compile. Verify progress with nvidia-smi on each node: both should show a RayWorkerWrapper at roughly 34 GB during compile, growing toward ~96 GB once the KV cache initializes. The compiled artifacts are cached at /root/.cache/vllm/torch_compile_cache/, so subsequent startups skip this step entirely.
huggingface-cli is deprecated
Symptom: Running huggingface-cli download … prints a deprecation notice, or the command is missing on a fresh install of huggingface_hub.
Fix: Use the new CLI: hf download <model> --local-dir <path>. Step 01 has been updated to use hf; if you have an older snippet around, swap the binary name — the flags map cleanly.
permission denied when downloading to the HF cache
Symptom: hf download (or huggingface-cli download) fails with PermissionError: [Errno 13] Permission denied on a path under ~/.cache/huggingface/hub/.
Cause: A previous download was run with sudo or as root, which left the target directory root-owned. The current shell (running as your normal, non-root user) cannot write into it.
Fix: sudo rm -rf <target_dir>, then recreate the directory and rerun the hf download command as your normal user (not root). All hf download commands in this guide assume the non-root user.
Model "thinks" out loud / verbose deliberation in Open WebUI
Symptom: Conversational queries cause the model to produce long deliberative preambles ("Let me think about this..." or visible chain-of-thought). Latency climbs and token usage balloons. Sometimes the model loops in its own reasoning trace.
Cause: The chat template was rendered with preserve_thinking: true (or any other flag that exposes the visible CoT track). The Qwen3.5 family produces extended reasoning whenever thinking is enabled at template time, and conversational queries trigger excessive deliberation under that setting.
Fix: Set --default-chat-template-kwargs '{"enable_thinking": false}' on the vllm serve command (the Step 01 head script does this). Users can opt into extended reasoning per-message by prefixing a prompt with /think. This matches the behavior the recommended Open WebUI system prompt (Step 05) is calibrated for.
LiteLLM UI shows: table public.LiteLLM_UserTable does not exist
Symptom: The LiteLLM Admin UI loads but every page (Users, Virtual Keys, Models) returns an error referencing a missing LiteLLM_* table in the public schema.
Cause: The Postgres database is up and the connection succeeded, but the Prisma schema has not been pushed yet — the database is empty.
Fix: Run prisma db push against the LiteLLM Prisma schema:
DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" prisma db push --schema /home/YOUR_USERNAME/.local/lib/python3.12/site-packages/litellm/proxy/schema.prisma — then restart LiteLLM. See Step 2d / 3d.
LiteLLM UI shows: Not connected to DB
Symptom: The Admin UI banner reports "Not connected to DB" even though Postgres is running and reachable on localhost:5432.
Cause: Either (a) database_url is at the top level of litellm-config.yaml instead of nested under general_settings — LiteLLM only reads it from general_settings, or (b) the config still points at SQLite. SQLite is not supported for the UI; the Prisma schema is hardcoded for PostgreSQL.
Fix: Move the entry under general_settings:
general_settings:
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"

and restart LiteLLM. Confirm with journalctl -u litellm -n 50 --no-pager — startup should report a successful Postgres connection.
Virtual key not found in LiteLLM_VerificationTokenTable
Symptom: A virtual key generated from the UI is rejected on use with a VerificationToken not found error, but the key is still visible in the Virtual Keys tab.
Cause: The key was generated against the database before the schema was fully initialized (e.g. you generated a key, then re-ran prisma db push, which dropped/recreated the verification token table).
Fix: Delete the orphaned key in the UI, restart LiteLLM (sudo systemctl restart litellm), then generate a fresh key. The new key will be inserted into the current schema and will validate correctly.
REF

Other known issues

LiteLLM port already in use
Symptom: systemd shows [Errno 98] address already in use
Fix: pkill -f litellm && sudo systemctl restart litellm
hermes: command not found
Symptom: Hermes installed but shell can't find it
Fix: source ~/.bashrc or use /home/YOUR_USERNAME/.local/bin/hermes
Hermes gateway systemd install fails
Symptom: sudo: hermes: command not found
Fix: sudo /home/YOUR_USERNAME/.local/bin/hermes gateway install --system
MCP stdio servers fail health check — npx not in LiteLLM's PATH
Symptom: Adding a stdio-based MCP server (e.g. Brave Search) via the LiteLLM UI succeeds but the health check after saving fails. No error is shown — the check just doesn't pass.
Cause: Node.js was installed via nvm or a user-local installer, which places binaries in ~/.local/bin and adds that path to the user's shell rc file (.bashrc). LiteLLM — running as a systemd service — never sources .bashrc, so it gets a clean environment with no ~/.local/bin in PATH and cannot find npx when it tries to spawn the stdio MCP server process.
Fix: Symlink Node and npx into a system-wide PATH location: sudo ln -s /home/YOUR_USERNAME/.local/bin/node /usr/local/bin/node and sudo ln -s /home/YOUR_USERNAME/.local/bin/npx /usr/local/bin/npx. Alternatively, install Node system-wide via NodeSource: curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash - && sudo apt install -y nodejs. After either fix, retry the health check in the LiteLLM UI — it should pass.
Signal on arm64
Symptom: signal-cli 0.14.x requires Java 25 (not in Ubuntu 24.04 apt repos). Signal servers block datacenter IPs at TLS level.
Fix: Use Telegram — no IP restrictions, native arm64 support.
config.yaml YAML error causes silent .env fallback
Symptom: After manually editing ~/.hermes/config.yaml, Hermes seems to ignore the changes — MCP is broken and the provider reverts to default. No obvious error in the logs.
Cause: A YAML syntax error causes Hermes to silently fall back to .env values. No error is shown at startup, so the misconfiguration is invisible.
Fix: Always validate YAML after editing: python3 -c "import yaml; yaml.safe_load(open('/home/YOUR_USERNAME/.hermes/config.yaml')); print('YAML valid')". If the command prints anything other than YAML valid, fix the syntax before restarting.
Hermes crashes on startup — MCP server connection failing
Symptom: Hermes exits during startup with a stack trace mentioning mcp_servers or a failed MCP client handshake. Logs reference an asyncio error or a TCP/stdio connection that never opened.
Cause: This stack proxies all MCP traffic through LiteLLM (see Step 02). Hermes does not need a direct mcp_servers block in ~/.hermes/config.yaml — it talks to LiteLLM's MCP endpoint over the LiteLLM API. An mcp_servers block in Hermes config is only correct in a direct-MCP (non-LiteLLM-proxied) topology, and on this stack it points Hermes at servers it can't reach.
Fix: Remove the mcp_servers block from ~/.hermes/config.yaml entirely. Validate the YAML, then restart: systemctl --user restart hermes (or whichever supervisor you use). MCP tool calls continue to work because LiteLLM is in the path.
custom_providers indentation error at startup
Symptom: Hermes fails to start with a YAML parse error after adding a custom provider to ~/.hermes/config.yaml.
Cause: List items under custom_providers must be indented exactly 2 spaces. A common mistake is using 4 spaces or no indentation.
Fix: Use this exact indentation:
yaml
custom_providers:
  - name: MyProvider
    base_url: http://localhost:8001/v1
    model: my-model
APPENDIX

Clustered Open WebUI / n8n (HA notes)

Both Open WebUI and n8n have HA modes available, but for a two-node home/lab setup the operational complexity is not worth it. This stack runs them as single, unreplicated instances on each node. If you ever want to pursue HA, here are the pointers.

Open WebUI HA

  • Switch the open-webui container's storage from a Docker volume to a Postgres backend (env: DATABASE_URL=postgresql://...) and a shared filesystem for uploads and RAG documents.
  • Run multiple replicas behind a TCP load balancer. Sticky sessions are recommended for SSE chat streams.
  • Postgres can sit on either node; if you put it on spark-01 you'll re-introduce the very latency-disturbance pattern this architecture is designed to avoid. Prefer a third small box or a dedicated HA pair.
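The storage switch itself is small — a sketch of a single replica with a Postgres backend, assuming Open WebUI's documented DATABASE_URL variable and a shared mount for uploads; the load balancer and replica count are left to you:

bash
# Sketch only — one replica shown; repeat behind a TCP LB with sticky sessions.
docker run -d --name open-webui \
  -e DATABASE_URL="postgresql://openwebui:CHANGE_ME@pg-host:5432/openwebui" \
  -v /mnt/shared/open-webui:/app/backend/data \
  -p 8080:8080 \
  ghcr.io/open-webui/open-webui:main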

n8n HA

  • n8n's queue mode requires Postgres for state and Redis for the BullMQ queue. The main container becomes the main instance; one or more worker instances pull jobs off the queue.
  • Set EXECUTIONS_MODE=queue, QUEUE_BULL_REDIS_HOST=…, DB_TYPE=postgresdb, and the relevant Postgres env vars on every container. Replicas need the same N8N_ENCRYPTION_KEY.
  • For a two-node setup the simplest variant is one main on spark-02 and one worker on a third small box, with Postgres + Redis colocated on the third box.
  • Webhook traffic should hit only the main container; long-running executions land on workers transparently.
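Roughly what the queue-mode environment looks like on the containers — a sketch with placeholder hosts and credentials, using the env names listed above:

bash
# Shared env for main and worker — every replica needs the same N8N_ENCRYPTION_KEY.
N8N_ENV="-e EXECUTIONS_MODE=queue \
  -e QUEUE_BULL_REDIS_HOST=redis-host \
  -e DB_TYPE=postgresdb \
  -e DB_POSTGRESDB_HOST=pg-host \
  -e DB_POSTGRESDB_DATABASE=n8n \
  -e DB_POSTGRESDB_USER=n8n \
  -e DB_POSTGRESDB_PASSWORD=CHANGE_ME \
  -e N8N_ENCRYPTION_KEY=SAME_KEY_ON_EVERY_REPLICA"

# Main instance (takes webhooks) on spark-02; worker pulls jobs off the Redis queue.
docker run -d --name n8n-main   $N8N_ENV -p 5678:5678 docker.n8n.io/n8nio/n8n
docker run -d --name n8n-worker $N8N_ENV docker.n8n.io/n8nio/n8n worker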
If you genuinely need HA at this layer, the operational answer is usually "add a third box for Postgres + Redis," not "split across the two GPU nodes." Putting stateful services on the inference nodes will compromise either inference latency or HA availability — usually both.
OBSIDIAN VAULT SYNC + MCP

Architecture

Desktop and mobile clients run Obsidian with Syncthing for bidirectional vault sync. The vault syncs to a dedicated Proxmox LXC over the local network. When traveling, Tailscale bridges the connection — clients queue changes offline and sync when Tailscale is enabled on both ends.

Desktop Client (Obsidian + Syncthing) ─┐
Mobile Client  (Obsidian + Syncthing) ─┼─→ Syncthing ─→ LXC /vault
                                       ┘         ↓
                                          obsidian-mcp (stdio)
                                                 ↓
                                          supergateway (streamableHttp, port 3000)
                                                 ↓
                                          LiteLLM ([YOUR-AI-SERVER-HOSTNAME])
                                                 ↓
                                    Chat clients / API consumers

LXC setup

Setting · Value
OS · Ubuntu 24.04 LTS
Hostname · [YOUR-LXC-HOSTNAME]
IP · [YOUR-LXC-IP] on VLAN [YOUR-VLAN-ID]
Vault path · /vault with .obsidian/app.json stub (required for MCP server vault validation)
Services · syncthing@root and obsidian-mcp — both enabled as systemd services
OBSIDIAN VAULT SYNC + MCP

Setup script

A single script provisions the full stack from a fresh Ubuntu 24.04 LXC.

bash
#!/bin/bash
set -e
# ── Configuration ─────────────────────────────────────────────
SYNCTHING_USER="root"   # Change if running as non-root user
# ──────────────────────────────────────────────────────────────

# 1. Install dependencies
apt update
apt install -y curl gpg apt-transport-https

# 2. Install Node.js 22 + npm
curl -fsSL https://deb.nodesource.com/setup_22.x | bash -
apt install -y nodejs

# 3. Install Syncthing
curl -fsSL https://syncthing.net/release-key.gpg | gpg --dearmor -o /usr/share/keyrings/syncthing-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/syncthing-archive-keyring.gpg] https://apt.syncthing.net/ syncthing stable" > /etc/apt/sources.list.d/syncthing.list
apt update
apt install -y syncthing

# 4. Create vault directory with Obsidian config
mkdir -p /vault/.obsidian
echo '{}' > /vault/.obsidian/app.json

# 5. Enable and start Syncthing
systemctl enable syncthing@${SYNCTHING_USER}
systemctl start syncthing@${SYNCTHING_USER}

# 6. Wait for Syncthing config to generate
sleep 8

# 7. Expose Syncthing GUI on all interfaces
CONFIG_PATH=$(find /root -name "config.xml" 2>/dev/null | grep syncthing | head -1)
sed -i 's|<address>127.0.0.1:8384</address>|<address>0.0.0.0:8384</address>|' "$CONFIG_PATH"
systemctl restart syncthing@${SYNCTHING_USER}

# 8. Install obsidian-mcp and supergateway
npm install -g obsidian-mcp supergateway

# 9. Create obsidian-mcp systemd service
cat > /etc/systemd/system/obsidian-mcp.service << 'EOF'
[Unit]
Description=Obsidian MCP Server
After=network.target

[Service]
Type=simple
User=root
ExecStart=supergateway --stdio "obsidian-mcp /vault" --port 3000 --outputTransport streamableHttp --stateful --protocolVersion 2025-11-25
Restart=on-failure
RestartSec=10
MemoryMax=512M

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable obsidian-mcp
systemctl start obsidian-mcp

echo "Syncthing UI: http://$(hostname -I | awk '{print $1}'):8384"
echo "MCP endpoint: http://$(hostname -I | awk '{print $1}'):3000/mcp"

Manual steps after script

  1. Open Syncthing UI at http://[YOUR-LXC-IP]:8384
  2. Remove the default folder, add /vault, set a GUI password
  3. Pair each client device — add LXC device ID in client Syncthing UI, accept the incoming request on the LXC side, share the vault folder
  4. In LiteLLM UI → MCP Servers → Add: URL http://[YOUR-LXC-IP]:3000/mcp, transport http, auth None
  5. In Hermes ~/.hermes/config.yaml, add under mcp_servers: obsidian: url: http://[YOUR-LXC-IP]:3000/mcp and transport: streamableHttp
  6. Mandatory verification — run this before proceeding. If it fails, the MCP server is not reachable and the next steps will not work:
    bash
    hermes mcp test obsidian
    Expected output: ✓ Connected and ✓ Tools discovered: 11. If this fails, check that the obsidian-mcp systemd service is running on the LXC (systemctl status obsidian-mcp), that port 3000 is reachable from the DGX, and that the URL in config.yaml is correct.
  7. Update ~/.hermes/skills/note-taking/obsidian/SKILL.md with the correct MCP tool names (create the directory if it doesn't exist). The built-in Hermes obsidian skill will use filesystem commands unless this file explicitly names the MCP tools:
    bash
    mkdir -p ~/.hermes/skills/note-taking/obsidian
    cat > ~/.hermes/skills/note-taking/obsidian/SKILL.md << 'EOF'
    # Obsidian Vault Skill
    
    Always use MCP tools for all vault operations. Never use filesystem commands.
    
    ## Required: discover vault name first
    Always call `list-available-vaults` before any other tool to discover the vault name.
    Do not assume a vault name — always look it up.
    
    ## Available MCP tools
    - list-available-vaults   — discover vault name (call first, every time)
    - search-vault            — full-text search across all notes
    - read-note               — read a specific note by path
    - create-note             — create a new note
    - edit-note               — edit an existing note
    - move-note               — move or rename a note
    - delete-note             — delete a note
    - create-directory        — create a folder in the vault
    - add-tags                — add tags to a note
    - remove-tags             — remove tags from a note
    - rename-tag              — rename a tag across all notes
    EOF
  8. Add mcp to the telegram platform toolset in ~/.hermes/config.yaml and restart hermes-gateway
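Once steps 5–8 are done, a quick end-to-end recheck ties them together (commands taken from the steps above, plus the YAML-validation habit this guide already uses):

bash
python3 -c "import yaml; yaml.safe_load(open('/home/YOUR_USERNAME/.hermes/config.yaml')); print('YAML valid')"
sudo systemctl restart hermes-gateway       # or however you normally restart the gateway
hermes mcp test obsidian                    # expect: ✓ Connected, ✓ Tools discovered: 11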

Services reference

Service · Command · Port
Syncthing · systemctl status syncthing@root · 8384
Obsidian MCP · systemctl status obsidian-mcp · 3000

LiteLLM MCP configuration

Field · Value
MCP Server URL · http://[YOUR-LXC-IP]:3000/mcp
Transport · http
Auth · None
OBSIDIAN VAULT SYNC + MCP

Bugs & fixes encountered

1 — Syncthing config path changed on Ubuntu 24.04
Symptom: Script using /root/.local/share/syncthing/config.xml path fails — file not found.
Fix: On Ubuntu 24.04, config is at /root/.local/state/syncthing/config.xml, not /root/.local/share/syncthing/config.xml as documented elsewhere. The setup script uses find to locate the actual path dynamically.
2 — obsidian-mcp vault validation requires .obsidian/app.json
Symptom: Error: Not a valid Obsidian vault when starting obsidian-mcp against an empty directory.
Fix: obsidian-mcp requires a .obsidian directory with app.json present to consider a directory a valid vault; an empty directory fails. Run mkdir -p /vault/.obsidian && echo '{}' > /vault/.obsidian/app.json.
3 — SSE transport incompatible with LiteLLM v1.80.18+
Symptom: Connection closed errors when LiteLLM tries to connect to the MCP server using SSE transport.
Cause: LiteLLM's MCP client uses protocol version 2025-11-25. SSE transport is incompatible with this version.
Fix: Use --outputTransport streamableHttp in the supergateway command and set transport to http in the LiteLLM UI.
4 — supergateway stateless mode process explosion
Symptom: obsidian-mcp child processes accumulate under load, hitting pthread_create: Resource temporarily unavailable and crashing the server.
Cause: In default stateless mode, supergateway spawns a new obsidian-mcp child process per HTTP request. Under load these accumulate and exhaust thread limits.
Fix: Add the --stateful flag to keep one persistent child process instead of spawning per request.
5 — Protocol version mismatch causes SIGTERM
Symptom: LiteLLM connects, sends protocolVersion: 2025-11-25, server responds with 2024-11-05, then LiteLLM sends SIGTERM and closes the session.
Fix: Add --protocolVersion 2025-11-25 to the supergateway command so the handshake version matches what LiteLLM expects.
6 — LXC disk corruption on network-backed storage
Symptom: EXT4 I/O errors and disk corruption under write load when the LXC root disk is stored on a CIFS/SMB network share.
Fix: Use local storage or a reliable block storage backend for LXC root disks. Network-backed storage is not suitable for ext4 journaling under concurrent write load.
7 — Syncthing folder path special characters
Symptom: Syncthing folder fails to start or reports an incorrect path after pasting from a file manager.
Fix: Copying a folder path from a file manager may add escape characters. Set the Syncthing folder path manually in the config UI instead of pasting from the file manager.
8 — MCP not available in Telegram via Hermes
Symptom: Vault queries work in Hermes WebUI but MCP tools are not available in Telegram sessions.
Cause: Hermes platform_toolsets for telegram only includes hermes-telegram by default — MCP tools are not loaded for Telegram sessions.
Fix: Add mcp to the telegram list in ~/.hermes/config.yaml and restart hermes-gateway.
9 — YAML syntax error after config edit
Symptom: Hermes fails to start with a YAML parser error after editing ~/.hermes/config.yaml. Or worse — Hermes starts but silently falls back to .env values, breaking MCP and provider config with no obvious error.
Cause: Incorrect indentation or a syntax error when editing the config file. A parse failure causes Hermes to silently ignore the config and fall back to defaults.
Fix: Always validate YAML after editing:
bash
python3 -c "import yaml; yaml.safe_load(open('/home/YOUR_USERNAME/.hermes/config.yaml')); print('YAML valid')"
Expected output: YAML valid. If an exception is raised instead, fix the indentation and re-run until the validation passes before restarting any service.
15 — Hermes obsidian skill uses filesystem commands instead of MCP tools
Symptom: Vault queries sent via Telegram return filesystem errors or empty results, even though the MCP server is running and LiteLLM can reach it successfully.
Cause: The built-in Hermes obsidian skill will attempt to use filesystem commands (e.g. cat, find) to read vault files unless the SKILL.md explicitly references the correct MCP tool names. Adding the MCP server to config.yaml alone is not sufficient — the skill must also be rewritten to use the MCP tools exposed by obsidian-mcp.
Fix: Run hermes mcp test obsidian first to confirm connectivity and discover the exact tool names. Then update ~/.hermes/skills/note-taking/obsidian/SKILL.md to explicitly list the MCP tool names: list-available-vaults, search-vault, read-note, create-note, edit-note, move-note, delete-note, create-directory, add-tags, remove-tags, rename-tag. The SKILL.md must instruct Hermes to always call list-available-vaults first before any other tool call. See the setup steps above for the full SKILL.md content.
BRAVE SEARCH MCP

Web Search Tool Use

Adds live web search as a callable tool — the AI model can run Brave Search queries during a conversation in response to tool calls from Open WebUI and other clients. LiteLLM spawns the server on demand via stdio using an API key you supply.

Step 1 — Get a Brave Search API key

Go to api.search.brave.com, create a free account, and generate an API key under the Data for AI plan (free tier supports up to 2,000 queries/month).

Step 2 — Confirm Node.js is installed system-wide

The MCP server is launched via npx. If you completed the Playwright MCP setup, Node.js is already installed system-wide and this step is done. Otherwise:

bash
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs
Verify: which npx — expected output: /usr/bin/npx

Step 3 — Add Brave Search MCP server in LiteLLM UI

Navigate to http://YOUR_NODE1_MGMT_IP:8001/ui → MCP Servers → Add New MCP Server. (LiteLLM lives on spark-01.)

Field · Value
Name · brave-search
Alias · brave-search
Transport Type · Standard Input/Output (stdio)

Set Stdio Configuration (JSON) — replace YOUR_BRAVE_API_KEY with your actual key:

json
{
  "command": "npx",
  "args": [
    "-y",
    "@modelcontextprotocol/server-brave-search"
  ],
  "env": {
    "BRAVE_API_KEY": "YOUR_BRAVE_API_KEY"
  }
}

Save and confirm Health Status shows Healthy.

If the health check fails, verify that npx resolves to /usr/bin/npx (system-wide install) and not a user-local path. See the Known Issues section — MCP stdio servers fail health check — for the full diagnosis.

Validation

In Open WebUI, send the following prompt:

Use the brave-search tool to find the latest news about NVIDIA and summarize the top three results.

Expected: the model calls the brave_web_search tool (shown as "Explored" in Open WebUI) and returns a summary drawn from live search results.

PLAYWRIGHT MCP

Browser Automation

Adds browser automation tool use to the stack — the AI model can navigate pages, take screenshots, and scrape content via tool calls in Open WebUI and other clients. LiteLLM spawns a headless Chromium process on demand via stdio; no persistent port is required.

Step 1 — Install Node.js system-wide

LiteLLM runs as a systemd service and does not source .bashrc. Node.js must be installed system-wide so npx is available in LiteLLM's PATH.

bash
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs
Verify: which npx — expected output: /usr/bin/npx
If Node.js was previously installed via nvm or a user-local installer, this step replaces it with a system-wide install. The Known Issues section documents the npx PATH problem in detail.

Step 2 — Install Playwright MCP Chromium browser

Chrome has no ARM64 build. Use Chromium, installed via the @playwright/mcp package's own browser installer — not via npx playwright install:

bash
npx @playwright/mcp install-browser chromium
Verify: ls ~/.cache/ms-playwright/ — expected: a chromium-XXXX directory is present.

Step 3 — Update litellm-config.yaml

Add model_info blocks to all model entries. Without these, LiteLLM does not advertise function calling support and tool calls will not execute.

yaml
model_list:
  - model_name: qwen3.5-122b
    litellm_params:
      model: openai/qwen3.5-122b
      api_base: http://localhost:8000/v1
      api_key: "not-needed"
      max_tokens: 8192
    model_info:
      supports_function_calling: true
      supports_tool_choice: true

Restart LiteLLM after saving:

bash
sudo systemctl restart litellm

Step 4 — Add Playwright MCP server in LiteLLM UI

Navigate to http://YOUR_NODE1_MGMT_IP:8001/ui → MCP Servers → Add New MCP Server. (LiteLLM lives on spark-01.)

Field · Value
Name · playwright
Alias · playwright
Transport Type · Standard Input/Output (stdio)

Set Stdio Configuration (JSON):

json
{
  "command": "npx",
  "args": [
    "-y",
    "@playwright/mcp@latest",
    "--browser",
    "chromium",
    "--headless"
  ]
}

Save and confirm Health Status shows Healthy.

Validation

In Open WebUI, send the following prompt:

Use the playwright tool to browse to cnn.com, take a screenshot, and tell me what you see.

Expected: the model calls the navigate and screenshot tools (shown as "Explored" in Open WebUI) and returns a summary of the page.

TROUBLESHOOTING

LiteLLM Admin UI

The LiteLLM proxy ships with a built-in web UI at /ui. It requires a master key and a PostgreSQL database — SQLite is not supported for the UI auth layer. The following documents every error encountered during setup, in order.

Step 1 — Set a master key

All commands below run on spark-01 (where LiteLLM lives). Add to ~/sparky-ai-stack/litellm-config.yaml:

yaml
general_settings:
  master_key: sk-yourkey
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"

Generate a secure key:

bash
echo "sk-$(openssl rand -hex 16)"

Step 2 — Add PostgreSQL to a compose file on spark-01

Run Postgres on the same node as LiteLLM. Putting it on spark-02 would re-introduce the very latency-disturbance pattern this architecture is designed to avoid. Add to a new ~/sparky-ai-stack/litellm-db.yml on spark-01:

yaml
services:
  litellm-db:
    image: postgres:16
    container_name: litellm-db
    restart: unless-stopped
    environment:
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=litellm
      - POSTGRES_DB=litellm
    ports:
      - "5432:5432"
    volumes:
      - litellm_db:/var/lib/postgresql/data

volumes:
  litellm_db:
bash
cd ~/sparky-ai-stack && docker compose -f litellm-db.yml up -d litellm-db
restart: unless-stopped combined with sudo systemctl enable docker ensures the container survives reboots automatically — no additional systemd unit needed.

Step 3 — Install Prisma

LiteLLM uses Prisma as its database ORM. It is not included in the base pip install litellm package.

bash
pip install prisma --break-system-packages

--break-system-packages bypasses a Python 3.12 restriction that prevents pip from installing into the system Python environment. It is safe on a dedicated AI server where system tools do not depend on conflicting packages.

Step 4 — Generate Prisma binaries

After installing the package, the binaries must be generated from LiteLLM's bundled schema:

bash
cd ~/.local/lib/python3.12/site-packages/litellm/proxy
prisma generate --schema schema.prisma

Step 5 — Apply the database schema

The Postgres database exists but has no tables yet. Push the schema. DATABASE_URL must be passed inline — Prisma reads it directly from the environment, not from litellm-config.yaml.

bash
DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
prisma db push --schema schema.prisma

Step 6 — Restart LiteLLM

bash
sudo systemctl daemon-reload
sudo systemctl restart litellm
sudo systemctl status litellm
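A quick post-restart check — a sketch; the journal grep is just a convenience filter, not an exhaustive health test:

bash
curl -s -o /dev/null -w "UI HTTP status: %{http_code}\n" http://localhost:8001/ui   # expect 200 or a redirect
journalctl -u litellm -n 30 --no-pager | grep -iE "postgres|prisma|error" || true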

Errors encountered in order

Error · Cause · Fix
Authentication Error, Not connected to DB · No PostgreSQL configured · Add database_url to general_settings
ModuleNotFoundError: No module named 'prisma' · Prisma not installed · pip install prisma --break-system-packages
Unable to find Prisma binaries · prisma generate not run · Run prisma generate --schema schema.prisma
The table 'public.LiteLLM_UserTable' does not exist · Schema not applied to DB · Run prisma db push --schema schema.prisma

Accessing the UI

Navigate to http://YOUR_NODE1_MGMT_IP:8001/ui. Username: admin. Password: your master_key value.

The UI loads and accepts login with the master key credentials.