NVIDIA DGX
At-Home AI Stack — split-trust shared compute
Two clustered Nvidia DGX Spark nodes (arm64, Ubuntu 24.04) sharing a 256 GB unified memory pool through tensor parallelism (TP=2) over Ray on a 200 Gb/s direct-attach copper interconnect — but with split ownership. spark-01 is your node (your LiteLLM, your Open WebUI, your Hermes Agent, your n8n, your Tailscale). spark-02 is the client's node (their LiteLLM, their Open WebUI, their n8n, their Tailscale). Both LiteLLM proxies talk to the shared vLLM endpoint at spark-01:8000 over the DAC; neither application stack sees the other. Read the Trust model section before deploying — this architecture has specific properties at the API layer that you should understand explicitly.
Architecture
Two physically separate DGX Spark nodes share a single tensor-parallel vLLM cluster (TP=2 over Ray on a 200 Gb/s DAC link) — but each node runs its own independent application stack owned by a different party. The diagram shows three logical layers: application stacks (top, separate per owner), LiteLLM proxies (middle, one per side, separate keys and logs), and the shared compute pool (bottom, TP=2 across both nodes, served by the vLLM head on spark-01:8000). Tailscale sits as a separate overlay on each node — the DAC link is its own private hardware and does not traverse Tailscale.
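As a rough text approximation of that diagram (arrows are request paths; the bottom link is tensor traffic, not text):

 your apps (Open WebUI, Hermes, n8n)        client apps (Open WebUI, n8n)
               |                                        |
       your LiteLLM :8001                       client LiteLLM :8001
               |                                        |
               | localhost:8000          198.51.100.1:8000 over the DAC
               +--------------------+-------------------+
                                    v
               vLLM head :8000 — TP rank 0 on spark-01
                                    ^
                                    |  NCCL/Ray collectives over the 200 Gb/s DAC
                                    v
                     Ray worker — TP rank 1 on spark-02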
Trust model
Read this before deploying. The split-trust architecture has specific properties at the API layer that you should understand explicitly. None of this is a new risk introduced by the cluster — it is just the same trust profile you accept any time you use a hosted inference API, made visible.
API-layer visibility — spark-01 sees all prompts
The vLLM head process runs on spark-01 and serves the OpenAI-compatible API on port 8000. Both LiteLLM proxies — yours and the client's — call this endpoint. That means the owner of spark-01 can, in principle, observe every raw prompt and every model output that crosses the API surface. This is structurally identical to the trust profile of any commercial hosted-inference provider (OpenAI, Anthropic, Together, etc.): the entity running the API server can see traffic at the API layer.
Tensor-layer isolation — spark-02 sees only floats
The Ray worker on spark-02 processes tensor activations, not text. It receives intermediate floating-point tensors over NCCL allreduce on the DAC link and contributes its share of the matrix multiplications. The client's node never sees readable prompts or completions; it only sees the mathematical operations its TP rank is responsible for. NCCL traffic on the DAC carries floats, not strings.
Application-layer isolation — fully separate stacks
Knowledge bases, chat history, RAG pipelines, vector indexes, API keys, request logs, and OAuth tokens are completely separate on each node. Your Open WebUI's database is on spark-01; the client's is on spark-02. Your LiteLLM master key is yours; the client's is the client's. Neither party has access to the other's application stack — there is no cross-mounted volume, no shared Postgres, no shared file system. The only thing that crosses the boundary is the inference call from the client's LiteLLM into spark-01:8000.
Network isolation — separate tailnets, private DAC
Each node joins its owner's Tailscale tailnet independently. ACLs on each tailnet are controlled by that owner. The DAC link (198.51.100.0/30) is private physical hardware between the two nodes — it is not routed through either Tailscale network and is not advertised on either tailnet. Tailscale carries application traffic only (clients reaching their own UIs); compute traffic stays on the DAC.
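A quick way to confirm this on either node — a minimal check, assuming the interface name and addressing from the Hardware topology table below:

# The DAC /30 should be a directly connected kernel route on the DAC interface —
# not learned via Tailscale and not behind the tailscale0 interface.
ip route show 198.51.100.0/30
# expected on spark-01: 198.51.100.0/30 dev enp1s0f0np0 proto kernel scope link src 198.51.100.1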
When this architecture is appropriate
- Both parties have a working relationship and have agreed to this arrangement.
- The data being sent for inference is not regulated (HIPAA / GDPR / SOC2 / PCI / etc.).
- Both parties accept a trust model equivalent to using any commercial hosted-inference API.
When additional agreements are required
- Either party handles regulated data — HIPAA, GDPR, SOC2, PCI-DSS, attorney-client privileged, or similar — in which case a written data processing agreement (DPA / BAA / equivalent) and audit controls are needed before traffic flows.
- Either party has contractual data handling requirements imposed by their own customers or regulators.
- The relationship is not pre-existing and the trust profile of "any hosted inference API" is not acceptable.
spark-01:8000 has no authentication. Tailscale ACLs and host firewall rules are what prevent the client from bypassing their LiteLLM and hitting the unauthenticated endpoint directly. See Step 06 (Tailscale) for the ACL configuration that enforces this.
Hardware topology
| Node | Owner | Mgmt IP | DAC IP | Services |
|---|---|---|---|---|
| spark-01 | You (private) | 192.0.2.21 | 198.51.100.1 | vLLM head (Ray master, TP rank 0), Your LiteLLM, Your Open WebUI, Your Hermes Agent, Your n8n, Your Tailscale |
| spark-02 | Client (separate ownership) | 192.0.2.22 | 198.51.100.2 | vLLM Ray worker (TP rank 1), Client LiteLLM, Client Open WebUI, Client n8n, Client Tailscale |
Interconnects
- DAC interconnect — `enp1s0f0np0`, MTU 9216, point-to-point `198.51.100.0/30`. Carries NCCL for tensor-parallel collectives, Ray control, and the client LiteLLM's inference calls into `spark-01:8000`. Not routed through either Tailscale network. (A configuration sketch follows this list.)
- Mgmt interconnect — `192.0.2.0/24` over RJ45, default routes. Used for SSH and node-bootstrap traffic during setup.
- Tailscale (each owner) — each node independently joins its owner's tailnet. Application traffic (browser → Open WebUI, Telegram → Hermes, etc.) traverses Tailscale. The DAC link is never advertised onto either tailnet.
- SSH — passwordless both directions between `spark-01` and `spark-02` at the mgmt IPs (required for the rsync step in Step 01). After setup, this can be locked down or removed.
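How the DAC /30 gets configured is outside the steps below (the nodes may already ship with it set up); as a hedged sketch, a netplan fragment for spark-01 would look roughly like this — interface name, MTU, and addressing taken from the list above, file name illustrative:

# /etc/netplan/60-dac.yaml on spark-01 (use 198.51.100.2/30 on spark-02)
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses: [198.51.100.1/30]
      mtu: 9216
# apply with: sudo netplan apply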
Architecture principles
- Shared compute, split application. The vLLM cluster is the only shared resource. Application stacks above it (LiteLLM, Open WebUI, Hermes, n8n, Tailscale) are duplicated and independently owned.
- Both LiteLLMs hit the same vLLM endpoint. Your LiteLLM uses `http://localhost:8000/v1`; the client's LiteLLM uses `http://198.51.100.1:8000/v1` over the DAC. Neither proxy goes through the other's stack.
- Separate keys, separate logs, separate data. Each LiteLLM has its own master key; each Open WebUI has its own knowledge bases and chat history. Nothing is shared at the application layer.
- Tailscale is per-owner. Two separate tailnets, two separate ACL policies. Cross-tailnet traffic only happens if both owners explicitly configure it (which by default they do not).
- Single-instance per side. No clustered Open WebUI / clustered n8n on either node. HA modes are documented in the appendix only.
Network worksheet
Fill these in once. Every code block on this page that contains a matching placeholder (YOUR_NODE1_MGMT_IP, YOUR_USERNAME, etc.) will be live-substituted with the value you type — and a yellow highlight shows you what was filled in. Values are saved to your browser's localStorage so reloads keep them. Master keys, API keys, and other secrets are deliberately not in this worksheet — fill those into the relevant code blocks manually so they never touch localStorage.
spark-01 — your node
spark-02 — client node
Shared / per-host
YOUR_MASTER_KEY, YOUR_CLIENT_MASTER_KEY, and YOUR_BRAVE_API_KEY are intentionally not in this worksheet — fill those into the relevant code blocks by hand, and don't paste them into a browser-stored field. The worksheet only handles network identifiers and your username.
Prerequisites
- Two Nvidia DGX Spark nodes — Grace CPU, GB10 GPU, arm64/aarch64, each running Ubuntu 24.04
- Each node referred to as `spark-01` (your node) and `spark-02` (client node) — substitute your own hostnames
- Both parties have read and accepted the Trust model section above
- Docker installed and enabled on both nodes: `sudo systemctl enable docker`
- Your Linux user added to the `docker` group on both nodes: `sudo usermod -aG docker YOUR_USERNAME && newgrp docker`
- 200 Gb/s DAC link between the two nodes (interface `enp1s0f0np0` on both, MTU 9216, point-to-point /30)
- Mgmt LAN reachability between both nodes (1 GbE RJ45 with default routes)
- Passwordless SSH both directions (`spark-01 ↔ spark-02`) — required for the HF cache rsync in Step 01
- Mgmt-IP entries in `/etc/hosts` on both nodes so hostnames resolve to mgmt addresses, not the DAC IP (commands below)
- Replace `YOUR_USERNAME` with your Linux username throughout
- Replace `YOUR_NODE1_MGMT_IP` / `YOUR_NODE2_MGMT_IP` with each node's mgmt IP, and `YOUR_NODE1_DAC_IP` / `YOUR_NODE2_DAC_IP` with each node's DAC IP
Bootstrap on both nodes — /etc/hosts and docker group
By default, the hostname of each node resolves to its DAC IP (198.51.100.x), not the mgmt IP. SSH from one node to the other by hostname will fail until you anchor the hostnames to mgmt IPs explicitly.
#### Run on both spark-01 AND spark-02
# Add mgmt-IP entries for both nodes
echo "YOUR_NODE1_MGMT_IP spark-01" | sudo tee -a /etc/hosts
echo "YOUR_NODE2_MGMT_IP spark-02" | sudo tee -a /etc/hosts
# Add your user to the docker group (then re-login or use newgrp)
sudo usermod -aG docker YOUR_USERNAME
newgrp docker
# Verify SSH by hostname both directions
ssh spark-01 hostname # from spark-02
ssh spark-02 hostname # from spark-01
Without the /etc/hosts step, the rsync of the Hugging Face cache between nodes (Step 01) and any later ssh spark-0X command will silently target the DAC interface — which won't have sshd bound to it unless you've changed defaults. The symptom is a "connection refused" or hang.
vLLM clustered — TP=2 over Ray on the DAC link
vLLM is the only clustered service. The model runs with tensor-parallel size 2: spark-01 hosts the Ray master and the vLLM head process; spark-02 hosts a Ray worker. NCCL traffic for tensor-parallel collectives flows over the DAC link (enp1s0f0np0, MTU 9216).
Production model and bootstrap fallback
| Track | Model | Notes |
|---|---|---|
| Production (default) | Qwen/Qwen3.5-122B-A10B-GPTQ-Int4 | The intended daily driver. Larger MoE (122B / A10B), GPTQ-Int4 quantized — meaningfully better quality than the 35B, still fits comfortably in the clustered 256 GB pool (~34 GB weights per node + KV cache). At --gpu-memory-utilization 0.80 the RayWorkerWrapper shows roughly ~96 GB resident on each node (weights + KV cache). On a cold cache, the GPTQ-Marlin JIT compile adds ~10–20 minutes to first launch — see Cluster issues. |
| Bootstrap fallback | Qwen/Qwen3.6-35B-A3B-FP8 | The model used to bring the cluster up the first time. 35B MoE / A3B activation, FP8 quantized. Useful for fast iteration on cluster wiring (Ray, NCCL, DAC) before committing to the longer 122B load. Switching back is a flag change — see "Bootstrap fallback" below. |
| Tested but does not fit | Qwen/Qwen3-235B-A22B-FP8 | Does not fit. 235B at FP8 ≈ 235 GB total ≈ 117.5 GB per node — leaves no room for KV cache or OS on a 121 GB usable per-node pool. Ray kills the worker with an OutOfMemoryError at the 95% memory threshold. Use a GPTQ-Int4 quantization (a 235B INT4 lands at ~58 GB/node) or, for daily use, the 122B above (sizing arithmetic sketched below the table). |
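The sizing rule behind those numbers is back-of-the-envelope arithmetic — weights only, per node, ignoring KV cache and runtime overhead (a sketch, not a capacity guarantee):

python3 - << 'EOF'
# per-node weight footprint ≈ params × bytes-per-param ÷ tensor-parallel degree
def per_node_gb(params_billion, bytes_per_param, tp=2):
    return params_billion * bytes_per_param / tp

print(f"235B @ FP8  (1.0 B/param): {per_node_gb(235, 1.0):6.1f} GB/node  # does not fit")
print(f"235B @ INT4 (0.5 B/param): {per_node_gb(235, 0.5):6.1f} GB/node  # fits")
print(f"122B @ INT4 (0.5 B/param): {per_node_gb(122, 0.5):6.1f} GB/node  # ~34 GB once overhead is added")
EOF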
Step 1a — Build the custom vLLM image on both nodes
The NGC image nvcr.io/nvidia/vllm:26.04-py3 ships without Ray. vLLM 0.19.0+nv26.04 hard-requires Ray for any multi-node inference — torch.distributed.run is not a substitute, because vLLM validates Ray at engine init regardless of launch method. You must build a custom image that adds Ray, on both nodes, before any cluster launch attempt. This is step zero, not optional.
#### Run on both spark-01 AND spark-02
mkdir -p ~/sparky-ai-stack
cat > ~/sparky-ai-stack/Dockerfile.vllm-spark << 'EOF'
FROM nvcr.io/nvidia/vllm:26.04-py3
RUN pip install ray --quiet
EOF
cd ~/sparky-ai-stack
docker build -f Dockerfile.vllm-spark -t vllm-spark:26.04 .
docker run --rm vllm-spark:26.04 ray --version — expected ray, version 2.x.x
Step 1b — Sync the Hugging Face cache to spark-02 over DAC
Both nodes need the model weights resident locally. Pull on spark-01 first (or use an existing ~/.cache/huggingface), then rsync to spark-02 across the DAC link.
#### spark-01
# Pre-fetch the production 122B GPTQ-Int4 weights on spark-01.
# Run as your normal user (e.g. cameron) — NOT root. See callout below.
hf download Qwen/Qwen3.5-122B-A10B-GPTQ-Int4 \
--local-dir ~/.cache/huggingface/hub/models--Qwen--Qwen3.5-122B-A10B-GPTQ-Int4
# Rsync the cache to spark-02 over the DAC link
rsync -avh --progress \
-e "ssh -o StrictHostKeyChecking=accept-new" \
~/.cache/huggingface/ \
YOUR_NODE2_DAC_IP:/home/YOUR_USERNAME/.cache/huggingface/
Watch the transfer saturate enp1s0f0np0 on spark-02 by running nload enp1s0f0np0 in another shell during the transfer.
Run hf download as your normal user, not root. The legacy huggingface-cli command is deprecated; use hf download from the new huggingface_hub CLI. If you previously ran the download as root, the target directory will be root-owned and subsequent runs as a normal user fail with permission denied. Fix: sudo rm -rf ~/.cache/huggingface/hub/models--Qwen--Qwen3.5-122B-A10B-GPTQ-Int4, then recreate it and rerun the download as your user.
Step 1c — Startup scripts on each node
Place a startup script on each node so the cluster can be brought up reproducibly.
#### spark-01 — ~/sparky-ai-stack/scripts/vllm-head.sh
mkdir -p ~/sparky-ai-stack/scripts
cat > ~/sparky-ai-stack/scripts/vllm-head.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail
# DAC IP of THIS node (spark-01)
HOST_IP="YOUR_NODE1_DAC_IP"
DAC_IFACE="enp1s0f0np0"
docker rm -f vllm-qwen-122b 2>/dev/null || true
docker run -d \
--name vllm-qwen-122b \
--network host \
--ipc host \
--gpus all \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e VLLM_HOST_IP="${HOST_IP}" \
-e NCCL_SOCKET_IFNAME="${DAC_IFACE}" \
-e GLOO_SOCKET_IFNAME="${DAC_IFACE}" \
-e RAY_ADDRESS="${HOST_IP}:6379" \
-v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
-v "${HOME}/sparky-ai-stack/vllm-compile-cache:/root/.cache/vllm/torch_compile_cache" \
--restart unless-stopped \
vllm-spark:26.04 \
bash -lc '
set -e
ray start --head \
--node-ip-address="'"${HOST_IP}"'" \
--port=6379 \
--dashboard-host=0.0.0.0 \
--num-gpus=1 \
--block &
# Was 20s previously — bumped to 60s so a simultaneous power-loss reboot of
# both nodes still gives the worker container time to come up and register
# before vLLM begins engine init. The until-loop below is the real guard,
# but the longer initial sleep avoids racing the loop on a cold boot.
sleep 60
until ray status >/dev/null 2>&1; do sleep 1; done
echo "[head] ray up, waiting for worker to join..."
until [ "$(ray status 2>/dev/null | grep -c '"'"'1.0/1.0 GPU'"'"')" -gt 1 ] || \
[ "$(ray status 2>/dev/null | grep -c '"'"'GPU'"'"')" -ge 2 ]; do sleep 2; done
echo "[head] worker joined, starting vllm serve"
exec vllm serve Qwen/Qwen3.5-122B-A10B-GPTQ-Int4 \
--served-model-name qwen3.5-122b \
--dtype auto \
--gpu-memory-utilization 0.80 \
--max-model-len 65536 \
--max-num-batched-tokens 4096 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--enable-chunked-prefill \
--enable-prefix-caching \
--max-num-seqs 32 \
--host 0.0.0.0 \
--port 8000 \
--reasoning-parser qwen3 \
--default-chat-template-kwargs '"'"'{"enable_thinking": false}'"'"' \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--distributed-executor-backend ray
'
EOF
chmod +x ~/sparky-ai-stack/scripts/vllm-head.sh
#### spark-02 — ~/sparky-ai-stack/scripts/vllm-worker.sh
mkdir -p ~/sparky-ai-stack/scripts
cat > ~/sparky-ai-stack/scripts/vllm-worker.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail
# DAC IPs
WORKER_IP="YOUR_NODE2_DAC_IP" # this node (spark-02)
HEAD_IP="YOUR_NODE1_DAC_IP" # ray head on spark-01
DAC_IFACE="enp1s0f0np0"
docker rm -f vllm-ray-worker 2>/dev/null || true
docker run -d \
--name vllm-ray-worker \
--network host \
--ipc host \
--gpus all \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e VLLM_HOST_IP="${WORKER_IP}" \
-e NCCL_SOCKET_IFNAME="${DAC_IFACE}" \
-e GLOO_SOCKET_IFNAME="${DAC_IFACE}" \
-v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
-v "${HOME}/sparky-ai-stack/vllm-compile-cache:/root/.cache/vllm/torch_compile_cache" \
--restart unless-stopped \
vllm-spark:26.04 \
bash -lc '
# Retry forever until the head is reachable; this is fine and expected
until ray start \
--address="'"${HEAD_IP}"':6379" \
--node-ip-address="'"${WORKER_IP}"'" \
--num-gpus=1 \
--block; do
echo "[worker] head not yet reachable, retrying in 3s..."
sleep 3
done
'
EOF
chmod +x ~/sparky-ai-stack/scripts/vllm-worker.sh
Both containers run with --network host and --ipc host so NCCL and Ray see the real DAC interface and shared memory. VLLM_HOST_IP is set to each node's DAC IP — see the warning under Step 1d for why this is required.
Step 1c-bis — Persist the vLLM compile cache (both nodes)
On first launch, GPTQ-Marlin and torch.compile run JIT compilation that takes 10–20 minutes silently per node — the log will appear to stall after the engine reports weights loaded. The compiled artifacts land at /root/.cache/vllm/torch_compile_cache inside the container. Because the container is ephemeral (we recreate it on every vllm-head.sh / vllm-worker.sh run), that cache is lost every time and the cluster re-pays the full compile cost on every restart.
Fix: mount a host directory into the container so the cache survives container recreation. The two docker run commands above already include the mount:
-v "${HOME}/sparky-ai-stack/vllm-compile-cache:/root/.cache/vllm/torch_compile_cache"
Create the host directory on each node before the first launch, otherwise Docker will create it root-owned and subsequent unprivileged access will fail:
mkdir -p ~/sparky-ai-stack/vllm-compile-cache # run on BOTH spark-01 and spark-02
If you later switch models or upgrade the vLLM image, clear the cache with rm -rf ~/sparky-ai-stack/vllm-compile-cache/* — or you'll see kernel-shape mismatches at engine init.
Step 1d — Launch order: worker first, then head
Start the worker container first, then the head. This ordering is intentional:
- The worker's `ray start --address …:6379` will retry forever until the head's GCS comes up — this is the expected path and is harmless.
- If the head starts first and vLLM begins engine init before the worker has joined Ray, the placement group fires with only one GPU visible and the run hangs in an indefinite allocation failure with no clean error.
- The head script above blocks `vllm serve` behind a Ray-status check that waits for two GPUs to be registered, which makes the launch idempotent.
#### spark-02 — start the worker
~/sparky-ai-stack/scripts/vllm-worker.sh
docker logs -f vllm-ray-worker # leave open in another shell
Wait until the worker logs print Ray runtime started and stop spamming "head not yet reachable" retries (the head isn't up yet, but the container will keep trying).
#### spark-01 — start the head
~/sparky-ai-stack/scripts/vllm-head.sh
docker logs -f vllm-qwen-122b
The head will: (1) start Ray as the master, (2) wait for the worker to register a second GPU into the cluster, then (3) start vllm serve. Allow ~5–8 minutes for the GPTQ-Int4 weights to load on both nodes before the engine is ready. On a cold cache the GPTQ-Marlin JIT compile silently adds another 10–20 minutes after weight loading completes — the log will appear to stall after ray_env.py:111 while the kernels compile inside the RayWorkerWrapper. This is expected; see Cluster issues. Compiled artifacts are cached so subsequent startups skip this step.
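If you want an unattended readiness check instead of watching the log, a small polling loop against the OpenAI endpoint works — a sketch; adjust the sleep to taste:

# Run on spark-01; exits once vLLM answers /v1/models (i.e. engine init and any JIT compile have finished)
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "vLLM not ready yet, waiting 30s..."
  sleep 30
done
echo "vLLM is serving."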
Step 1e — Verification
#### spark-01 — Ray cluster status
docker exec vllm-qwen-122b ray status
Expected: 2.0/2.0 GPU, both DAC IPs listed (YOUR_NODE1_DAC_IP and YOUR_NODE2_DAC_IP)
#### spark-01 — model endpoint
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"qwen3.5-122b","messages":[{"role":"user","content":"hi"}],"max_tokens":16}'
{"data":[{"id":"qwen3.5-122b",...}]} on the first call, a streamed completion on the second#### Both nodes — GPU residency (GB10 quirk)
GB10 (Grace Blackwell) uses unified memory. The standard --query-gpu=memory.used,memory.total fields return [N/A] on this hardware — that is expected, not a bug. Use plain nvidia-smi and read the Processes section instead:
nvidia-smi # run on each node
You should see a RayWorkerWrapper (or vllm) process on each node with roughly ~96 GB resident at --gpu-memory-utilization 0.80 with the 122B GPTQ-Int4 model loaded (≈34 GB of weights per node + KV cache).
#### spark-02 — NCCL traffic on the DAC during inference
nload enp1s0f0np0 # watch this while the curl chat-completion above runs
You should see Gb/s-class spikes on the DAC during decoding. If you see traffic on a different interface, NCCL_SOCKET_IFNAME didn't propagate — see the cluster troubleshooting section.
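To see which interface NCCL actually selected, you can temporarily turn on verbose NCCL logging — a debugging sketch; remove the flag once diagnosed:

# Add to the docker run in vllm-head.sh / vllm-worker.sh, next to the other -e flags:
#   -e NCCL_DEBUG=INFO
# Then relaunch and grep the head log for the socket/interface selection lines:
docker logs vllm-qwen-122b 2>&1 | grep -i "NCCL INFO" | grep -iE "socket|net" | head
# The interface named there should be enp1s0f0np0 on both ranks.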
Bootstrap fallback — the 35B FP8 model
The Qwen/Qwen3.6-35B-A3B-FP8 model is the original bootstrap model and remains useful for fast iteration on cluster wiring (Ray, NCCL, DAC) before committing to the longer 122B load. If the cache is already populated with the 35B FP8 weights, swap the vllm serve line in ~/sparky-ai-stack/scripts/vllm-head.sh and the container name (vllm-qwen-122b → vllm-qwen-35b):
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
--served-model-name qwen3.6-35b \
--dtype auto \
--gpu-memory-utilization 0.80 \
--max-model-len 131072 \
--max-num-batched-tokens 4096 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--enable-chunked-prefill \
--enable-prefix-caching \
--max-num-seqs 32 \
--host 0.0.0.0 \
--port 8000 \
--reasoning-parser qwen3 \
--default-chat-template-kwargs '{"enable_thinking": false}' \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--distributed-executor-backend ray
Update --served-model-name in the LiteLLM config in Step 02 (and the client's LiteLLM in Step 03) if you fall back. Expect ~97 GB resident per node at FP8 instead of ~96 GB at GPTQ-Int4.
Your LiteLLM proxy on spark-01
This is your LiteLLM proxy — your master key, your SQLite log corpus, your routing rules. It serves only your application stack on spark-01 (your Open WebUI, your Hermes, your n8n). The client gets their own separate LiteLLM in Step 03.
Your LiteLLM lives on the same node as the vLLM head and points at localhost:8000. The clustered vLLM presents one logical OpenAI-compatible endpoint — LiteLLM doesn't need to know there are two physical nodes behind it.
#### spark-01 — directories and install
mkdir -p ~/sparky-ai-stack/logs
cd ~/sparky-ai-stack
pip3 install litellm[proxy] --break-system-packages
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
~/.local/bin/litellm --version
#### spark-01 — config (single backend pointing at the local clustered vLLM)
cat > ~/sparky-ai-stack/litellm-config.yaml << 'EOF'
model_list:
- model_name: qwen3.5-122b
litellm_params:
model: openai/qwen3.5-122b
api_base: http://localhost:8000/v1
api_key: "not-needed"
litellm_settings:
verbose: true
database:
type: sqlite
path: /home/YOUR_USERNAME/sparky-ai-stack/logs/litellm.db
log_config:
level: INFO
format: json
filepath: /home/YOUR_USERNAME/sparky-ai-stack/logs/litellm.log
router_settings:
num_retries: 0
timeout: 600
EOF
If you are migrating from an earlier multi-backend setup (e.g. a second vLLM on localhost:8002 for an "expert" model), delete every model and fallback entry that points at a port no longer running. LiteLLM will throw httpx.ConnectError on startup if any configured backend is unreachable.
#### spark-01 — systemd service
sudo tee /etc/systemd/system/litellm.service << 'EOF'
[Unit]
Description=LiteLLM Proxy
After=network.target docker.service
Wants=docker.service
[Service]
Type=simple
User=YOUR_USERNAME
WorkingDirectory=/home/YOUR_USERNAME/sparky-ai-stack
ExecStart=/home/YOUR_USERNAME/.local/bin/litellm --config litellm-config.yaml --port 8001 --host 0.0.0.0
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable litellm
sudo systemctl start litellm
#### spark-01 — clean stale override.conf (only if upgrading from a previous setup)
If LiteLLM was previously run with STORE_MODEL_IN_DB=True and a DATABASE_URL in a systemd drop-in, those env vars persist even after you remove them from litellm-config.yaml. LiteLLM will fail to start with httpx.ConnectError against a Postgres that may no longer exist. Reset the drop-in:
sudo mkdir -p /etc/systemd/system/litellm.service.d
sudo bash -c 'cat > /etc/systemd/system/litellm.service.d/override.conf << EOF
[Service]
Environment=PYTHONPATH=/home/YOUR_USERNAME/.local/lib/python3.12/site-packages
EOF'
sudo systemctl daemon-reload
sudo systemctl restart litellm
sudo systemctl status litellm --no-pager
curl http://localhost:8001/v1/models (no key needed if you didn't set master_key yet, otherwise use -H "Authorization: Bearer YOUR_MASTER_KEY")
Your LiteLLM must not be reachable from spark-02. If you are using Tailscale ACLs (Step 06), only your tailnet should reach spark-01:8001. The client uses their own LiteLLM (Step 03) — they never call yours.
Step 2d — PostgreSQL for the Admin UI and virtual keys (required)
The LiteLLM Admin UI and virtual-key generation both require PostgreSQL. SQLite is not supported — LiteLLM's Prisma schema is hardcoded for PostgreSQL, and the UI route returns table public.LiteLLM_UserTable does not exist if you try to point it at SQLite. The SQLite block in the config above is fine for request logging; it is not a substitute for the metadata DB.
#### spark-01 — bring up the litellm-db postgres container
On spark-01 the postgres container lives in ~/sparky-ai-stack/docker-compose.yml alongside n8n and hermes-webui:
services:
litellm-db:
image: postgres:16
container_name: litellm-db
restart: unless-stopped
environment:
- POSTGRES_USER=litellm
- POSTGRES_PASSWORD=litellm
- POSTGRES_DB=litellm
volumes:
- litellm_db:/var/lib/postgresql/data
ports:
- "5432:5432"
volumes:
litellm_db:
cd ~/sparky-ai-stack
docker compose up -d litellm-db
docker compose ps litellm-db
#### spark-01 — apply the Prisma schema
Install the Prisma CLI if missing, then push the LiteLLM Prisma schema into the new database. This must run once on spark-01 before LiteLLM starts, otherwise the UI will return table public.LiteLLM_UserTable does not exist.
pip install prisma --break-system-packages
DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
prisma db push \
--schema /home/YOUR_USERNAME/.local/lib/python3.12/site-packages/litellm/proxy/schema.prisma
Expected output: Your database is now in sync with your Prisma schema. Done in <Ns>
#### spark-01 — wire the database into litellm-config.yaml
Add the database_url to general_settings (not at the top level — see troubleshooting) and enable model-in-DB storage so the UI can edit the model list:
general_settings:
master_key: YOUR_MASTER_KEY
database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
litellm_settings:
store_model_in_db: true
Restart and verify:
sudo systemctl restart litellm
sudo systemctl status litellm --no-pager
curl -s http://localhost:8001/health/readiness | head
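Virtual keys can also be minted from the CLI against the proxy's key-management API instead of the UI — a hedged sketch (the alias and model list are illustrative; authenticate with the master key):

curl -s http://localhost:8001/key/generate \
  -H "Authorization: Bearer YOUR_MASTER_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"key_alias": "open-webui", "models": ["qwen3.5-122b"]}'
# The response contains a "key": "sk-..." value — that is the virtual key to paste into the downstream service.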
Open http://spark-01:8001/ui — log in with admin and your master key. Generate per-service virtual keys from the Virtual Keys tab (one for Open WebUI, one for n8n, one for Hermes — never paste the master key into a downstream service).
Client LiteLLM proxy on spark-02
The client gets their own LiteLLM proxy on spark-02, with their own master key, their own log corpus, and their own routing rules. It points at the shared vLLM endpoint over the DAC link. This is not a copy of spark-01's LiteLLM — it has no shared config, no shared key, no shared logs. The client controls their own master key and never shares it with you.
If you are setting up spark-02 on behalf of the client, hand off the master-key generation step (or have them rotate the key the moment they take over). The point of split trust is that you do not hold the client's API credentials.
#### spark-02 — install
mkdir -p ~/sparky-ai-stack/logs
cd ~/sparky-ai-stack
pip3 install litellm[proxy] --break-system-packages
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
#### spark-02 — generate the client master key
Run on the client's terminal — store this key only on spark-02:
echo "sk-client-$(openssl rand -hex 16)"
#### spark-02 — config (points at vLLM over the DAC)
cat > ~/sparky-ai-stack/litellm-config.yaml << 'EOF'
model_list:
- model_name: qwen3.5-122b
litellm_params:
model: openai/qwen3.5-122b
api_base: http://YOUR_NODE1_DAC_IP:8000/v1 # shared vLLM, over DAC
api_key: "not-needed"
litellm_settings:
verbose: true
database:
type: sqlite
path: /home/YOUR_USERNAME/sparky-ai-stack/logs/litellm.db
log_config:
level: INFO
format: json
filepath: /home/YOUR_USERNAME/sparky-ai-stack/logs/litellm.log
general_settings:
master_key: YOUR_CLIENT_MASTER_KEY # set to the sk-client-... value above
router_settings:
num_retries: 0
timeout: 600
EOF
Keep the api_base pointed at the DAC IP — don't switch it to YOUR_NODE1_MGMT_IP:8000 unless the DAC is down.
#### spark-02 — systemd service (independent of spark-01)
sudo tee /etc/systemd/system/litellm.service << 'EOF'
[Unit]
Description=LiteLLM Proxy (client)
After=network.target docker.service
Wants=docker.service
[Service]
Type=simple
User=YOUR_USERNAME
WorkingDirectory=/home/YOUR_USERNAME/sparky-ai-stack
ExecStart=/home/YOUR_USERNAME/.local/bin/litellm --config litellm-config.yaml --port 8001 --host 0.0.0.0
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable litellm
sudo systemctl start litellm
sudo systemctl status litellm --no-pager
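If you'd rather not keep the client master key inside litellm-config.yaml on disk (Step 5b suggests a key file at ~/.sparky02-litellm-key), one approach is a systemd drop-in plus LiteLLM's os.environ/ reference syntax — a sketch; the drop-in filename and variable name are illustrative:

# 1. Key file (client-owned, chmod 600) — KEY=value format for EnvironmentFile=
echo "LITELLM_MASTER_KEY=sk-client-..." > ~/.sparky02-litellm-key && chmod 600 ~/.sparky02-litellm-key
# 2. systemd drop-in that loads it into the service environment
sudo mkdir -p /etc/systemd/system/litellm.service.d
sudo tee /etc/systemd/system/litellm.service.d/secrets.conf << 'EOF'
[Service]
EnvironmentFile=/home/YOUR_USERNAME/.sparky02-litellm-key
EOF
# 3. In litellm-config.yaml, reference the env var instead of the literal key:
#      general_settings:
#        master_key: os.environ/LITELLM_MASTER_KEY
sudo systemctl daemon-reload && sudo systemctl restart litellm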
Verification — confirm the request hits spark-01:8000
#### spark-02 — local LiteLLM responds with the client key
curl http://localhost:8001/v1/models \
-H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY"
#### spark-02 — inference round-trip through the shared backend
curl http://localhost:8001/v1/chat/completions \
-H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY" \
-H 'Content-Type: application/json' \
-d '{"model":"qwen3.5-122b","messages":[{"role":"user","content":"hi from client"}],"max_tokens":16}'
On spark-01, tail -f ~/sparky-ai-stack/logs/litellm.log shows nothing — your LiteLLM is not in the path. Instead, run docker logs --tail 20 vllm-qwen-122b on spark-01 — you should see the new request reach the vLLM head.
Step 3d — PostgreSQL for the client Admin UI and virtual keys (required)
Same constraint as Step 02: the LiteLLM Admin UI and virtual-key generation require PostgreSQL — SQLite is not supported because LiteLLM's Prisma schema is hardcoded for PostgreSQL. spark-02 doesn't run a docker-compose stack, so we use a standalone postgres container.
#### spark-02 — standalone postgres container
docker run -d --name litellm-db --restart unless-stopped \
-e POSTGRES_USER=litellm \
-e POSTGRES_PASSWORD=litellm \
-e POSTGRES_DB=litellm \
-p 5432:5432 \
-v litellm_db:/var/lib/postgresql/data \
postgres:16
#### spark-02 — apply the Prisma schema
pip install prisma --break-system-packages
DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
prisma db push \
--schema /home/YOUR_USERNAME/.local/lib/python3.12/site-packages/litellm/proxy/schema.prisma
Expected output: Your database is now in sync with your Prisma schema.
#### spark-02 — wire the database into the client litellm-config.yaml
database_url goes under general_settings, alongside the existing master_key. Add store_model_in_db: true under litellm_settings:
general_settings:
master_key: YOUR_CLIENT_MASTER_KEY
database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
litellm_settings:
store_model_in_db: true
sudo systemctl restart litellm
sudo systemctl status litellm --no-pager
Open http://spark-02:8001/ui — log in with admin and the client master key. The client generates their own per-app virtual keys from the Virtual Keys tab; you never see them.
If a virtual key was generated before the schema was pushed (i.e. before prisma db push), the old key will appear in the UI but lookups will fail with Virtual key not found in LiteLLM_VerificationTokenTable. Delete it in the UI, restart LiteLLM, then generate a new one.
Your Open WebUI on spark-01
Your daily-driver chat interface, owned by you, on your node. It points at your LiteLLM at http://localhost:8001/v1. The client gets their own Open WebUI on spark-02 in Step 05 — neither side can see the other's chat history, knowledge bases, RAG documents, or API keys.
#### spark-01 — directory and run
mkdir -p ~/sparky-ai-stack
cd ~/sparky-ai-stack
docker run -d \
--name open-webui \
--restart unless-stopped \
-p 8080:8080 \
-v open-webui:/app/backend/data \
-e OPENAI_API_BASE_URL="http://host.docker.internal:8001/v1" \
-e OPENAI_API_KEY="YOUR_MASTER_KEY" \
-e WEBUI_AUTH=True \
-e ENABLE_OLLAMA_API=False \
--add-host=host.docker.internal:host-gateway \
ghcr.io/open-webui/open-webui:main
Visit http://localhost:8080 from spark-01 (or via your tailnet — see Step 06), create your admin account, and confirm in Settings → Connections → OpenAI API:
| Setting | Value |
|---|---|
| API Base URL | http://host.docker.internal:8001/v1 |
| API Key | your master_key |
| Default model | qwen3.5-122b |
| Memory | Toggle ON (Settings → Personalization) |
Client Open WebUI on spark-02
The client's daily-driver chat interface, owned by the client, on the client's node. It points at the client's LiteLLM at http://localhost:8001/v1 — which in turn calls the shared vLLM head on spark-01:8000 over the DAC.
The client's data lives on the client's node. Their Open WebUI database, knowledge bases, RAG document store, embedding indexes, conversation history, attached files, and account list — all of it is in the open-webui Docker volume on spark-02. None of it is replicated to spark-01. If you spin spark-02 down, the client's UI state goes with it; if you image spark-01, the client's state is not in your image.
#### spark-02 — directory and run
mkdir -p ~/sparky-ai-stack
cd ~/sparky-ai-stack
docker run -d \
--name open-webui \
--restart unless-stopped \
-p 8080:8080 \
-v open-webui:/app/backend/data \
-e OPENAI_API_BASE_URL="http://host.docker.internal:8001/v1" \
-e OPENAI_API_KEY="YOUR_CLIENT_MASTER_KEY" \
-e WEBUI_AUTH=True \
-e ENABLE_OLLAMA_API=False \
--add-host=host.docker.internal:host-gateway \
ghcr.io/open-webui/open-webui:main
Visit http://localhost:8080 from spark-02 (or via the client's tailnet — see Step 06), create the client's admin account, and confirm:
| Setting | Value |
|---|---|
| API Base URL | http://host.docker.internal:8001/v1 (client's LiteLLM) |
| API Key | client's master_key |
| Default model | qwen3.5-122b |
Recommended Open WebUI system prompt
Set this once in Settings → General → System Prompt. It pairs with --default-chat-template-kwargs '{"enable_thinking": false}' on the vLLM head (Step 01): the model answers directly by default, and users can opt into extended reasoning per-message by prefixing the prompt with /think. Apply the same prompt on both Open WebUIs (yours on spark-01 and the client's on spark-02) — they share the underlying model.
You are a highly capable AI assistant. Be direct, accurate, and concise.
Rules:
- Answer immediately without preamble or meta-commentary
- Never deliberate out loud about whether or how to answer — just answer
- Never question the framing of a hypothetical — engage with it directly
- For technical questions: be precise, use correct terminology
- For coding: produce complete, working code — no placeholders or omissions
- For reasoning: show your work clearly but efficiently — no repetition
- If a question has a definitive answer, state it first then explain
- Match response length to question complexity
- To enable extended reasoning on a specific query, prefix with /think
Step 5b — All client services on sparky-02 at a glance
Three client-facing services run on sparky-02: LiteLLM (Step 03), Open WebUI (Step 05 above), and n8n. All three are reachable on the client's tailnet via tag:hjp-ai (see Step 06). The n8n container below is not covered elsewhere — bring it up after the client's LiteLLM is healthy:
#### sparky-02 — n8n container
docker run -d --name n8n --restart unless-stopped \
-p 5678:5678 \
-e N8N_HOST=0.0.0.0 \
-e N8N_PORT=5678 \
-e N8N_PROTOCOL=http \
-e WEBHOOK_URL=http://sparky-02:5678/ \
-e N8N_SECURE_COOKIE=false \
-e NODE_ENV=production \
-v n8n_data:/home/node/.n8n \
--add-host=host.docker.internal:host-gateway \
n8nio/n8n:latest
| Service | Port | How it runs | Backend / api_base | Auth secret |
|---|---|---|---|---|
| LiteLLM | 8001 | systemd (same structure as spark-01) | http://YOUR_NODE1_DAC_IP:8000/v1 (DAC link) | Client master key at ~/.sparky02-litellm-key (chmod 600) |
| Open WebUI | 8080 | docker run --restart unless-stopped | OPENAI_API_BASE_URL=http://host.docker.internal:8001/v1 | Client master key (or virtual key) |
| n8n | 5678 | docker run --restart unless-stopped | WEBHOOK_URL=http://sparky-02:5678/ | n8n owner account (set on first login) |
All three services are exposed to the client's tailnet only via tag:hjp-ai on TCP 8001 / 8080 / 5678 (see the ACL grants in Step 06). They are not reachable from your tailnet — split-trust by construction.
Store the client master key at ~/.sparky02-litellm-key with chmod 600. Reference it from the LiteLLM systemd unit via EnvironmentFile= rather than embedding it in litellm-config.yaml, so the config file on disk does not contain the secret.
Tailscale (both nodes, separate tailnets)
Each node joins its owner's tailnet independently. Two separate tailnets, two separate ACL policies, two separate sets of users. The DAC link (198.51.100.0/30) is private physical hardware between the two nodes — it is not advertised onto either tailnet, and it is not used for any cross-tailnet routing.
spark-01:8000 has no authentication. Once Tailscale is configured, ensure your Tailscale ACLs do not expose port 8000 to the client's tailnet (and the client's ACLs do not expose your node's port 8000 to anyone either). The client must only reach spark-02:8001 (their own LiteLLM). If they can reach spark-01:8000 directly, they bypass their LiteLLM entirely and have unauthenticated inference access — which also means no key-scoped logging, no rate limit, and no audit trail.
spark-01 — your tailnet
#### spark-01 — install + join
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --hostname=spark-01 --advertise-tags=tag:owner
tailscale ip -4 # note this address — your apps will be reachable here
In your tailnet's ACL policy (Tailscale admin console), expose your Open WebUI, your LiteLLM, and your other apps only to your own users. Example ACL fragment:
{
"acls": [
{ "action": "accept",
"src": ["group:owner-users"],
"dst": ["tag:owner:8080", "tag:owner:8001", "tag:owner:5678", "tag:owner:8787"]
},
{ "action": "accept",
"src": ["group:owner-users"],
"dst": ["tag:owner:22"]
}
],
"tagOwners": {
"tag:owner": ["YOU@example.com"]
},
"groups": {
"group:owner-users": ["YOU@example.com"]
}
}
Do NOT add tag:owner:8000 to any allow rule. Port 8000 (vLLM) is unauthenticated and must remain reachable only from localhost (your LiteLLM in Step 02) and the DAC IP 198.51.100.1 (the client's LiteLLM in Step 03).
spark-02 — client's tailnet
#### spark-02 — install + join (client's auth key)
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up \
--authkey=<client-auth-key> \
--hostname=sparky-02 \
--advertise-tags=tag:hjp-ai
tailscale ip -4 # client's apps will be reachable here, only on their tailnet
The client's tailnet ACLs are theirs to author. Mirror the structure above with their own users and tags. The client should expose :8080 (their Open WebUI), :8001 (their LiteLLM, only if they want programmatic access from elsewhere), and :5678 (their n8n).
Disable key expiry for sparky-02 in the client's Tailscale admin (Machines → sparky-02 → ⋯ → Disable key expiry). Server nodes shouldn't drop off the tailnet on a 90-day timer; auth key rotation should be a deliberate action.
#### Client tailnet ACL — replace the default allow-all
The default Tailscale policy allows everything between everyone. Replace it with this grants-based policy. It leaves all non-server devices unrestricted (so the client's existing fleet is untouched), keeps tag:hjp-dell-server open (their existing Dell server stays as-is), and restricts tag:hjp-ai (this node) to only the three service ports — 8001 (LiteLLM), 8080 (Open WebUI), 5678 (n8n).
{
"grants": [
{
"src": ["*"],
"dst": ["autogroup:member"],
"ip": ["*"]
},
{
"src": ["*"],
"dst": ["tag:hjp-dell-server"],
"ip": ["*"]
},
{
"src": ["*"],
"dst": ["tag:hjp-ai"],
"ip": ["tcp:8001", "tcp:8080", "tcp:5678"]
}
],
"tagOwners": {
"tag:hjp-ai": ["autogroup:admin"],
"tag:hjp-dell-server":["autogroup:admin"]
}
}
Verify from a device on the client's tailnet: nc -zv sparky-02 8001, nc -zv sparky-02 8080, and nc -zv sparky-02 5678 all succeed. nc -zv sparky-02 22 (SSH) and nc -zv sparky-02 8000 (raw vLLM) both fail — proving the ACL is in effect.
Note that tag:hjp-ai is intentionally not given tcp:8000. Port 8000 is the raw, unauthenticated vLLM head on spark-01 reached over the DAC; it must never be exposed onto the client's tailnet. The client only ever calls their own LiteLLM on sparky-02:8001.
Lock down host firewalls
Tailscale ACLs are policy; the host firewall is enforcement. Apply ufw rules on both nodes so that even if Tailscale is misconfigured, ports 8000 and 8001 cannot leak to the wrong network.
#### spark-01 — host firewall
# Allow from your tailnet (interface tailscale0) and DAC peer only
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow in on tailscale0 to any port 8001 proto tcp # your LiteLLM
sudo ufw allow in on tailscale0 to any port 8080 proto tcp # your Open WebUI
sudo ufw allow in on tailscale0 to any port 5678 proto tcp # your n8n
sudo ufw allow in on tailscale0 to any port 8787 proto tcp # your Hermes WebUI
sudo ufw allow in on enp1s0f0np0 from YOUR_NODE2_DAC_IP to any port 8000 proto tcp # client LiteLLM → vLLM only
sudo ufw allow in on enp1s0f0np0 from YOUR_NODE2_DAC_IP to any port 6379 proto tcp # Ray GCS over DAC
sudo ufw allow ssh # mgmt LAN ssh
sudo ufw enable
#### spark-02 — host firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow in on tailscale0 to any port 8001 proto tcp # client LiteLLM
sudo ufw allow in on tailscale0 to any port 8080 proto tcp # client Open WebUI
sudo ufw allow in on tailscale0 to any port 5678 proto tcp # client n8n
sudo ufw allow in on enp1s0f0np0 from YOUR_NODE1_DAC_IP # NCCL/Ray over DAC
sudo ufw allow ssh
sudo ufw enable
Verify: from a tailnet device, nc -zv spark-01-tailscale-ip 8000 should fail with "connection refused" or "filtered". From the client's tailnet, curl http://spark-02-tailscale-ip:8001/v1/models -H "Authorization: Bearer CLIENT_KEY" should succeed.
Your Hermes Agent on spark-01
Hermes is your autonomous agent layer (skills, memory, cron, gateways). It runs on your node and talks to your LiteLLM. The client does not get a Hermes — they have their own Open WebUI and n8n on spark-02; if they want an agent they install their own.
#### spark-01 — install
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
hermes --version
#### spark-01 — setup wizard
hermes setup
| Prompt | Answer |
|---|---|
| Setup type | Full setup |
| Provider | Custom endpoint |
| API base URL | http://localhost:8001/v1 |
| API key | your master_key |
| Model | qwen3.5-122b |
| Terminal backend | Local |
| Session reset mode | Inactivity + daily reset |
| Search provider | Firecrawl Self-Hosted (or skip) |
| Launch chat now? | n |
The wizard writes ~/.hermes/config.yaml. base_url is local — Hermes and your LiteLLM both live on spark-01.
Telegram gateway
Create a bot via @BotFather (/newbot, copy the token) and get your user ID from @userinfobot. Then:
hermes setup gateway # select Telegram, paste token, paste user ID, choose System service
sudo /home/YOUR_USERNAME/.local/bin/hermes gateway install --system
sudo systemctl start hermes-gateway
sudo systemctl status hermes-gateway --no-pager
Hermes WebUI (Docker, spark-01)
#### spark-01 — docker-compose for hermes-webui
cat > ~/sparky-ai-stack/hermes-webui.yml << 'EOF'
services:
hermes-webui:
image: ghcr.io/nesquena/hermes-webui:latest
container_name: hermes-webui
restart: unless-stopped
extra_hosts:
- "host.docker.internal:host-gateway"
environment:
- WANTED_UID=1000
- WANTED_GID=1000
- HERMES_WEBUI_STATE_DIR=/home/hermeswebui/.hermes/webui-mvp
volumes:
- /home/YOUR_USERNAME/.hermes:/home/hermeswebui/.hermes
- /home/YOUR_USERNAME/workspace:/workspace
ports:
- "8787:8787"
EOF
mkdir -p ~/workspace
docker compose -f ~/sparky-ai-stack/hermes-webui.yml up -d
Open http://localhost:8787 on spark-01 (or your tailnet hostname).
Your n8n on spark-01
Your single-instance n8n on spark-01. The client runs their own n8n on spark-02 independently — different workflows, different credentials, different Postgres-vs-SQLite state. Neither side can read the other's flows.
#### spark-01 — docker-compose for n8n
cat > ~/sparky-ai-stack/n8n.yml << 'EOF'
services:
n8n:
image: n8nio/n8n:latest
container_name: n8n
restart: unless-stopped
ports:
- "5678:5678"
environment:
- N8N_HOST=0.0.0.0
- N8N_PORT=5678
- N8N_PROTOCOL=http
- WEBHOOK_URL=http://YOUR_TAILNET_HOSTNAME:5678/
- N8N_SECURE_COOKIE=false
- NODE_ENV=production
- GENERIC_TIMEZONE=America/Los_Angeles
volumes:
- n8n_data:/home/node/.n8n
extra_hosts:
- "host.docker.internal:host-gateway"
volumes:
n8n_data:
EOF
docker compose -f ~/sparky-ai-stack/n8n.yml up -d
Open http://localhost:5678 on spark-01 (or your tailnet hostname) and create the owner account.
Wire your n8n to your LiteLLM
In n8n, add an OpenAI credential pointing at your LiteLLM (local on spark-01):
| Field | Value |
|---|---|
| API URL | http://host.docker.internal:8001/v1 |
| API Key | your master_key |
| Default model | qwen3.5-122b |
The client's n8n on spark-02 follows the same pattern but points at their LiteLLM (http://host.docker.internal:8001/v1 from inside their n8n container) with the client's master key. Step 03 covers the client's LiteLLM; the client deploys their n8n the same way you deploy yours.
Stack Validation Checklist
Both sides have to pass independently, and each side has to fail in the right places (you should not be able to reach the client's stack, and vice versa). Every item is labeled with the node it should be run from.
Shared compute pool
- spark-01: `docker exec vllm-qwen-122b ray status` shows 2 nodes and 2.0/2.0 GPU, with both DAC IPs (`YOUR_NODE1_DAC_IP` and `YOUR_NODE2_DAC_IP`) listed.
- spark-01: `nvidia-smi` Processes section shows a `RayWorkerWrapper` at ~96 GB (122B GPTQ-Int4 at `--gpu-memory-utilization 0.80`). The `--query-gpu=memory.used` / `memory.total` fields return `[N/A]` on GB10 — expected, use the Processes section instead.
- spark-02: `nvidia-smi` Processes section shows a `RayWorkerWrapper` at ~96 GB.
- spark-01: `curl http://localhost:8000/v1/models` returns `qwen3.5-122b` (vLLM head).
Your side (spark-01)
- spark-01: `curl http://localhost:8001/v1/models -H "Authorization: Bearer YOUR_MASTER_KEY"` returns `qwen3.5-122b` through your LiteLLM.
- spark-01: `sudo systemctl status litellm hermes-gateway` — both active (running).
- spark-01: `docker ps` shows `vllm-qwen-122b`, `open-webui`, `hermes-webui`, and `n8n` all Up.
- spark-01: Open `http://localhost:8080` (or your tailnet hostname), send a chat message — completion streams back. Your LiteLLM log records the request.
- spark-01: Send a Telegram message — Hermes responds. End-to-end your-side path: Telegram → Hermes → your LiteLLM → vLLM TP=2.
Client side (spark-02)
- spark-02: `curl http://localhost:8001/v1/models -H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY"` returns `qwen3.5-122b` through the client's LiteLLM.
- spark-02: `curl http://localhost:8001/v1/chat/completions -H "Authorization: Bearer YOUR_CLIENT_MASTER_KEY" ...` returns a completion.
- spark-02: `sudo systemctl status litellm` — active (running). Logs are written to `~/sparky-ai-stack/logs/litellm.log` on `spark-02`, separate from your logs.
- spark-02: `docker ps` shows `vllm-ray-worker`, `open-webui`, and `n8n` all Up. (No Hermes container.)
- spark-02: Open `http://localhost:8080` (or the client's tailnet hostname), send a chat message — completion streams back. Client's LiteLLM log records the request; your LiteLLM log on spark-01 does not.
Cross-stack isolation (the negative tests)
- From client's tailnet: `curl http://YOUR_NODE1_TAILSCALE_IP:8001/v1/models` should fail (timeout or "no route to host"). Your LiteLLM is not exposed to the client's tailnet.
- From client's tailnet: `curl http://YOUR_NODE1_TAILSCALE_IP:8000/v1/models` should fail. The unauthenticated vLLM endpoint is not reachable from the client's tailnet.
- From client's tailnet: `curl http://YOUR_NODE1_TAILSCALE_IP:8080` should fail. Your Open WebUI is not reachable from the client's tailnet.
- From your tailnet: `curl http://YOUR_NODE2_TAILSCALE_IP:8080` should fail. The client's Open WebUI is not reachable from your tailnet.
- spark-01: `tail -f ~/sparky-ai-stack/logs/litellm.log` while the client sends a chat message → your log is silent. Client's traffic does not enter your stack.
DAC traffic during inference
- spark-02: `nload enp1s0f0np0` spikes into Gb/s during decoding from either side — both your and the client's inference requests traverse the DAC (yours for NCCL collectives, theirs for both the API call to `198.51.100.1:8000` and NCCL).
- Both sides simultaneously: have you and the client send a long prompt at the same time. Both completions should stream concurrently — vLLM's continuous batching handles the overlap. (A single-sided approximation is sketched below.)
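If coordinating with the client is awkward, you can approximate the concurrency test from your side alone — a sketch that fires two requests through your own LiteLLM in parallel (same continuous-batching path, just without the second proxy):

# Run on spark-01; both requests should stream back without serializing.
for i in 1 2; do
  curl -s http://localhost:8001/v1/chat/completions \
    -H "Authorization: Bearer YOUR_MASTER_KEY" \
    -H 'Content-Type: application/json' \
    -d '{"model":"qwen3.5-122b","messages":[{"role":"user","content":"Write 300 words about unified memory."}],"max_tokens":512}' \
    > /tmp/concurrent-$i.json &
done
wait
grep -o '"completion_tokens":[0-9]*' /tmp/concurrent-*.json   # both files should show non-trivial token counts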
Config validation
- spark-01 (your Hermes config): YAML validation passes — a parse error will silently break Hermes by falling back to `.env`. Check with:
python3 -c "import yaml; yaml.safe_load(open('/home/YOUR_USERNAME/.hermes/config.yaml')); print('YAML valid')"
Expected output: YAML valid
Port reference
spark-01 (your node)
| Port | Service |
|---|---|
| 8000 | vLLM OpenAI-compatible API (clustered head — unauthenticated; localhost + DAC peer only) |
| 8001 | Your LiteLLM proxy |
| 8080 | Your Open WebUI |
| 8787 | Hermes WebUI |
| 5678 | Your n8n |
| 6379 | Ray GCS (DAC only) |
| 5432 | litellm-db PostgreSQL |
spark-02 (client node)
| Port | Service |
|---|---|
| 8001 | Client LiteLLM proxy |
| 8080 | Client Open WebUI |
| 5678 | Client n8n |
| 5432 | Client litellm-db PostgreSQL |
File locations
spark-01 — your node
~/sparky-ai-stack/
├── Dockerfile.vllm-spark            # custom image: NGC vllm + ray
├── litellm-config.yaml              # YOUR LiteLLM — your master_key, localhost:8000
├── hermes-webui.yml                 # compose for your hermes-webui
├── n8n.yml                          # compose for your n8n
├── scripts/
│   └── vllm-head.sh                 # Ray head + vllm serve TP=2
└── logs/
    ├── litellm.db                   # YOUR SQLite request log
    └── litellm.log                  # YOUR text log
~/.hermes/ # YOUR Hermes config, memory, skills
~/workspace/ # YOUR Hermes file workspace
Docker volumes:
├── open-webui # YOUR Open WebUI database, knowledge bases, RAG
└── n8n_data # YOUR n8n flows + credentials
/etc/systemd/system/
├── litellm.service # YOUR LiteLLM auto-start
├── litellm.service.d/override.conf # PYTHONPATH only
└── hermes-gateway.service # YOUR Telegram gateway auto-start
spark-02 — client node (separate ownership)
~/sparky-ai-stack/
├── Dockerfile.vllm-spark            # identical custom image as spark-01
├── litellm-config.yaml              # CLIENT LiteLLM — client's master_key, points DAC → vLLM
├── n8n.yml                          # compose for client's n8n
├── scripts/
│   └── vllm-worker.sh               # Ray worker join
└── logs/
    ├── litellm.db                   # CLIENT SQLite log — separate corpus
    └── litellm.log                  # CLIENT text log
Docker volumes:
├── open-webui # CLIENT Open WebUI database, knowledge bases, RAG
└── n8n_data # CLIENT n8n flows + credentials
/etc/systemd/system/
└── litellm.service # CLIENT LiteLLM auto-start
Backup targets
| Node / Owner | Path | Contents |
|---|---|---|
| spark-01 — you | ~/sparky-ai-stack/logs/ | Your LiteLLM corpus |
| spark-01 — you | ~/.hermes/ | Your Hermes memory, sessions, skills |
| spark-01 — you | Docker volumes open-webui, n8n_data | Your UI state — chat history, knowledge bases, flows |
| spark-02 — client | ~/sparky-ai-stack/logs/ | Client's LiteLLM corpus (their backup, not yours) |
| spark-02 — client | Docker volumes open-webui, n8n_data | Client's UI state (their backup, not yours) |
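A minimal backup sketch for your side, assuming the paths and volume names from the table above (the client runs the equivalent on spark-02 for their own rows; destination path is illustrative):

#!/usr/bin/env bash
# Back up your spark-01 state to ~/backups/<date>/ — directories via rsync, Docker volumes via tar.
set -euo pipefail
DEST=~/backups/$(date +%F)
mkdir -p "$DEST"
rsync -a ~/sparky-ai-stack/logs/ "$DEST/litellm-logs/"
rsync -a ~/.hermes/ "$DEST/hermes/"
for vol in open-webui n8n_data; do
  docker run --rm -v "$vol":/data -v "$DEST":/backup alpine \
    tar czf "/backup/${vol}.tgz" -C /data .
done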
Cluster issues
Two-node-specific failures and their resolutions. Most of these were discovered during live deployment.
ray: not found inside the vLLM container
Symptom: The stock NGC container has no ray; ray start errors with "command not found".
Cause: nvcr.io/nvidia/vllm:26.04-py3 ships without Ray. vLLM 0.19.0+nv26.04 hard-requires Ray for any multi-node inference — torch.distributed.run is not a substitute, because vLLM validates Ray at engine init regardless of launch method.
Fix: Build the vllm-spark:26.04 image on both nodes (Step 1a) before any cluster launch attempt: FROM nvcr.io/nvidia/vllm:26.04-py3 + RUN pip install ray --quiet.

Placement group targets node:192.0.2.x not found
Cause: VLLM_HOST_IP is unset, so vLLM resolves the node's hostname to the mgmt IP (192.0.2.x) — but Ray registered each node on its DAC IP (198.51.100.x). The placement group spec then targets a node Ray has never seen.
Fix: Set VLLM_HOST_IP to each container's own DAC IP: on spark-01's head container, VLLM_HOST_IP=YOUR_NODE1_DAC_IP; on spark-02's worker container, VLLM_HOST_IP=YOUR_NODE2_DAC_IP. The Step 01 startup scripts set this for you.

Engine init hangs waiting for the placement group
Fix: Make sure ray status reports 2.0/2.0 GPU after the worker container's join completes before engine init begins. The Step 01 head script blocks vllm serve behind this check, so under normal operation you won't see it.

Worker loops on "head not yet reachable"
Symptom: docker logs vllm-ray-worker on spark-02 prints "head not yet reachable, retrying in 3s..." in a loop while the head is still loading.
Cause: The worker joins the head's GCS at YOUR_NODE1_DAC_IP:6379. If the head isn't up yet, the worker retries forever. This is the expected path — the Step 01 launch order is "worker first, then head" precisely so that the worker is already retrying when the head comes up. If the loop continues after the head is up, either (a) Ray never started on the head, (b) VLLM_HOST_IP on the head was set to a non-DAC IP so Ray bound to the wrong interface, or (c) a firewall blocks port 6379 over the DAC.
Fix: On spark-01: docker exec vllm-qwen-122b ray status — if Ray isn't running, restart the container. Confirm VLLM_HOST_IP=YOUR_NODE1_DAC_IP. From spark-02: nc -zv YOUR_NODE1_DAC_IP 6379 should connect; if not, check firewall rules on the DAC interface.

NCCL traffic on the mgmt LAN instead of the DAC
Symptom: During inference, nload enp1s0f0np0 stays idle; mgmt LAN sees Gb/s spikes instead.
Cause: NCCL_SOCKET_IFNAME didn't propagate into the container, so NCCL fell back to interface auto-detect and chose mgmt over DAC.
Fix: Set NCCL_SOCKET_IFNAME=enp1s0f0np0 and GLOO_SOCKET_IFNAME=enp1s0f0np0 as -e flags on both containers (the Step 01 scripts do this). Also confirm the iface name is identical on both nodes — if one node has it as enp1s0f1np0, NCCL will fail.

nvidia-smi memory.used returns [N/A] on GB10
Symptom: nvidia-smi --query-gpu=memory.used,memory.total --format=csv returns [N/A] in both fields. Memory monitoring scripts that depend on these fields silently report nothing.
Cause: GB10 uses unified memory; the memory.used / memory.total NVML fields are not populated on this hardware.
Fix: Run plain nvidia-smi and read the Processes section. With the 122B GPTQ-Int4 model loaded at --gpu-memory-utilization 0.80, you should see a RayWorkerWrapper process at roughly ~96 GB on each node (≈34 GB weights + KV cache). The 35B FP8 bootstrap model lands closer to ~97 GB.

SSH between nodes targets the wrong interface
Symptom: ssh spark-02 from spark-01 hangs or returns "connection refused", even though both nodes are reachable.
Cause: The hostname resolves to the DAC IP (198.51.100.x). Unless you've explicitly bound sshd to the DAC interface, SSH only listens on the mgmt LAN — but your client has resolved the hostname to the DAC IP.
Fix: Add mgmt-IP entries to /etc/hosts on both nodes (this is in Prerequisites): echo "YOUR_NODE1_MGMT_IP spark-01" | sudo tee -a /etc/hosts and likewise for spark-02. The HF cache rsync in Step 01b uses the DAC IP explicitly, but every other ssh spark-0X command relies on hostname resolution.

docker: permission denied on spark-02
Symptom: permission denied while trying to connect to the Docker daemon socket. Common on a freshly-imaged worker node where docker installed cleanly but the operator account isn't in the docker group yet.
Fix: Add the user to the docker group. On a fresh spark-02: sudo usermod -aG docker cameron && newgrp docker (substitute your actual username). The new group only takes effect after re-logging or running newgrp docker in the current shell. The Step 01 worker script will not run until this is done.

LiteLLM httpx.ConnectError on startup
Symptom: LiteLLM fails to start, throwing httpx.ConnectError against either Postgres or a model endpoint.
Cause: Either (1) a stale STORE_MODEL_IN_DB=True + DATABASE_URL in /etc/systemd/system/litellm.service.d/override.conf still pointing at a Postgres that no longer runs, or (2) litellm-config.yaml includes a model with api_base pointing at a dead port (e.g. localhost:8002 from a previous dual-model setup).
Fix: Reset the drop-in so it only sets PYTHONPATH (see Step 02). Audit litellm-config.yaml for any localhost:8002 or other dead endpoints and remove them — every model in model_list must have a live api_base.

RuntimeError: CUDA driver error: operation not permitted
Fix: docker exec vllm-qwen-122b rm -rf /tmp/torchinductor_root /root/.cache/vllm/torch_compile_cache/torch_aot_compile && docker restart vllm-qwen-122b

Client LiteLLM can't reach the shared vLLM over the DAC
Symptom: connection refused or connect timeout on every request. ~/sparky-ai-stack/logs/litellm.log on spark-02 shows httpx.ConnectError against 198.51.100.1:8000.
Fix: From spark-02: ping -c 3 YOUR_NODE1_DAC_IP (DAC reachable?), then nc -zv YOUR_NODE1_DAC_IP 8000. From spark-01: docker ps | grep vllm-qwen-122b (running?) and sudo ufw status | grep 8000. The Step 06 ufw rule should explicitly allow YOUR_NODE2_DAC_IP on port 8000 over enp1s0f0np0.

Port 8000 is reachable from a tailnet
Symptom: From a tailnet device, curl http://YOUR_NODE1_TAILSCALE_IP:8000/v1/models succeeds — meaning the client could bypass their LiteLLM and hit the unauthenticated vLLM directly.
Cause: Either (a) a Tailscale ACL rule exposes port 8000, (b) a ufw rule allows inbound on tailscale0 for port 8000, or (c) ufw is disabled.
Fix: Check the ACL for a tag:owner:8000 entry anywhere. Then on spark-01: sudo ufw status verbose — no rule should allow port 8000 on tailscale0. Only the explicit DAC-peer rule (enp1s0f0np0 from YOUR_NODE2_DAC_IP to any port 8000) should appear. If you find a leak, fix the ACL and ufw, then re-run the negative tests in the validation checklist.

The client's LiteLLM is reachable over the mgmt LAN
Symptom: You can curl http://YOUR_NODE2_MGMT_IP:8001/v1/models from spark-01 and get the client's LiteLLM, and it works — leaking the client's API surface to your mgmt LAN.
Cause: Both LiteLLMs bind 0.0.0.0:8001 so each is reachable from its own tailnet. Your firewall must restrict inbound on the mgmt-LAN interface to deny port 8001 cross-node — Tailscale ACLs alone don't help here because the mgmt LAN is not part of any tailnet.
Fix: The Step 06 ufw rules allow 8001 on tailscale0 only — confirm with sudo ufw status verbose that no rule allows 8001 on the wildcard interface or on the mgmt iface.

Requests appear in the wrong LiteLLM log
Symptom: You see requests in ~/sparky-ai-stack/logs/litellm.log on spark-01 that you didn't make. Or the client reports a request in their log that they didn't send.
Fix: Confirm the master_key values are unique on each node — they should never match. Then check each Open WebUI's API base URL: yours should be http://host.docker.internal:8001/v1 (your local LiteLLM); client's should be the same hostname (their local LiteLLM, not your tailnet IP). If you find cross-pointing, fix it and rotate both master keys.

Qwen3-235B-A22B-FP8 does not fit
Symptom: Loading Qwen/Qwen3-235B-A22B-FP8 at TP=2 fails. Ray kills the worker with an OutOfMemoryError at the ~95% memory threshold during weight loading; the head logs report a placement-group failure shortly after.
Fix: Use a GPTQ-Int4 quantization or the 122B production model — see the model table in Step 01.

AssertionError: block_size (2096) must be <= max_num_batched_tokens (2048)
Cause: --enable-prefix-caching is on, which sets block_size=2096. The default max_num_batched_tokens=2048 is one block too small.
Fix: Add --max-num-batched-tokens 4096 to the vllm serve command (the Step 01 head script does this). 4096 is the smallest power-of-two that satisfies the assertion with headroom.

Startup appears to stall after ray_env.py:111
Symptom: docker logs -f vllm-qwen-122b appears to stall after a line ending in ray_env.py:111. No new output for many minutes. The cluster looks frozen.
Cause: GPTQ-Marlin / torch.compile JIT compilation on first launch. The compile happens silently inside the RayWorkerWrapper process after weight loading completes — vLLM doesn't surface a progress line for it.
Fix: Wait it out (10–20 minutes on a cold cache) and watch nvidia-smi on each node: both should show a RayWorkerWrapper at roughly ~34 GB during compile and growing toward ~96 GB once KV cache initializes. The compiled artifacts are cached at /root/.cache/vllm/torch_compile_cache/, so subsequent startups skip this step entirely.

huggingface-cli is deprecated
Symptom: huggingface-cli download … prints a deprecation notice, or the command is missing on a fresh install of huggingface_hub.
Fix: Use hf download <model> --local-dir <path>. Step 01 has been updated to use hf; if you have an older snippet around, swap the binary name and the flags map cleanly.

permission denied when downloading to the HF cache
Symptom: hf download (or huggingface-cli download) fails with PermissionError: [Errno 13] Permission denied on a path under ~/.cache/huggingface/hub/.
Cause: A previous download was run with sudo or as root, which left the target directory root-owned. The current shell (running as cameron or your normal user) cannot write into it.
Fix: sudo rm -rf <target_dir>, then recreate the directory and rerun the hf download command as your normal user (not root). All hf download commands in this guide assume the non-root user.

Model deliberates at length on simple queries
Cause: Thinking was enabled at the chat-template level — e.g. preserve_thinking: true (or any other flag that exposes the visible CoT track). The Qwen3.5 family produces extended reasoning whenever thinking is enabled at template time, and conversational queries trigger excessive deliberation under that setting.
Fix: Keep --default-chat-template-kwargs '{"enable_thinking": false}' on the vllm serve command (the Step 01 head script does this). Users can opt into extended reasoning per-message by prefixing a prompt with /think. This matches the behavior the recommended Open WebUI system prompt (Step 05) is calibrated for.

table public.LiteLLM_UserTable does not exist
Symptom: The LiteLLM Admin UI errors because Postgres has no LiteLLM_* table in the public schema.
Fix: Run prisma db push against the LiteLLM Prisma schema: DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" prisma db push --schema /home/YOUR_USERNAME/.local/lib/python3.12/site-packages/litellm/proxy/schema.prisma — then restart LiteLLM. See Step 2d / 3d.

Admin UI reports "Not connected to DB"
Symptom: The UI cannot see the database even though Postgres is up on localhost:5432.
Cause: Either (a) database_url is at the top level of litellm-config.yaml instead of nested under general_settings — LiteLLM only reads it from general_settings, or (b) the config still points at SQLite. SQLite is not supported for the UI; the Prisma schema is hardcoded for PostgreSQL.
Fix: Move database_url under general_settings:
general_settings:
  database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
and restart LiteLLM. Confirm with journalctl -u litellm -n 50 --no-pager — startup should report a successful Postgres connection.

Virtual key not found in LiteLLM_VerificationTokenTable
Symptom: Requests with an existing virtual key fail with a VerificationToken not found error, but the key is still visible in the Virtual Keys tab.
Cause: The key was generated against a previous database (or before prisma db push, which dropped/recreated the verification token table).
Fix: Delete the stale key in the UI, restart LiteLLM (sudo systemctl restart litellm), then generate a fresh key. The new key will be inserted into the current schema and will validate correctly.

Other known issues
[Errno 98] address already in use
Cause: A stale LiteLLM process is still holding the port.
Fix: pkill -f litellm && sudo systemctl restart litellm

hermes: command not found
Fix: source ~/.bashrc or call the binary by its full path: /home/YOUR_USERNAME/.local/bin/hermes

sudo: hermes: command not found
Fix: sudo /home/YOUR_USERNAME/.local/bin/hermes gateway install --system

MCP stdio servers fail health check (npx not in LiteLLM's PATH)
Cause: Node was installed via nvm or a user-local installer, which places binaries in ~/.local/bin and adds that path to the user's shell rc file (.bashrc). LiteLLM — running as a systemd service — never sources .bashrc, so it gets a clean environment with no ~/.local/bin in PATH and cannot find npx when it tries to spawn the stdio MCP server process.
Fix: sudo ln -s /home/YOUR_USERNAME/.local/bin/node /usr/local/bin/node and sudo ln -s /home/YOUR_USERNAME/.local/bin/npx /usr/local/bin/npx. Alternatively, install Node system-wide via NodeSource: curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash - && sudo apt install -y nodejs. After either fix, retry the health check in the LiteLLM UI — it should pass.

Hermes ignores config.yaml changes
Symptom: After editing ~/.hermes/config.yaml, Hermes seems to ignore the changes — MCP is broken and the provider reverts to default. No obvious error in the logs.
Cause: The YAML is invalid, so Hermes silently falls back to .env values. No error is shown at startup, so the misconfiguration is invisible.
Fix: Validate before restarting: python3 -c "import yaml; yaml.safe_load(open('/home/YOUR_USERNAME/.hermes/config.yaml')); print('YAML valid')". If the command prints anything other than YAML valid, fix the syntax before restarting.

MCP handshake failures caused by an mcp_servers block
Symptom: Errors referencing mcp_servers or a failed MCP client handshake. Logs reference an asyncio error or a TCP/stdio connection that never opened.
Cause: Hermes in this stack does not need a mcp_servers block in ~/.hermes/config.yaml — it talks to LiteLLM's MCP endpoint over the LiteLLM API. A mcp_servers block in Hermes config is only correct in a direct-MCP (non-LiteLLM-proxied) topology, and on this stack it points Hermes at servers it can't reach.
Fix: Remove the mcp_servers block from ~/.hermes/config.yaml entirely. Validate the YAML, then restart: systemctl --user restart hermes (or whichever supervisor you use). MCP tool calls continue to work because LiteLLM is in the path.

custom_providers mis-indented in ~/.hermes/config.yaml
Cause: Entries under custom_providers must be indented exactly 2 spaces. A common mistake is using 4 spaces or no indentation.
Fix: Use this shape:
custom_providers:
  - name: MyProvider
    base_url: http://localhost:8001/v1
    model: my-model
Clustered Open WebUI / n8n (HA notes)
Both Open WebUI and n8n have HA modes available, but for a two-node home/lab setup the operational complexity is not worth it. This stack runs them as single instances on spark-02. If you ever want to pursue HA, here are the pointers.
Open WebUI HA
- Switch the open-webui container's storage from a Docker volume to a Postgres backend (env: DATABASE_URL=postgresql://...) and a shared filesystem for uploads and RAG documents (a compose sketch follows this list).
- Run multiple replicas behind a TCP load balancer. Sticky sessions are recommended for SSE chat streams.
- Postgres can sit on either node; if you put it on spark-01 you'll re-introduce the very latency-disturbance pattern this architecture is designed to avoid. Prefer a third small box or a dedicated HA pair.
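A minimal compose sketch of the Postgres-backed variant described above. Everything here is illustrative: the service names, the db-host placeholder, and the hypothetical shared mount /mnt/shared/open-webui are assumptions, and the env/volume names follow Open WebUI's documented Postgres support, so verify them against the version you deploy.
services:
  open-webui-1:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      # External Postgres instead of the bundled volume-backed storage
      - DATABASE_URL=postgresql://openwebui:CHANGE_ME@db-host:5432/openwebui
    volumes:
      # Shared filesystem so uploads and RAG documents are visible to every replica (hypothetical mount)
      - /mnt/shared/open-webui:/app/backend/data
  open-webui-2:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - DATABASE_URL=postgresql://openwebui:CHANGE_ME@db-host:5432/openwebui
    volumes:
      - /mnt/shared/open-webui:/app/backend/data
# Both replicas sit behind a TCP load balancer with sticky sessions (not shown).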
n8n HA
- n8n's queue mode requires Postgres for state and Redis for the BullMQ queue. The main container becomes the main instance; one or more worker instances pull jobs off the queue.
- Set EXECUTIONS_MODE=queue, QUEUE_BULL_REDIS_HOST=…, DB_TYPE=postgresdb, and the relevant Postgres env vars on every container. Replicas need the same N8N_ENCRYPTION_KEY (a sketch follows this list).
- For a two-node setup the simplest variant is one main on spark-02 and one worker on a third small box, with Postgres + Redis colocated on the third box.
- Webhook traffic should hit only the main container; long-running executions land on workers transparently.
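A sketch of the env split for queue mode, assuming a hypothetical third box reachable as third-box.lan hosting Postgres and Redis. The variable names are n8n's documented queue-mode settings, but check them against the n8n version you deploy before relying on this.
services:
  n8n-main:
    image: n8nio/n8n
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=third-box.lan
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=third-box.lan
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=n8n
      - DB_POSTGRESDB_PASSWORD=CHANGE_ME
      # Must be identical on every replica so stored credentials stay decryptable
      - N8N_ENCRYPTION_KEY=SAME_KEY_ON_EVERY_CONTAINER
  n8n-worker:
    image: n8nio/n8n
    command: worker   # workers pull executions off the BullMQ queue
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=third-box.lan
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=third-box.lan
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=n8n
      - DB_POSTGRESDB_PASSWORD=CHANGE_ME
      - N8N_ENCRYPTION_KEY=SAME_KEY_ON_EVERY_CONTAINER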
Architecture
Desktop and mobile clients run Obsidian with Syncthing for bidirectional vault sync. The vault syncs to a dedicated Proxmox LXC over the local network. When traveling, Tailscale bridges the connection — clients queue changes offline and sync when Tailscale is enabled on both ends.
Desktop Client (Obsidian + Syncthing) ─┐
Mobile Client (Obsidian + Syncthing) ─┴─→ Syncthing ─→ LXC /vault
                                                          ↓
                                                obsidian-mcp (stdio)
                                                          ↓
                                     supergateway (streamableHttp, port 3000)
                                                          ↓
                                       LiteLLM ([YOUR-AI-SERVER-HOSTNAME])
                                                          ↓
                                        Chat clients / API consumers
LXC setup
| Setting | Value |
|---|---|
| OS | Ubuntu 24.04 LTS |
| Hostname | [YOUR-LXC-HOSTNAME] |
| IP | [YOUR-LXC-IP] on VLAN [YOUR-VLAN-ID] |
| Vault path | /vault with .obsidian/app.json stub (required for MCP server vault validation) |
| Services | syncthing@root and obsidian-mcp — both enabled as systemd services |
Setup script
A single script provisions the full stack from a fresh Ubuntu 24.04 LXC.
#!/bin/bash
set -e
# ── Configuration ─────────────────────────────────────────────
SYNCTHING_USER="root" # Change if running as non-root user
# ──────────────────────────────────────────────────────────────
# 1. Install dependencies
apt update
apt install -y curl gpg apt-transport-https
# 2. Install Node.js 22 + npm
curl -fsSL https://deb.nodesource.com/setup_22.x | bash -
apt install -y nodejs
# 3. Install Syncthing
curl -fsSL https://syncthing.net/release-key.gpg | gpg --dearmor -o /usr/share/keyrings/syncthing-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/syncthing-archive-keyring.gpg] https://apt.syncthing.net/ syncthing stable" > /etc/apt/sources.list.d/syncthing.list
apt update
apt install -y syncthing
# 4. Create vault directory with Obsidian config
mkdir -p /vault/.obsidian
echo '{}' > /vault/.obsidian/app.json
# 5. Enable and start Syncthing
systemctl enable syncthing@${SYNCTHING_USER}
systemctl start syncthing@${SYNCTHING_USER}
# 6. Wait for Syncthing config to generate
sleep 8
# 7. Expose Syncthing GUI on all interfaces
CONFIG_PATH=$(find /root -name "config.xml" 2>/dev/null | grep syncthing | head -1)
sed -i 's|<address>127.0.0.1:8384</address>|<address>0.0.0.0:8384</address>|' "$CONFIG_PATH"
systemctl restart syncthing@${SYNCTHING_USER}
# 8. Install obsidian-mcp and supergateway
npm install -g obsidian-mcp supergateway
# 9. Create obsidian-mcp systemd service
cat > /etc/systemd/system/obsidian-mcp.service << 'EOF'
[Unit]
Description=Obsidian MCP Server
After=network.target
[Service]
Type=simple
User=root
ExecStart=supergateway --stdio "obsidian-mcp /vault" --port 3000 --outputTransport streamableHttp --stateful --protocolVersion 2025-11-25
Restart=on-failure
RestartSec=10
MemoryMax=512M
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable obsidian-mcp
systemctl start obsidian-mcp
echo "Syncthing UI: http://$(hostname -I | awk '{print $1}'):8384"
echo "MCP endpoint: http://$(hostname -I | awk '{print $1}'):3000/mcp"
Manual steps after script
- Open Syncthing UI at http://[YOUR-LXC-IP]:8384
- Remove the default folder, add /vault, set a GUI password
- Pair each client device — add the LXC device ID in the client Syncthing UI, accept the incoming request on the LXC side, share the vault folder
- In LiteLLM UI → MCP Servers → Add: URL http://[YOUR-LXC-IP]:3000/mcp, transport http, auth None
- In Hermes ~/.hermes/config.yaml, add under mcp_servers: obsidian: url: http://[YOUR-LXC-IP]:3000/mcp and transport: streamableHttp (a full YAML sketch follows this list)
- Mandatory verification — run this before proceeding. If it fails, the MCP server is not reachable and the next steps will not work:
hermes mcp test obsidian
Expected output: ✓ Connected and ✓ Tools discovered: 11. If this fails, check that the obsidian-mcp systemd service is running on the LXC (systemctl status obsidian-mcp), that port 3000 is reachable from the DGX, and that the URL in config.yaml is correct.
- Update ~/.hermes/skills/note-taking/obsidian/SKILL.md with the correct MCP tool names (create the directory if it doesn't exist). The built-in Hermes obsidian skill will use filesystem commands unless this file explicitly names the MCP tools:
mkdir -p ~/.hermes/skills/note-taking/obsidian
cat > ~/.hermes/skills/note-taking/obsidian/SKILL.md << 'EOF'
# Obsidian Vault Skill
Always use MCP tools for all vault operations. Never use filesystem commands.
## Required: discover vault name first
Always call `list-available-vaults` before any other tool to discover the vault name. Do not assume a vault name — always look it up.
## Available MCP tools
- list-available-vaults — discover vault name (call first, every time)
- search-vault — full-text search across all notes
- read-note — read a specific note by path
- create-note — create a new note
- edit-note — edit an existing note
- move-note — move or rename a note
- delete-note — delete a note
- create-directory — create a folder in the vault
- add-tags — add tags to a note
- remove-tags — remove tags from a note
- rename-tag — rename a tag across all notes
EOF
- Add mcp to the telegram platform toolset in ~/.hermes/config.yaml and restart hermes-gateway
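For reference, a sketch of the two config.yaml additions from the list above, assuming the key names used elsewhere in this guide (mcp_servers, platform_toolsets). The exact nesting may differ across Hermes versions, so treat this as illustrative rather than canonical, and run the YAML validation one-liner before restarting hermes-gateway.
mcp_servers:
  obsidian:
    url: http://[YOUR-LXC-IP]:3000/mcp
    transport: streamableHttp

platform_toolsets:
  telegram:
    - hermes-telegram
    - mcp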
Services reference
| Service | Command | Port |
|---|---|---|
| Syncthing | systemctl status syncthing@root | 8384 |
| Obsidian MCP | systemctl status obsidian-mcp | 3000 |
LiteLLM MCP configuration
| Field | Value |
|---|---|
| MCP Server URL | http://[YOUR-LXC-IP]:3000/mcp |
| Transport | http |
| Auth | None |
Bugs & fixes encountered
Editing /root/.local/share/syncthing/config.xml fails — file not found
Cause: Syncthing's config lives at /root/.local/state/syncthing/config.xml, not /root/.local/share/syncthing/config.xml as documented elsewhere.
Fix: The setup script uses find to locate the actual path dynamically.

Error: Not a valid Obsidian vault
Symptom: obsidian-mcp refuses to start against an empty directory.
Cause: obsidian-mcp requires a .obsidian directory with app.json present to consider a directory a valid vault. An empty directory fails.
Fix: mkdir -p /vault/.obsidian && echo '{}' > /vault/.obsidian/app.json

Connection closed errors on SSE transport
Symptom: Connection closed errors when LiteLLM tries to connect to the MCP server using SSE transport.
Cause: LiteLLM expects MCP protocol version 2025-11-25. SSE transport is incompatible with this version.
Fix: Use --outputTransport streamableHttp in the supergateway command and set transport to http in the LiteLLM UI.

obsidian-mcp child processes accumulate under load
Symptom: Child processes pile up until pthread_create: Resource temporarily unavailable and the server crashes.
Cause: Without --stateful, supergateway spawns one obsidian-mcp child process per HTTP request. Under load these accumulate and exhaust thread limits.
Fix: Add the --stateful flag to keep one persistent child process instead of spawning per request.

Protocol version mismatch
Symptom: LiteLLM requests protocolVersion: 2025-11-25, the server responds with 2024-11-05, then LiteLLM sends SIGTERM and closes the session.
Fix: Add --protocolVersion 2025-11-25 to the supergateway command so the handshake version matches what LiteLLM expects.

MCP tools not loaded for Telegram sessions
Cause: platform_toolsets for telegram only includes hermes-telegram by default, so MCP tools are not loaded for Telegram sessions.
Fix: Add mcp to the telegram list in ~/.hermes/config.yaml and restart hermes-gateway.

Hermes breaks after editing ~/.hermes/config.yaml
Symptom: Hermes misbehaves after an edit to ~/.hermes/config.yaml. Or worse — Hermes starts but silently falls back to .env values, breaking MCP and provider config with no obvious error.
Fix: Validate before restarting: python3 -c "import yaml; yaml.safe_load(open('/home/YOUR_USERNAME/.hermes/config.yaml')); print('YAML valid')" — expected output: YAML valid. If an exception is raised instead, fix the indentation and re-run until the validation passes before restarting any service.

Hermes uses filesystem commands instead of MCP tools
Cause: Hermes falls back to filesystem commands (cat, find) to read vault files unless the SKILL.md explicitly references the correct MCP tool names. Adding the MCP server to config.yaml alone is not sufficient — the skill must also be rewritten to use the MCP tools exposed by obsidian-mcp.
Fix: Run hermes mcp test obsidian first to confirm connectivity and discover the exact tool names. Then update ~/.hermes/skills/note-taking/obsidian/SKILL.md to explicitly list the MCP tool names: list-available-vaults, search-vault, read-note, create-note, edit-note, move-note, delete-note, create-directory, add-tags, remove-tags, rename-tag. The SKILL.md must instruct Hermes to always call list-available-vaults first before any other tool call. See the setup steps above for the full SKILL.md content.
Web Search Tool Use
Adds live web search as a callable tool — the AI model can run Brave Search queries during a conversation in response to tool calls from Open WebUI and other clients. LiteLLM spawns the server on demand via stdio using an API key you supply.
Step 1 — Get a Brave Search API key
Go to api.search.brave.com, create a free account, and generate an API key under the Data for AI plan (free tier supports up to 2,000 queries/month).
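Optionally sanity-check the key before wiring it into LiteLLM. This is a sketch that assumes Brave's current web-search endpoint and X-Subscription-Token header; adjust if the API has changed since this was written.
curl -s "https://api.search.brave.com/res/v1/web/search?q=test" \
  -H "Accept: application/json" \
  -H "X-Subscription-Token: YOUR_BRAVE_API_KEY" | head -c 300
# A JSON payload with web results means the key is valid; an error object means it is not.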
Step 2 — Confirm Node.js is installed system-wide
The MCP server is launched via npx. If you completed the Playwright MCP setup, Node.js is already installed system-wide and this step is done. Otherwise:
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs
which npx — expected output: /usr/bin/npx
Step 3 — Add Brave Search MCP server in LiteLLM UI
Navigate to http://YOUR_NODE1_MGMT_IP:8001/ui → MCP Servers → Add New MCP Server. (LiteLLM lives on spark-01.)
| Field | Value |
|---|---|
| Name | brave-search |
| Alias | brave-search |
| Transport Type | Standard Input/Output (stdio) |
Set Stdio Configuration (JSON) — replace YOUR_BRAVE_API_KEY with your actual key:
{
"command": "npx",
"args": [
"-y",
"@modelcontextprotocol/server-brave-search"
],
"env": {
"BRAVE_API_KEY": "YOUR_BRAVE_API_KEY"
}
}
Save and confirm Health Status shows Healthy.
The health check only passes if npx resolves to /usr/bin/npx (the system-wide install) and not a user-local path. See the Known Issues section — MCP stdio servers fail health check — for the full diagnosis.
Validation
In Open WebUI, send the following prompt:
Expected: the model calls the brave_web_search tool (shown as "Explored" in Open WebUI) and returns a summary drawn from live search results.
Browser Automation
Adds browser automation tool use to the stack — the AI model can navigate pages, take screenshots, and scrape content via tool calls in Open WebUI and other clients. LiteLLM spawns a headless Chromium process on demand via stdio; no persistent port is required.
Step 1 — Install Node.js system-wide
LiteLLM runs as a systemd service and does not source .bashrc. Node.js must be installed system-wide so npx is available in LiteLLM's PATH.
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs
which npx — expected output: /usr/bin/npx
If Node was previously installed via nvm or a user-local installer, this step replaces it with a system-wide install. The Known Issues section documents the npx PATH problem in detail.
Step 2 — Install Playwright MCP Chromium browser
Chrome has no ARM64 build. Use Chromium, installed via the @playwright/mcp package's own browser installer — not via npx playwright install:
npx @playwright/mcp install-browser chromium
ls ~/.cache/ms-playwright/ — expected: a chromium-XXXX directory is present.
Step 3 — Update litellm-config.yaml
Add model_info blocks to all model entries. Without these, LiteLLM does not advertise function calling support and tool calls will not execute.
model_list:
- model_name: qwen3.5-122b
litellm_params:
model: openai/qwen3.5-122b
api_base: http://localhost:8000/v1
api_key: "not-needed"
max_tokens: 8192
model_info:
supports_function_calling: true
supports_tool_choice: true
Restart LiteLLM after saving:
sudo systemctl restart litellm
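Before adding the MCP server, you can confirm that tool calls survive the LiteLLM hop with a direct request. This is a sketch assuming the qwen3.5-122b model name from the config above, LiteLLM on port 8001, and the sk-yourkey master key from the Admin UI section; get_utc_time is a throwaway function that exists only in this request.
curl -s http://localhost:8001/v1/chat/completions \
  -H "Authorization: Bearer sk-yourkey" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-122b",
    "messages": [{"role": "user", "content": "What is the current UTC time?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_utc_time",
        "description": "Return the current UTC time",
        "parameters": {"type": "object", "properties": {}}
      }
    }]
  }'
# If function calling is advertised correctly, the response should usually contain a tool_calls entry rather than a plain-text answer.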
Step 4 — Add Playwright MCP server in LiteLLM UI
Navigate to http://YOUR_NODE1_MGMT_IP:8001/ui → MCP Servers → Add New MCP Server. (LiteLLM lives on spark-01.)
| Field | Value |
|---|---|
| Name | playwright |
| Alias | playwright |
| Transport Type | Standard Input/Output (stdio) |
Set Stdio Configuration (JSON):
{
"command": "npx",
"args": [
"-y",
"@playwright/mcp@latest",
"--browser",
"chromium",
"--headless"
]
}
Save and confirm Health Status shows Healthy.
Validation
In Open WebUI, send the following prompt:
Expected: the model calls the navigate and screenshot tools (shown as "Explored" in Open WebUI) and returns a summary of the page.
LiteLLM Admin UI
The LiteLLM proxy ships with a built-in web UI at /ui. It requires a master key and a PostgreSQL database — SQLite is not supported for the UI auth layer. The following documents every error encountered during setup, in order.
Step 1 — Set a master key
All commands below run on spark-01 (where LiteLLM lives). Add to ~/sparky-ai-stack/litellm-config.yaml:
general_settings:
master_key: sk-yourkey
database_url: "postgresql://litellm:litellm@localhost:5432/litellm"
Generate a secure key:
echo "sk-$(openssl rand -hex 16)"
Step 2 — Add PostgreSQL to a compose file on spark-01
Run Postgres on the same node as LiteLLM. Putting it on spark-02 would re-introduce the very latency-disturbance pattern this architecture is designed to avoid. Add to a new ~/sparky-ai-stack/litellm-db.yml on spark-01:
services:
  litellm-db:
    image: postgres:16
    container_name: litellm-db
    restart: unless-stopped
    environment:
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=litellm
      - POSTGRES_DB=litellm
    ports:
      - "5432:5432"
    volumes:
      - litellm_db:/var/lib/postgresql/data

volumes:
  litellm_db:
docker compose -f ~/sparky-ai-stack/litellm-db.yml up -d litellm-db
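A quick check that the database is actually accepting connections, assuming the container name and credentials from the compose file above:
docker exec litellm-db pg_isready -U litellm -d litellm
# Expected output ends in "accepting connections"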
restart: unless-stopped combined with sudo systemctl enable docker ensures the container survives reboots automatically — no additional systemd unit needed.
Step 3 — Install Prisma
LiteLLM uses Prisma as its database ORM. It is not included in the base pip install litellm package.
pip install prisma --break-system-packages
--break-system-packages bypasses a Python 3.12 restriction that prevents pip from installing into the system Python environment. It is safe on a dedicated AI server where system tools do not depend on conflicting packages.
Step 4 — Generate Prisma binaries
After installing the package, the binaries must be generated from LiteLLM's bundled schema:
cd ~/.local/lib/python3.12/site-packages/litellm/proxy
prisma generate --schema schema.prisma
Step 5 — Apply the database schema
The Postgres database exists but has no tables yet. Push the schema. DATABASE_URL must be passed inline — Prisma reads it directly from the environment, not from litellm-config.yaml.
DATABASE_URL="postgresql://litellm:litellm@localhost:5432/litellm" \
prisma db push --schema schema.prisma
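To confirm the push actually created the tables, list them from inside the Postgres container. A sketch assuming the litellm-db container and credentials from Step 2:
docker exec litellm-db psql -U litellm -d litellm -c '\dt'
# Expected: a list of LiteLLM_* tables, including LiteLLM_UserTable and LiteLLM_VerificationToken.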
Step 6 — Restart LiteLLM
sudo systemctl daemon-reload
sudo systemctl restart litellm
sudo systemctl status litellm
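As an end-to-end check, hit the proxy's readiness endpoint. This assumes LiteLLM exposes /health/readiness in the version you're running and that the proxy listens on port 8001:
curl -s http://localhost:8001/health/readiness
# A healthy response reports the proxy as ready and the database as connected.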
Errors encountered in order
| Error | Cause | Fix |
|---|---|---|
| Authentication Error, Not connected to DB | No PostgreSQL configured | Add database_url to general_settings |
| ModuleNotFoundError: No module named 'prisma' | Prisma not installed | pip install prisma --break-system-packages |
| Unable to find Prisma binaries | prisma generate not run | Run prisma generate --schema schema.prisma |
| The table 'public.LiteLLM_UserTable' does not exist | Schema not applied to DB | Run prisma db push --schema schema.prisma |
Accessing the UI
Navigate to http://YOUR_NODE1_MGMT_IP:8001/ui. Username: admin. Password: your master_key value.